About — TruthLens Fake News Detector

📌 Project Overview

TruthLens is an end-to-end machine learning project that addresses two critical challenges in modern information ecosystems: detecting misinformation and predicting content virality. The system combines Natural Language Processing (NLP), deep learning, and ensemble methods to provide real-time analysis of news articles.

The project implements a dual-model pipeline: a Fake News Detection module using BERT, BiLSTM, and ensemble classifiers trained on the ISOT, LIAR, and FakeNewsNet datasets; and a Popularity Prediction module using XGBoost and Random Forest trained on the UCI Online News Popularity dataset with 58 engineered features.

🔬 Research Background

Fake news detection has become one of the most actively studied NLP problems of the past decade. Recent research (Nature Scientific Reports, 2024 [5]) reports that transformer-based models such as BERT tend to achieve the highest detection accuracy, often exceeding 99% on benchmark corpora, while recurrent models reach about 98% (LSTM) and 93% (GRU).

News popularity prediction uses the UCI Online News Popularity dataset (Fernandes et al., 2015), which contains 39,797 Mashable articles described by 58 predictive features covering content length, lexical diversity, sentiment scores, keyword density, and publication timing. Random Forest and XGBoost models reach up to 88% accuracy on this task.
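
When popularity is treated as a classification task, raw share counts first have to be turned into labels. The following is a minimal sketch of that step. It assumes a 1,400-share cutoff, which is a common choice for this dataset but is not confirmed as the threshold TruthLens uses; `POPULARITY_THRESHOLD` and `label_popularity` are hypothetical names.

```python
# Assumed cutoff: 1,400 shares is a common choice for this dataset,
# not a value confirmed by this project.
POPULARITY_THRESHOLD = 1400


def label_popularity(shares: int, threshold: int = POPULARITY_THRESHOLD) -> int:
    """Return 1 (popular) if the share count reaches the threshold, else 0."""
    if shares < 0:
        raise ValueError("share count cannot be negative")
    return 1 if shares >= threshold else 0


# Illustrative share counts, not rows taken from the dataset.
share_counts = [593, 1400, 2500, 711]
labels = [label_popularity(s) for s in share_counts]
print(labels)
```

Moving the threshold changes the class balance, so it is worth fixing before any accuracy figures are compared.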

This project implements a hybrid approach combining both tasks in a unified Flask API, enabling simultaneous authenticity verification and virality forecasting for any given news article.
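
The combined response of such an API can be sketched without Flask itself. The function below is only an illustration of how one request might run both models and merge the results into a JSON-serializable payload. `analyze_article`, the two scorer callables, the 0.5 decision thresholds, and the response field names are all assumptions, not the project's actual interface.

```python
import json
from typing import Callable, Dict


def analyze_article(
    text: str,
    fake_probability: Callable[[str], float],
    popularity_probability: Callable[[str], float],
) -> Dict[str, Dict[str, object]]:
    """Run both (hypothetical) scorers on one article and merge the results."""
    if not text.strip():
        raise ValueError("article text is empty")
    p_fake = fake_probability(text)
    p_popular = popularity_probability(text)
    return {
        "authenticity": {
            # Assumed decision threshold of 0.5
            "label": "fake" if p_fake >= 0.5 else "real",
            "fake_probability": round(p_fake, 4),
        },
        "popularity": {
            "label": "popular" if p_popular >= 0.5 else "unpopular",
            "popular_probability": round(p_popular, 4),
        },
    }


# Stand-in scorers so the sketch runs without trained models.
result = analyze_article(
    "Scientists confirm water is wet.",
    fake_probability=lambda t: 0.12,
    popularity_probability=lambda t: 0.81,
)
print(json.dumps(result, indent=2))
```

A Flask route would typically wrap a function like this, reading the article text from the request body and returning the dictionary as JSON.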

🤖 Machine Learning Models

Model	Task	Accuracy	Framework	Key Feature
BERT (bert-base-uncased)	Fake Detection	99.9%	HuggingFace Transformers	Bidirectional contextual embeddings
BiLSTM + Attention	Fake Detection	98.0%	TensorFlow / Keras	Sequential context, both directions
RoBERTa	Fake Detection	99.5%	HuggingFace Transformers	Robustly optimized BERT pretraining
Logistic Regression + TF-IDF	Fake Detection	98.7%	Scikit-learn	Fast, interpretable baseline
Random Forest	Fake Detection + Popularity	96.5%	Scikit-learn	Ensemble, feature importance
XGBoost	Popularity Prediction	88.0%	XGBoost	Gradient boosting, 58 features
Ensemble Voting (BERT+BiLSTM+RF)	Fake Detection	99.5%	Custom	Soft voting over model probabilities
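
Soft voting, as used by the ensemble row above, averages the class probabilities produced by each model and picks the class with the highest mean. The sketch below shows the idea in plain Python. The function name `soft_vote`, the optional weights, and the example probabilities are illustrative assumptions, not the project's implementation or measured outputs.

```python
from typing import List, Optional, Sequence


def soft_vote(
    probabilities: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return the (optionally weighted) mean probability for each class."""
    if not probabilities:
        raise ValueError("need probabilities from at least one model")
    n_classes = len(probabilities[0])
    if any(len(p) != n_classes for p in probabilities):
        raise ValueError("all models must score the same classes")
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


# Illustrative [P(real), P(fake)] outputs for one article, not measured values.
bert = [0.10, 0.90]
bilstm = [0.40, 0.60]
random_forest = [0.55, 0.45]

averaged = soft_vote([bert, bilstm, random_forest])
predicted_class = max(range(len(averaged)), key=averaged.__getitem__)
print(averaged, predicted_class)
```

Unlike majority (hard) voting, this lets a confident model outweigh two uncertain ones, which is one reason soft voting is often preferred when the base models output calibrated probabilities.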

📦 Datasets Used

Dataset	Size	Task	Source	Labels
ISOT Fake News Dataset	44,898 articles	Fake Detection	Reuters + Flagged sites	Real / Fake
LIAR Dataset	12,836 statements	Fake Detection	PolitiFact	6-class (pants-fire → true)
FakeNewsNet (PolitiFact)	~12,000 articles	Fake Detection	PolitiFact + GossipCop	Real / Fake + social context
COVID-19 Fake News	~10,000 articles	Fake Detection	Multiple sources	Real / Fake
UCI Online News Popularity	39,797 articles	Popularity Prediction	Mashable (2013–2015)	Share count (regression/classification)
Fakeddit	1,063,106 samples	Multimodal Fake Detection	Reddit	6-way veracity labels

⚙️ Feature Engineering (58 Features)

The popularity prediction model uses 58 engineered features grouped into 6 categories. The table lists the main features in each category:

Category	Features
Content Length	n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, average_token_length
Links & Media	num_hrefs, num_self_hrefs, num_imgs, num_videos
Keywords & Topics	num_keywords, data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, data_channel_is_world
Sentiment & Subjectivity	global_subjectivity, global_sentiment_polarity, global_rate_positive_words, global_rate_negative_words, title_subjectivity, title_sentiment_polarity, avg_positive_polarity, avg_negative_polarity
Publication Timing	weekday_is_monday through weekday_is_sunday, is_weekend, timedelta (marked non-predictive in the UCI documentation)
LDA Topics	LDA_00 through LDA_04 (5 latent topic distributions)
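
To make the feature names above concrete, here is a small sketch that derives a few of the content-length and publication-timing features from raw text and a publication date. It is an approximation under stated assumptions: the regex tokenizer, the helper names `tokenize` and `basic_features`, and the treatment of `n_unique_tokens` as a rate (unique tokens divided by total tokens) are choices made for this example, and the dataset's own extraction may differ.

```python
import re
from datetime import date
from typing import Dict, List

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def basic_features(title: str, content: str, published: date) -> Dict[str, float]:
    """Compute a handful of content and timing features for one article."""
    title_tokens = tokenize(title)
    tokens = tokenize(content)
    n = len(tokens)
    features = {
        "n_tokens_title": float(len(title_tokens)),
        "n_tokens_content": float(n),
        # Assumed definition: share of distinct tokens among all tokens.
        "n_unique_tokens": len(set(tokens)) / n if n else 0.0,
        "average_token_length": sum(map(len, tokens)) / n if n else 0.0,
    }
    # One-hot weekday flags; date.weekday() returns 0 for Monday.
    for index, day in enumerate(WEEKDAYS):
        features[f"weekday_is_{day}"] = 1.0 if published.weekday() == index else 0.0
    features["is_weekend"] = 1.0 if published.weekday() >= 5 else 0.0
    return features


features = basic_features("Big news today", "The cat sat on the mat", date(2014, 1, 4))
print(features)
```

Sentiment, keyword, and LDA features need external models or lexicons and are left out of this sketch.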

🛠️ Technology Stack

Layer	Technology	Purpose
Frontend	HTML5, CSS3, Vanilla JavaScript	UI, animations, interactive demo
Backend	Python 3.11, Flask 3.x	REST API, model serving, routing
NLP Preprocessing	NLTK, spaCy, re	Tokenization, lemmatization, cleaning
Text Vectorization	TF-IDF (sklearn), Word2Vec (gensim)	Feature extraction for classical ML
Deep Learning	TensorFlow 2.x / Keras	LSTM, BiLSTM, CNN-LSTM models
Transformers	HuggingFace Transformers, PyTorch	BERT, RoBERTa fine-tuning
Classical ML	Scikit-learn	LR, RF, SVM, Naive Bayes
Gradient Boosting	XGBoost, LightGBM	Popularity prediction
Explainability	SHAP	Feature importance, model explanations
Data Processing	Pandas, NumPy	Data manipulation, feature engineering
Visualization	Matplotlib, Seaborn	Training plots, confusion matrices
Model Persistence	Joblib, Pickle	Saving/loading trained models

🔄 NLP Preprocessing Pipeline

Every input text goes through a standardized preprocessing pipeline before model inference:

Step	Operation	Tool
1. Text Cleaning	Remove HTML tags, URLs, special characters, extra whitespace	re (regex)
2. Lowercasing	Convert all text to lowercase for uniformity	Python str
3. Tokenization	Split text into individual word tokens	NLTK word_tokenize
4. Stop Word Removal	Remove common words (the, is, at, which) that add noise	NLTK stopwords
5. Lemmatization	Reduce words to base form (running → run)	NLTK WordNetLemmatizer
6. Vectorization	Convert tokens to numerical features (TF-IDF or embeddings)	sklearn / HuggingFace
7. Padding/Truncation	Normalize sequence length for deep learning models	Keras pad_sequences
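
Steps 1 to 5 and step 7 of the pipeline above can be sketched with the standard library alone. This is a dependency-free illustration, not the project's code: the small `STOP_WORDS` set and the `LEMMAS` lookup are stand-ins for NLTK's stop-word list and WordNetLemmatizer, whitespace splitting stands in for `word_tokenize`, and `pad_or_truncate` pads and truncates at the end of the sequence (comparable to `padding="post", truncating="post"` in Keras `pad_sequences`, whose defaults are `"pre"`). Step 6 (vectorization) is omitted.

```python
import re
from typing import List

# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "is", "at", "which", "a", "an", "and", "of", "to", "in"}
# Tiny stand-in for a lemmatizer: maps inflected forms to a base form.
LEMMAS = {"running": "run", "ran": "run", "articles": "article"}


def clean_text(text: str) -> str:
    """Step 1: strip HTML tags, URLs, non-letter characters, extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # punctuation and digits
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text: str) -> List[str]:
    """Steps 1-5: clean, lowercase, tokenize, drop stop words, lemmatize."""
    tokens = clean_text(text).lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


def pad_or_truncate(ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Step 7: cut or right-pad a sequence of token ids to exactly max_len."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


tokens = preprocess("<p>The reporter is running at https://example.com NOW!</p>")
print(tokens)
print(pad_or_truncate([5, 3, 9], max_len=5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4))
```

The order of operations matters here: tags are stripped before URLs so that a closing tag directly after a link is not swallowed by the URL pattern.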

📚 Key References

[1] Fernandes, K., Vinagre, P., Cortez, P. (2015). A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. EPIA 2015 (Portuguese Conference on Artificial Intelligence). Dataset available from the UCI ML Repository.
[2] Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H. (2017). Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter.
[3] Devlin, J., Chang, M., Lee, K., Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
[4] Ahmed, H., Traore, I., Saad, S. (2018). Detecting Opinion Spams and Fake News Using Text Classification. Security and Privacy.
[5] Truică, C., Apostol, E. (2024). Ensemble based high performance deep learning models for fake news detection. Nature Scientific Reports.
[6] Choudhary, A., Arora, A. (2024). LSTM, GRU, and BERT Models for Fake News Detection. Comparative Analysis.
[7] Shu, K., Mahudeswaran, D., Wang, S., Lee, D., Liu, H. (2020). FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information. Big Data.

TruthLens — Research Project