A dual-task AI system for fake news detection and news popularity prediction, built on transformer, recurrent, and tree-based models and trained on public benchmark datasets (ISOT, LIAR, FakeNewsNet, and UCI Online News Popularity).
TruthLens is an end-to-end machine learning project that addresses two critical challenges in modern information ecosystems: detecting misinformation and predicting content virality. The system combines Natural Language Processing (NLP), deep learning, and ensemble methods to provide real-time analysis of news articles.
The project implements a dual-model pipeline: a Fake News Detection module using BERT, BiLSTM, and ensemble classifiers trained on the ISOT, LIAR, and FakeNewsNet datasets; and a Popularity Prediction module using XGBoost and Random Forest trained on the UCI Online News Popularity dataset with 58 engineered features.
Fake news detection has become one of the most prominent NLP challenges of the decade. Recent research (Nature Scientific Reports, 2024) reports that transformer-based models such as BERT achieve the highest detection accuracy, often exceeding 99% on benchmark corpora, while recurrent models reach about 98% (LSTM) and about 93% (GRU).
News popularity prediction leverages the UCI Online News Popularity dataset (Fernandes et al., 2015), which contains 39,797 articles from Mashable with 58 predictive features spanning content length, lexical diversity, sentiment scores, keyword density, and publication timing. Random Forest and XGBoost models achieve up to 88% accuracy on this task.
This project implements a hybrid approach combining both tasks in a unified Flask API, enabling simultaneous authenticity verification and virality forecasting for any given news article.
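
To illustrate how a single request could return both results, here is a minimal sketch of the merging logic a Flask route might wrap. The function names, stub models, response keys, and the 0.5 threshold are all assumptions made for illustration; they are not taken from the project code.

```python
import json
from typing import Callable, Dict

# Hypothetical stand-ins for the trained models. The real project would load
# its fitted classifiers here instead of returning fixed values.
def stub_fake_probability(text: str) -> float:
    """Pretend detector: fixed probability that the text is fake."""
    return 0.12

def stub_popularity_probability(text: str) -> float:
    """Pretend predictor: fixed probability that the article is popular."""
    return 0.71

def analyze_article(
    text: str,
    fake_model: Callable[[str], float] = stub_fake_probability,
    popularity_model: Callable[[str], float] = stub_popularity_probability,
    threshold: float = 0.5,  # assumed decision threshold for both tasks
) -> Dict[str, Dict[str, object]]:
    """Run both tasks on one article and merge the results into one payload."""
    if not text or not text.strip():
        raise ValueError("text must be a non-empty string")
    p_fake = fake_model(text)
    p_popular = popularity_model(text)
    return {
        "authenticity": {
            "label": "fake" if p_fake >= threshold else "real",
            "fake_probability": round(p_fake, 4),
        },
        "popularity": {
            "label": "popular" if p_popular >= threshold else "unpopular",
            "popular_probability": round(p_popular, 4),
        },
    }

result = analyze_article("Example headline and body text.")
print(json.dumps(result, indent=2))
```

Inside a Flask view, the returned dictionary could be passed to `flask.jsonify` to produce the HTTP response.
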
| Model | Task | Accuracy | Framework | Key Feature |
|---|---|---|---|---|
| BERT (bert-base-uncased) | Fake Detection | 99.9% | HuggingFace Transformers | Bidirectional contextual embeddings |
| BiLSTM + Attention | Fake Detection | 98.0% | TensorFlow / Keras | Sequential context, both directions |
| RoBERTa | Fake Detection | 99.5% | HuggingFace Transformers | Robustly optimized BERT pretraining |
| Logistic Regression + TF-IDF | Fake Detection | 98.7% | Scikit-learn | Fast, interpretable baseline |
| Random Forest | Fake Detection + Popularity | 96.5% | Scikit-learn | Ensemble, feature importance |
| XGBoost | Popularity Prediction | 88.0% | XGBoost | Gradient boosting, 58 features |
| Ensemble Voting (BERT+BiLSTM+RF) | Fake Detection | 99.5% | Custom | Soft voting, maximum robustness |
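
The soft voting used by the ensemble row can be sketched as follows: each model emits class probabilities, the probabilities are averaged per class, and the class with the highest average wins. The probability values, the equal weights, and the class order (`real`, `fake`) are illustrative assumptions, not outputs of the project's models.

```python
from typing import List, Optional, Sequence

def soft_vote(
    probabilities: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return the (optionally weighted) mean probability for each class."""
    if not probabilities:
        raise ValueError("at least one model output is required")
    n_classes = len(probabilities[0])
    if any(len(p) != n_classes for p in probabilities):
        raise ValueError("all models must output the same number of classes")
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]

CLASSES = ["real", "fake"]  # assumed class order

# Made-up outputs standing in for BERT, BiLSTM, and Random Forest.
model_outputs = [
    [0.10, 0.90],
    [0.40, 0.60],
    [0.55, 0.45],
]

averaged = soft_vote(model_outputs)
predicted = CLASSES[averaged.index(max(averaged))]
print(averaged, predicted)
```

Unlike hard (majority) voting, soft voting lets a confident model outweigh two hesitant ones, which is why the averaged probabilities, not the individual labels, decide the result.
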
| Dataset | Size | Task | Source | Labels |
|---|---|---|---|---|
| ISOT Fake News Dataset | 44,898 articles | Fake Detection | Reuters + Flagged sites | Real / Fake |
| LIAR Dataset | 12,836 statements | Fake Detection | PolitiFact | 6-class (pants-fire → true) |
| FakeNewsNet (PolitiFact) | ~12,000 articles | Fake Detection | PolitiFact + GossipCop | Real / Fake + social context |
| COVID-19 Fake News | ~10,000 articles | Fake Detection | Multiple sources | Real / Fake |
| UCI Online News Popularity | 39,797 articles | Popularity Prediction | Mashable (2013–2015) | Share count (regression/classification) |
| Fakeddit | 1,063,106 samples | Multimodal Fake Detection | Reddit | 6-way veracity labels |
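
The UCI share count can serve as a regression target or be turned into a classification label. A minimal sketch of the classification variant is shown below; the 1,400-share cut-off is an assumption (a split commonly applied to this dataset), and this document does not state which threshold the project uses.

```python
from typing import Iterable, List

# Assumed cut-off between "unpopular" and "popular"; not stated in this document.
POPULARITY_THRESHOLD = 1400

def binarize_shares(
    shares: Iterable[int], threshold: int = POPULARITY_THRESHOLD
) -> List[int]:
    """Map raw share counts to 1 (popular) or 0 (unpopular)."""
    return [1 if s >= threshold else 0 for s in shares]

labels = binarize_shares([593, 1400, 2300, 711, 18000])
print(labels)  # [0, 1, 1, 0, 1]
```
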
The popularity prediction model uses 58 engineered features grouped into 6 categories:
| Category | Features |
|---|---|
| Content Length | n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens |
| Links & Media | num_hrefs, num_self_hrefs, num_imgs, num_videos, average_token_length |
| Keywords & Topics | num_keywords, data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, data_channel_is_world |
| Sentiment & Subjectivity | global_subjectivity, global_sentiment_polarity, global_rate_positive_words, global_rate_negative_words, title_subjectivity, title_sentiment_polarity, avg_positive_polarity, avg_negative_polarity |
| Publication Timing | weekday_is_monday through weekday_is_sunday, is_weekend, timedelta |
| LDA Topics | LDA_00 through LDA_04 (5 latent topic distributions) |
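
As an illustration of the content-length group, the sketch below derives a few of these features from raw text using only the standard library. The regex tokenizer and the reading of `n_unique_tokens` as the ratio of unique to total tokens are assumptions; the values shipped with the dataset were computed with its authors' own tooling and may differ.

```python
import re
from typing import Dict

# Simple word tokenizer used as a stand-in for the dataset's original tooling.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")

def content_length_features(title: str, content: str) -> Dict[str, float]:
    """Compute a handful of content-length features for one article."""
    title_tokens = TOKEN_RE.findall(title.lower())
    content_tokens = TOKEN_RE.findall(content.lower())
    n_content = len(content_tokens)
    return {
        "n_tokens_title": float(len(title_tokens)),
        "n_tokens_content": float(n_content),
        # Ratio of unique tokens to all tokens; 0.0 for empty content.
        "n_unique_tokens": len(set(content_tokens)) / n_content if n_content else 0.0,
        # Mean number of characters per token; 0.0 for empty content.
        "average_token_length": (
            sum(len(t) for t in content_tokens) / n_content if n_content else 0.0
        ),
    }

features = content_length_features(
    "Cats take over the internet",
    "The cats are here and the cats are staying",
)
print(features)
```
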
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | HTML5, CSS3, Vanilla JavaScript | UI, animations, interactive demo |
| Backend | Python 3.11, Flask 3.x | REST API, model serving, routing |
| NLP Preprocessing | NLTK, spaCy, re | Tokenization, lemmatization, cleaning |
| Text Vectorization | TF-IDF (sklearn), Word2Vec (gensim) | Feature extraction for classical ML |
| Deep Learning | TensorFlow 2.x / Keras | LSTM, BiLSTM, CNN-LSTM models |
| Transformers | HuggingFace Transformers, PyTorch | BERT, RoBERTa fine-tuning |
| Classical ML | Scikit-learn | LR, RF, SVM, Naive Bayes |
| Gradient Boosting | XGBoost, LightGBM | Popularity prediction |
| Explainability | SHAP | Feature importance, model explanations |
| Data Processing | Pandas, NumPy | Data manipulation, feature engineering |
| Visualization | Matplotlib, Seaborn | Training plots, confusion matrices |
| Model Persistence | Joblib, Pickle | Saving/loading trained models |
Every input text goes through a standardized preprocessing pipeline before model inference:
| Step | Operation | Tool |
|---|---|---|
| 1. Text Cleaning | Remove HTML tags, URLs, special characters, extra whitespace | re (regex) |
| 2. Lowercasing | Convert all text to lowercase for uniformity | Python str |
| 3. Tokenization | Split text into individual word tokens | NLTK word_tokenize |
| 4. Stop Word Removal | Remove common words (the, is, at, which) that add noise | NLTK stopwords |
| 5. Lemmatization | Reduce words to base form (running → run) | NLTK WordNetLemmatizer |
| 6. Vectorization | Convert tokens to numerical features (TF-IDF or embeddings) | sklearn / HuggingFace |
| 7. Padding/Truncation | Normalize sequence length for deep learning models | Keras pad_sequences |
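
The pipeline above can be approximated with the standard library alone. The sketch below covers cleaning, lowercasing, tokenization, stop word removal, and padding/truncation. It uses a tiny hand-written stop word set and whitespace splitting in place of NLTK, skips lemmatization and vectorization, and pads at the end of the sequence (Keras `pad_sequences` pads at the front by default). All function names here are hypothetical.

```python
import re
from typing import List

# Tiny illustrative subset; the project uses the NLTK stop word list.
STOP_WORDS = {"the", "is", "at", "which", "a", "an", "and", "of", "to", "in"}

def clean_text(text: str) -> str:
    """Steps 1-2: strip HTML tags, URLs, and special characters, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # special characters
    return re.sub(r"\s+", " ", text).strip()            # extra whitespace

def preprocess(text: str) -> List[str]:
    """Steps 3-4: split on whitespace and drop stop words."""
    return [tok for tok in clean_text(text).split() if tok not in STOP_WORDS]

def pad_or_truncate(tokens: List[str], max_len: int, pad_token: str = "<pad>") -> List[str]:
    """Step 7: force the sequence to exactly max_len tokens (padding at the end)."""
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))

tokens = preprocess("<p>BREAKING: The mayor is at https://example.com/story NOW!</p>")
padded = pad_or_truncate(tokens, max_len=5)
print(tokens)
print(padded)
```
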
[1] Fernandes, K., Vinagre, P., Cortez, P. (2015). A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. UCI ML Repository.
[2] Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H. (2017). Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD.
[3] Devlin, J., Chang, M., Lee, K., Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
[4] Ahmed, H., Traore, I., Saad, S. (2018). Detecting Opinion Spams and Fake News Using Text Classification. Security and Privacy.
[5] Truică, C., Apostol, E. (2024). Ensemble based high performance deep learning models for fake news detection. Nature Scientific Reports.
[6] Choudhary, A., Arora, A. (2024). LSTM, GRU, and BERT Models for Fake News Detection. Comparative Analysis.
[7] Shu, K., Mahudeswaran, D., Wang, S., Lee, D., Liu, H. (2020). FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information. Big Data.