// ABOUT THE PROJECT

TruthLens: Research Project

A dual-task AI system for fake news detection and news popularity prediction, built on transformer, recurrent, and ensemble NLP models and trained on public benchmark datasets that are widely used in misinformation research.

📌 Project Overview

TruthLens is an end-to-end machine learning project that addresses two critical challenges in modern information ecosystems: detecting misinformation and predicting content virality. The system combines Natural Language Processing (NLP), deep learning, and ensemble methods to provide real-time analysis of news articles.

The project implements a dual-model pipeline: a Fake News Detection module using BERT, BiLSTM, and ensemble classifiers trained on the ISOT, LIAR, and FakeNewsNet datasets; and a Popularity Prediction module using XGBoost and Random Forest trained on the UCI Online News Popularity dataset with 58 engineered features.

🔬 Research Background

Fake news detection has emerged as one of the most critical NLP challenges of the decade. According to recent research (Nature Scientific Reports, 2024), transformer-based models such as BERT consistently achieve the highest detection accuracy, often exceeding 99% on benchmark corpora, while recurrent networks reach about 98% (LSTM) and about 93% (GRU).

News popularity prediction leverages the UCI Online News Popularity dataset (Fernandes et al., 2015), which contains 39,797 articles from Mashable with 58 predictive features spanning content length, lexical diversity, sentiment scores, keyword density, and publication timing. Random Forest and XGBoost models achieve up to 88% accuracy on this task.

This project implements a hybrid approach combining both tasks in a unified Flask API, enabling simultaneous authenticity verification and virality forecasting for any given news article.
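
The combined analysis can be pictured as one function that runs both models on the same article and merges their outputs into a single response. The sketch below is illustrative only: `analyze_article`, the two stub predictors, and the response field names are hypothetical and are not taken from the project's code. In a Flask app, the returned dict would typically be passed to `jsonify` inside a route handler.

```python
from typing import Callable, Dict

# Hypothetical signatures: each predictor takes raw text and returns a score.
FakePredictor = Callable[[str], float]        # probability the article is fake
PopularityPredictor = Callable[[str], float]  # probability the article is popular


def analyze_article(text: str,
                    predict_fake: FakePredictor,
                    predict_popular: PopularityPredictor,
                    threshold: float = 0.5) -> Dict[str, object]:
    """Run both tasks on one article and return a JSON-serializable dict."""
    if not text or not text.strip():
        raise ValueError("text must be a non-empty string")
    p_fake = predict_fake(text)
    p_popular = predict_popular(text)
    return {
        "authenticity": {
            "label": "fake" if p_fake >= threshold else "real",
            "fake_probability": round(p_fake, 4),
        },
        "popularity": {
            "label": "popular" if p_popular >= threshold else "unpopular",
            "popular_probability": round(p_popular, 4),
        },
    }


# Stub predictors standing in for trained models (assumptions for the demo).
def stub_fake(text: str) -> float:
    return 0.9 if "shocking" in text.lower() else 0.1


def stub_popular(text: str) -> float:
    return min(1.0, len(text.split()) / 10)


result = analyze_article("SHOCKING cure discovered", stub_fake, stub_popular)
print(result)
```

Passing the predictors in as arguments keeps the merging logic independent of any particular model, so either module can be swapped without touching the response format.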

🤖 Machine Learning Models

| Model | Task | Accuracy | Framework | Key Feature |
| --- | --- | --- | --- | --- |
| BERT (bert-base-uncased) | Fake Detection | 99.9% | HuggingFace Transformers | Bidirectional contextual embeddings |
| BiLSTM + Attention | Fake Detection | 98.0% | TensorFlow / Keras | Sequential context, both directions |
| RoBERTa | Fake Detection | 99.5% | HuggingFace Transformers | Robustly optimized BERT pretraining |
| Logistic Regression + TF-IDF | Fake Detection | 98.7% | Scikit-learn | Fast, interpretable baseline |
| Random Forest | Fake Detection + Popularity | 96.5% | Scikit-learn | Ensemble, feature importance |
| XGBoost | Popularity Prediction | 88.0% | XGBoost | Gradient boosting, 58 features |
| Ensemble Voting (BERT+BiLSTM+RF) | Fake Detection | 99.5% | Custom | Soft voting, maximum robustness |
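
Soft voting, as used by the ensemble row, averages the class probabilities produced by each model and picks the class with the highest mean. The following is a minimal sketch of that idea, not the project's custom implementation: `soft_vote`, the optional weights, and the probability values are made up for illustration.

```python
from typing import Dict, List, Optional, Sequence


def soft_vote(probas: Dict[str, Sequence[float]],
              weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Average per-class probabilities across models (weighted if given)."""
    if not probas:
        raise ValueError("need at least one model")
    names = list(probas)
    n_classes = len(probas[names[0]])
    if any(len(probas[n]) != n_classes for n in names):
        raise ValueError("all models must output the same number of classes")
    w = {n: 1.0 for n in names} if weights is None else weights
    total = sum(w[n] for n in names)
    return [sum(w[n] * probas[n][c] for n in names) / total
            for c in range(n_classes)]


# Made-up probabilities for one article, ordered [P(real), P(fake)].
model_outputs = {
    "bert": [0.10, 0.90],
    "bilstm": [0.40, 0.60],
    "random_forest": [0.70, 0.30],
}
avg = soft_vote(model_outputs)
label = ["real", "fake"][avg.index(max(avg))]
print(avg, label)
```

Unlike hard (majority) voting, this lets a confident model outweigh two uncertain ones, which is the usual reason for choosing it when the members output calibrated probabilities.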

📦 Datasets Used

| Dataset | Size | Task | Source | Labels |
| --- | --- | --- | --- | --- |
| ISOT Fake News Dataset | 44,898 articles | Fake Detection | Reuters + Flagged sites | Real / Fake |
| LIAR Dataset | 12,836 statements | Fake Detection | PolitiFact | 6-class (pants-fire → true) |
| FakeNewsNet (PolitiFact) | ~12,000 articles | Fake Detection | PolitiFact + GossipCop | Real / Fake + social context |
| COVID-19 Fake News | ~10,000 articles | Fake Detection | Multiple sources | Real / Fake |
| UCI Online News Popularity | 39,797 articles | Popularity Prediction | Mashable (2013–2015) | Share count (regression/classification) |
| Fakeddit | 1,063,106 samples | Multimodal Fake Detection | Reddit | 6-way veracity labels |
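
The UCI share count can be used directly as a regression target or turned into a class label by thresholding. A minimal sketch of the classification framing follows. The function name and sample values are hypothetical, and the default cutoff of 1400 shares is an assumption (a cutoff commonly used with this dataset), since the threshold this project uses is not stated.

```python
from typing import Iterable, List


def binarize_shares(shares: Iterable[int], threshold: int = 1400) -> List[int]:
    """Map share counts to 1 (popular) or 0 (not popular)."""
    return [1 if s >= threshold else 0 for s in shares]


sample_shares = [593, 1400, 711, 12000]  # made-up values
labels = binarize_shares(sample_shares)
print(labels)
```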

⚙️ Feature Engineering (58 Features)

The popularity prediction model uses 58 engineered features grouped into 6 categories:

| Category | Features |
| --- | --- |
| Content Length | n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens |
| Links & Media | num_hrefs, num_self_hrefs, num_imgs, num_videos, average_token_length |
| Keywords & Topics | num_keywords, data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, data_channel_is_world |
| Sentiment & Subjectivity | global_subjectivity, global_sentiment_polarity, global_rate_positive_words, global_rate_negative_words, title_subjectivity, title_sentiment_polarity, avg_positive_polarity, avg_negative_polarity |
| Publication Timing | weekday_is_monday through weekday_is_sunday, is_weekend, timedelta (marked non-predictive in the UCI attribute list) |
| LDA Topics | LDA_00 through LDA_04 (5 latent topic distributions) |
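
To show how a few of the content-length features could be derived from raw text, here is a small sketch. It assumes a simple regex tokenizer, which may not match how the original dataset was built, and `content_features` is a hypothetical helper. In the UCI description, `n_unique_tokens` is a rate (unique words divided by total words), not a count, and the sketch follows that definition.

```python
import re
from typing import Dict

# Assumed tokenization rule: runs of letters, digits, and apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def content_features(title: str, content: str) -> Dict[str, float]:
    """Compute a handful of length-based features for one article."""
    title_tokens = WORD_RE.findall(title.lower())
    tokens = WORD_RE.findall(content.lower())
    n = len(tokens)
    return {
        "n_tokens_title": float(len(title_tokens)),
        "n_tokens_content": float(n),
        # Rate of unique words in the content (0.0 for empty content).
        "n_unique_tokens": len(set(tokens)) / n if n else 0.0,
        # Mean word length in characters (0.0 for empty content).
        "average_token_length": sum(map(len, tokens)) / n if n else 0.0,
    }


feats = content_features("Cats rule the web", "Cats nap. Cats play. Dogs nap too.")
print(feats)
```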

🛠️ Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Frontend | HTML5, CSS3, Vanilla JavaScript | UI, animations, interactive demo |
| Backend | Python 3.11, Flask 3.x | REST API, model serving, routing |
| NLP Preprocessing | NLTK, spaCy, re | Tokenization, lemmatization, cleaning |
| Text Vectorization | TF-IDF (sklearn), Word2Vec (gensim) | Feature extraction for classical ML |
| Deep Learning | TensorFlow 2.x / Keras | LSTM, BiLSTM, CNN-LSTM models |
| Transformers | HuggingFace Transformers, PyTorch | BERT, RoBERTa fine-tuning |
| Classical ML | Scikit-learn | LR, RF, SVM, Naive Bayes |
| Gradient Boosting | XGBoost, LightGBM | Popularity prediction |
| Explainability | SHAP | Feature importance, model explanations |
| Data Processing | Pandas, NumPy | Data manipulation, feature engineering |
| Visualization | Matplotlib, Seaborn | Training plots, confusion matrices |
| Model Persistence | Joblib, Pickle | Saving/loading trained models |

🔄 NLP Preprocessing Pipeline

Every input text goes through a standardized preprocessing pipeline before model inference:

| Step | Operation | Tool |
| --- | --- | --- |
| 1. Text Cleaning | Remove HTML tags, URLs, special characters, extra whitespace | re (regex) |
| 2. Lowercasing | Convert all text to lowercase for uniformity | Python str |
| 3. Tokenization | Split text into individual word tokens | NLTK word_tokenize |
| 4. Stop Word Removal | Remove common words (the, is, at, which) that add noise | NLTK stopwords |
| 5. Lemmatization | Reduce words to base form (running → run) | NLTK WordNetLemmatizer |
| 6. Vectorization | Convert tokens to numerical features (TF-IDF or embeddings) | sklearn / HuggingFace |
| 7. Padding/Truncation | Normalize sequence length for deep learning models | Keras pad_sequences |
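
Steps 1 to 4 and step 7 can be sketched with the standard library alone. This is a simplified stand-in under stated assumptions, not the project's pipeline: it splits on whitespace instead of calling NLTK `word_tokenize`, uses a tiny hand-written stop word set instead of the NLTK list, and skips lemmatization because that needs the WordNet data. The function names are hypothetical. Keras `pad_sequences` pads and truncates at the front by default, whereas this sketch does both at the end.

```python
import html
import re
from typing import List

# Tiny stand-in stop word set; the pipeline above uses NLTK's English list.
STOP_WORDS = {"the", "is", "at", "which", "a", "an", "and", "of", "to", "in", "on"}

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def preprocess(text: str) -> List[str]:
    """Clean, lowercase, tokenize, and remove stop words."""
    text = TAG_RE.sub(" ", text)        # step 1: strip HTML tags
    text = html.unescape(text)          # step 1: decode entities such as &amp;
    text = URL_RE.sub(" ", text)        # step 1: strip URLs
    text = text.lower()                 # step 2: lowercase
    text = NON_ALPHA_RE.sub(" ", text)  # step 1: drop digits and punctuation
    tokens = text.split()               # step 3: split; also collapses whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # step 4


def pad_or_truncate(ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Step 7: cut or right-pad a list of token ids to exactly max_len items."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


tokens = preprocess("<p>The Senate is voting at https://example.com/x on Bill #42!</p>")
print(tokens)
print(pad_or_truncate([5, 3, 8], max_len=5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4))
```

Tags are removed before entities are decoded so that an escaped `&lt;` in the article body is not mistaken for the start of a tag.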

📚 Key References

[1] Fernandes, K., Vinagre, P., Cortez, P. (2015). A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. UCI ML Repository.
[2] Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H. (2017). Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD.
[3] Devlin, J., Chang, M., Lee, K., Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
[4] Ahmed, H., Traore, I., Saad, S. (2018). Detecting Opinion Spams and Fake News Using Text Classification. Security and Privacy.
[5] Truică, C., Apostol, E. (2024). Ensemble based high performance deep learning models for fake news detection. Nature Scientific Reports.
[6] Choudhary, A., Arora, A. (2024). LSTM, GRU, and BERT Models for Fake News Detection. Comparative Analysis.
[7] Shu, K., Mahudeswaran, D., Wang, S., Lee, D., Liu, H. (2020). FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information. Big Data.
