Social Sentiment Pipeline
An end-to-end data engineering pipeline that extracts Reddit posts and comments, scores their sentiment with VADER, and stores the results in PostgreSQL, with a Streamlit dashboard for visualization
Key Features
Reddit API Integration
PRAW-based data extraction driven by keyword search, focused on comments because post text produced noisier sentiment results
VADER Sentiment Analysis
Real-time sentiment scoring with positive, negative, neutral, and compound metrics
Text Cleaning Pipeline
Multi-step processing to remove URLs, markdown, and noise for accurate sentiment analysis
PostgreSQL Storage
Three interconnected tables tracking search queries, posts, and analyzed comments with full sentiment metrics
Streamlit Dashboard
Interactive visualization for exploring sentiment trends and insights from Reddit data
Rate Limit Handling
Graceful API rate limit management ensuring reliable data collection without service interruptions
Impact & Highlights
End-to-End Pipeline
Complete data engineering lifecycle from Reddit API to PostgreSQL to Streamlit visualization
Production-Ready Schema
Scalable database design supporting complex queries and future expansion
Iterative Improvement
Pivoted from post analysis to comment analysis for higher quality sentiment results
README.md
Project Overview
A comprehensive data engineering pipeline that extracts Reddit posts and comments based on keyword searches, analyzes their sentiment using VADER, and stores structured results in PostgreSQL with a Streamlit dashboard for visualization.
Architecture
The pipeline follows a modular design with distinct stages:
- Ingestion: Extract posts and comments from Reddit using PRAW
- Processing: Analyze sentiment and clean text data
- Storage: Store structured results in PostgreSQL
- Visualization: Interactive Streamlit dashboard for insights
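The stages above could be wired together roughly as follows. This is a minimal sketch rather than the project's actual code: `CommentRecord`, `run_pipeline`, and the three stage callables are hypothetical names, and the dashboard is assumed to read from the database separately, so it does not appear here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CommentRecord:
    """Hypothetical record handed from one stage to the next."""
    comment_id: str
    text: str
    compound: float = 0.0


def run_pipeline(
    ingest: Callable[[str], Iterable[CommentRecord]],
    process: Callable[[CommentRecord], CommentRecord],
    store: Callable[[List[CommentRecord]], int],
    keyword: str,
) -> int:
    """Run ingestion, processing, and storage for a single keyword.

    Returns whatever `store` reports (assumed here: rows written).
    """
    raw = ingest(keyword)                              # Ingestion
    processed = [process(record) for record in raw]    # Processing
    return store(processed)                            # Storage
```

Passing the stages in as callables keeps each one replaceable, so a stage can be swapped for a stub when testing the others.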
Key Features
Data Collection
- Reddit API integration via PRAW
- Keyword-based search queries
- Graceful handling of API rate limiting
- Focus on comments rather than posts, which gave less noisy sentiment results
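A keyword search that collects the highest-scored comments per post might look like the sketch below. It assumes `reddit` is an already authenticated `praw.Reddit` instance; the function name `iter_top_comments`, the limits, the choice of searching `r/all`, and the output keys are illustrative assumptions, not taken from the project.

```python
from typing import Any, Dict, Iterator


def iter_top_comments(
    reddit: Any,
    keyword: str,
    post_limit: int = 10,
    comments_per_post: int = 20,
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per comment for posts matching `keyword`.

    `reddit` is expected to behave like an authenticated praw.Reddit object.
    """
    for submission in reddit.subreddit("all").search(keyword, limit=post_limit):
        # Remove "load more comments" placeholders without fetching them,
        # which keeps the number of API requests per post small.
        submission.comments.replace_more(limit=0)
        comments = submission.comments.list()
        # Highest-scored comments first.
        comments.sort(key=lambda c: c.score, reverse=True)
        for comment in comments[:comments_per_post]:
            yield {
                "post_id": submission.id,
                "post_title": submission.title,
                "comment_id": comment.id,
                "body": comment.body,
                "score": comment.score,
            }
```

Because the function is a generator, downstream stages can start cleaning and scoring comments before the whole search has finished.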
Text Processing
- Multi-step text cleaning pipeline
- URL and markdown removal
- Noise reduction for accurate sentiment analysis
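The cleaning steps listed above can be illustrated with a few regular expressions. This is a sketch under assumptions: the exact steps and their order in the project are not documented here, and `clean_comment` is a hypothetical name. Punctuation and capitalization are left alone on purpose, since VADER uses both as intensity cues.

```python
import html
import re

_LINK_MD = re.compile(r"\[([^\]]+)\]\((?:[^)]+)\)")  # [label](url) -> label
_URL = re.compile(r"(?:https?://|www\.)\S+")          # bare URLs
_CODE = re.compile(r"`{1,3}[^`]*`{1,3}")              # inline or fenced code
_QUOTE = re.compile(r"^\s*>+\s?", re.MULTILINE)       # leading quote markers
_EMPHASIS = re.compile(r"[*~]+")                      # bold, italic, strikethrough
_WHITESPACE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Strip URLs and common markdown from a comment body."""
    text = html.unescape(text)        # e.g. "&gt;" becomes ">"
    text = _LINK_MD.sub(r"\1", text)  # keep the link label, drop the target
    text = _URL.sub(" ", text)
    text = _CODE.sub(" ", text)
    text = _QUOTE.sub("", text)
    text = _EMPHASIS.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


print(clean_comment("**Great** post! See [docs](https://example.com/x) or https://foo.bar/baz"))
# prints: Great post! See docs or
```

Markdown links are rewritten before bare URLs are removed, so the link label survives while its target does not.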
Sentiment Analysis
- VADER sentiment scoring
- Positive, negative, neutral, and compound metrics
- Real-time analysis pipeline
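Scoring a cleaned comment with VADER could be sketched as below. Two assumptions are made: that the `vaderSentiment` package is the VADER implementation in use (NLTK ships one as well), and that labels follow the conventional compound cutoffs of ±0.05. `score_text` and `label_from_compound` are hypothetical helper names.

```python
from functools import lru_cache
from typing import Dict


@lru_cache(maxsize=1)
def _analyzer():
    # Imported lazily (assumption: the vaderSentiment package is installed)
    # and cached so the lexicon is loaded only once.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def score_text(text: str) -> Dict[str, float]:
    """Return VADER's 'neg', 'neu', 'pos', and 'compound' scores."""
    return _analyzer().polarity_scores(text)


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

Storing all four scores and deriving the label at query time keeps the threshold adjustable without rescoring existing comments.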
Data Storage
- PostgreSQL database with three interconnected tables
- Tracks search queries, posts, and analyzed comments
- Detailed sentiment metrics storage
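One possible shape for the three tables is sketched below. Every table and column name is an assumption made for illustration, since the actual schema is not shown here. `create_schema` accepts any DB-API connection, for example one returned by `psycopg2.connect`.

```python
# Hypothetical DDL: queries own posts, posts own comments.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS search_queries (
    id          SERIAL PRIMARY KEY,
    keyword     TEXT NOT NULL,
    searched_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS posts (
    id        SERIAL PRIMARY KEY,
    query_id  INTEGER NOT NULL REFERENCES search_queries (id),
    reddit_id TEXT NOT NULL,
    title     TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS comments (
    id        SERIAL PRIMARY KEY,
    post_id   INTEGER NOT NULL REFERENCES posts (id),
    reddit_id TEXT NOT NULL,
    body      TEXT NOT NULL,
    positive  DOUBLE PRECISION NOT NULL,
    negative  DOUBLE PRECISION NOT NULL,
    neutral   DOUBLE PRECISION NOT NULL,
    compound  DOUBLE PRECISION NOT NULL CHECK (compound BETWEEN -1 AND 1)
);
"""


def create_schema(connection) -> None:
    """Create the tables if they do not exist yet."""
    cursor = connection.cursor()
    # Execute one statement at a time; some drivers reject multi-statement strings.
    for statement in SCHEMA_SQL.split(";"):
        if statement.strip():
            cursor.execute(statement)
    connection.commit()
```

The foreign keys make it possible to trace any scored comment back to its post and to the search that surfaced it.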
Technical Implementation
Technologies Used
- Python 3.9+: Core development language
- PRAW: Python Reddit API Wrapper for data collection
- VADER: Sentiment analysis engine
- PostgreSQL: Relational database for structured storage
- Pandas: Data manipulation and processing
- Pydantic: Data validation and schema enforcement
- Streamlit: Interactive dashboard and visualization
Challenges Overcome
- Data Quality: Shifted from analyzing top posts to top comments after finding that post text yielded noisy, uninformative sentiment results
- API Management: Implemented graceful handling of Reddit API rate limiting
- Text Cleaning: Built a robust multi-step pipeline to handle varied text formats and noise
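For the API management item above, one common way to handle rate limiting is to retry with exponential backoff. The helper below is a generic sketch, not the project's implementation: `call_with_backoff` and its parameters are hypothetical, and the exception types to retry on are left to the caller (with PRAW this might be a rate-limit exception from `prawcore`, depending on the installed version).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on the given exceptions with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ... by default
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so that tests can inject a recorder instead of actually waiting.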
Impact
- Built a complete data engineering lifecycle from API ingestion to visualization
- Achieved accurate sentiment analysis through iterative improvement
- Created production-ready database schema for scalable storage