🎯 Content Aggregation System

WhatsApp Group Link Aggregator & Semantic Clustering

Implementation Plan v1.0 | Generated by Jedibot AI

📋 Executive Summary

This system is an intelligent content aggregation platform that collects links from WhatsApp groups, extracts metadata and transcripts, groups the content by topic (semantic clustering), and generates automatic summaries.

🎯 Core Value: Turn WhatsApp noise (dozens of links per day) into structured insights that can be consumed in 5 minutes.

Key Features:

πŸ—οΈ System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ CONTENT AGGREGATION SYSTEM β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ DATA SOURCES │───▢│ INGESTION │───▢│ PROCESSING β”‚ β”‚ β”‚ β”‚ LAYER β”‚ β”‚ LAYER β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β€’ WhatsApp API β”‚ β”‚ β€’ WhatsApp Bot β”‚ β”‚ β€’ Metadata β”‚ β”‚ β€’ Webhooks β”‚ β”‚ β€’ Link Parser β”‚ β”‚ Extractor β”‚ β”‚ β€’ Manual Input β”‚ β”‚ β€’ Queue Manager β”‚ β”‚ β€’ S2T (Whisper) β”‚ β”‚ β€’ RSS Feeds β”‚ β”‚ β€’ Rate Limiter β”‚ β”‚ β€’ OCR (Tesseract)β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ OUTPUT LAYER │◀───│ CLUSTERING │◀───│ STORAGE β”‚ β”‚ β”‚ β”‚ & AI β”‚ β”‚ LAYER β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β€’ HTML Dashboardβ”‚ β”‚ β€’ Embedding β”‚ β”‚ β€’ PostgreSQL β”‚ β”‚ β€’ JSON API β”‚ β”‚ (OpenAI) β”‚ β”‚ β€’ Redis Cache β”‚ β”‚ β€’ RSS Feed β”‚ β”‚ β€’ Topic Model β”‚ β”‚ β€’ 
S3/Object β”‚ β”‚ β€’ Notifications β”‚ β”‚ β€’ LLM Summary β”‚ β”‚ β€’ Vector DB β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Flow:

Step 1: Ingestion 🔄

The WhatsApp bot monitors the groups → captures links → validates URLs → pushes them to the queue (Redis/RabbitMQ)
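A minimal sketch of the capture-and-validate part of this step, assuming each group message arrives as plain text. The names `extract_links`, `detect_platform`, and `PLATFORM_HOSTS` (including its host list) are illustrative assumptions, not part of the plan.

```python
import re
from urllib.parse import urlparse

# Hypothetical host-to-platform mapping; extend it to match the platforms you track.
PLATFORM_HOSTS = {
    "instagram.com": "instagram",
    "twitter.com": "twitter",
    "x.com": "twitter",
    "tiktok.com": "tiktok",
    "facebook.com": "facebook",
    "youtube.com": "youtube",
    "youtu.be": "youtube",
}

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def detect_platform(hostname: str) -> str:
    """Map a hostname (with or without a subdomain) to a platform label."""
    host = hostname.lower()
    for known, platform in PLATFORM_HOSTS.items():
        if host == known or host.endswith("." + known):
            return platform
    return "other"


def extract_links(message_text: str) -> list[dict]:
    """Return validated, de-duplicated links found in one chat message."""
    seen = set()
    links = []
    for raw in URL_PATTERN.findall(message_text):
        url = raw.rstrip(".,;:!?)")  # drop punctuation that trails a URL in prose
        try:
            parsed = urlparse(url)
            hostname = parsed.hostname
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if parsed.scheme.lower() not in ("http", "https") or not hostname:
            continue  # not a usable web URL
        if url in seen:
            continue  # same link repeated within the message
        seen.add(url)
        links.append({"url": url, "platform": detect_platform(hostname)})
    return links
```

Each returned dictionary could then be pushed onto the queue as one job; the `platform` value is meant to line up with the `platform` column of the `links` table.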

Step 2: Extraction 🔍

A worker processes the queue → scrapes metadata (Open Graph, JSON-LD) → downloads media → runs speech-to-text on videos → runs OCR on images
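As an illustration of the Open Graph part of this step, the sketch below pulls `og:*` meta tags out of already-downloaded HTML using only the standard library. The class and function names are hypothetical, and fetching the page, JSON-LD parsing, and media download are deliberately left out.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first occurrence of each property (e.g. the first og:image).
            self.metadata.setdefault(prop[3:], content)


def extract_open_graph(html: str) -> dict:
    """Return a dict such as {"title": ..., "image": ...} from raw HTML."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.metadata
```

The resulting dictionary could be stored as-is in `raw_metadata`, with `title` and `description` copied into their own columns.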

Step 3: Embedding & Clustering 🧠

Generate text embeddings (OpenAI/local) → compute semantic similarity → hierarchical clustering → automatic topic labeling

Step 4: Summarization 📝

The LLM (Qwen/GPT-4) reads each cluster → generates a summary → extracts key insights → runs sentiment analysis

Step 5: Presentation 📊

Update the HTML dashboard → generate RSS → send notifications (Telegram/Email) → archive to the database
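The "generate RSS" part of this step could look roughly like the following sketch, which builds an RSS 2.0 document with the standard library. The function name `build_rss` and the item keys (`title`, `link`, `summary`, `published_at`) are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Build an RSS 2.0 feed string from a list of summary items."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "description").text = item["summary"]
        # RSS expects RFC 822 style dates; published_at must be timezone-aware.
        ET.SubElement(node, "pubDate").text = format_datetime(item["published_at"])

    # ElementTree escapes special characters such as "&" automatically.
    return ET.tostring(rss, encoding="unicode")
```

For example, `build_rss("Daily Digest", "http://localhost:3000", "Cluster summaries", items)` would return a string that a scheduled job could write to disk or serve from the JSON API process.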

🛠️ Technology Stack

🎭 Frontend

  • Next.js 14 (App Router)
  • Tailwind CSS + shadcn/ui
  • React Query (data fetching)
  • D3.js / Recharts (visualization)
  • PWA support (mobile)

⚙️ Backend

  • Node.js + Express / Fastify
  • Python + FastAPI (ML services)
  • Bull Queue (Redis) for jobs
  • WebSocket (real-time updates)
  • Bull Board (queue monitoring)

🗄️ Database & Storage

  • PostgreSQL (relational data)
  • Redis (cache & queue)
  • Pinecone/Chroma (vector DB)
  • MinIO/S3 (media storage)
  • Elasticsearch (full-text search)

🤖 AI / ML

  • OpenAI API (embeddings, summary)
  • Whisper API (speech-to-text)
  • Qwen (local/cheap summarization)
  • Hugging Face (topic modeling)
  • yt-dlp, gallery-dl (scrapers)

📱 WhatsApp Integration

  • whatsmeow (Go library)
  • Baileys (Node.js - alt)
  • WhatsApp Business API (official)
  • Webhook server (Express)
  • QR code pairing

🚀 DevOps

  • Docker + Docker Compose
  • Nginx (reverse proxy)
  • PM2 (process manager)
  • Cron jobs (scheduled tasks)
  • Grafana + Prometheus (monitoring)

📐 Database Schema

-- Required for the vector(1536) column below (pgvector extension)
CREATE EXTENSION IF NOT EXISTS vector;

-- Links Table (Raw content)
CREATE TABLE links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url TEXT NOT NULL UNIQUE,
    platform VARCHAR(20) NOT NULL,           -- instagram, twitter, tiktok, facebook
    source_chat_id VARCHAR(50),
    source_message_id VARCHAR(100),
    shared_by VARCHAR(100),
    shared_at TIMESTAMP DEFAULT NOW(),
    status VARCHAR(20) DEFAULT 'pending',    -- pending, processing, completed, failed
    created_at TIMESTAMP DEFAULT NOW()
);

-- Content Table (Extracted metadata)
CREATE TABLE contents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    link_id UUID REFERENCES links(id),
    title TEXT,
    description TEXT,
    author VARCHAR(100),
    published_at TIMESTAMP,
    media_type VARCHAR(20),                  -- video, image, article
    media_url TEXT,
    thumbnail_url TEXT,
    transcript TEXT,                         -- S2T result
    raw_metadata JSONB,
    extracted_at TIMESTAMP DEFAULT NOW()
);

-- Clusters Table (Topic groups)
CREATE TABLE clusters (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(200) NOT NULL,
    description TEXT,
    topic_vector vector(1536),               -- OpenAI embedding
    content_count INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Cluster-Content Junction
CREATE TABLE cluster_contents (
    cluster_id UUID REFERENCES clusters(id),
    content_id UUID REFERENCES contents(id),
    similarity_score FLOAT,
    PRIMARY KEY (cluster_id, content_id)
);

-- Summaries Table (AI-generated)
CREATE TABLE summaries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    cluster_id UUID REFERENCES clusters(id),
    summary_text TEXT NOT NULL,
    key_insights JSONB,
    sentiment VARCHAR(20),                   -- positive, negative, neutral
    generated_by VARCHAR(50),                -- model name
    generated_at TIMESTAMP DEFAULT NOW()
);

-- Feed/Dashboard Cache
CREATE TABLE feed_cache (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    feed_type VARCHAR(50),                   -- timeline, topic, trending
    content JSONB NOT NULL,
    generated_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP
);

🔌 Platform-Specific Extraction

Platform | Library | Data Extracted | S2T/OCR
📸 Instagram | instaloader, gallery-dl | Caption, hashtags, likes, comments | Video: Whisper
🐦 Twitter/X | ntscraper, snscrape | Tweet text, media, engagement | Video: Whisper
🎵 TikTok | yt-dlp (TikTok extractor), TikTok-Api | Caption, hashtags, music, stats | Video: Whisper (required)
📘 Facebook | facebook-scraper | Post text, reactions, shares | Video: Whisper
▶️ YouTube | yt-dlp, youtube-transcript-api | Title, description, captions | Caption/Whisper

🧠 Semantic Clustering Algorithm

Approach: Hierarchical Clustering with Embeddings

  1. Text Preprocessing: Clean the transcript, caption, and description
  2. Embedding Generation: OpenAI text-embedding-3-small (cheap) or a local model
  3. Similarity Matrix: Cosine similarity between content items
  4. Hierarchical Clustering: Agglomerative clustering with a dynamic threshold
  5. Topic Labeling: The LLM reads each cluster → generates a topic name automatically
  6. Re-clustering: Incremental clustering for new content (without recomputing everything)
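// Pseudo-code: incremental cluster assignment.
// Assumes `openai` is an initialized OpenAI client, `db` is a node-postgres pool,
// and assignToCluster / createNewCluster are helpers defined elsewhere.
async function clusterContent(newContent) {
  // 1. Generate embedding
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: newContent.combinedText
  });
  const embedding = response.data[0].embedding;
  // pgvector accepts a '[x,y,...]' text literal, which JSON.stringify produces
  const vectorLiteral = JSON.stringify(embedding);

  // 2. Find similar clusters (<=> is pgvector's cosine distance operator)
  const { rows: similarClusters } = await db.query(`
    SELECT id AS cluster_id,
           1 - (topic_vector <=> $1::vector) AS similarity
    FROM clusters
    WHERE 1 - (topic_vector <=> $1::vector) > 0.85
    ORDER BY similarity DESC
    LIMIT 5
  `, [vectorLiteral]);

  // 3. Assign to the best cluster or create a new one
  if (similarClusters.length > 0 && similarClusters[0].similarity > 0.90) {
    await assignToCluster(newContent.id, similarClusters[0].cluster_id);
  } else {
    await createNewCluster(newContent, embedding);
  }
}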
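
To experiment with the assignment rule without a database, the same idea can be sketched in memory with plain Python. This is an assumption-laden illustration: embeddings are taken as already computed, `Cluster` and `assign_to_cluster` are hypothetical names, and updating the centroid as a running mean is one possible choice that the plan itself does not specify.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Cluster:
    centroid: list
    member_ids: list = field(default_factory=list)


def assign_to_cluster(clusters, content_id, embedding, threshold=0.90):
    """Attach content to the most similar cluster, or start a new one.

    Returns the index of the cluster that received the content.
    """
    best_index, best_score = -1, -1.0
    for index, cluster in enumerate(clusters):
        score = cosine_similarity(cluster.centroid, embedding)
        if score > best_score:
            best_index, best_score = index, score

    if best_index >= 0 and best_score > threshold:
        cluster = clusters[best_index]
        n = len(cluster.member_ids)
        # Running mean: update the centroid without revisiting old members.
        cluster.centroid = [
            (c * n + e) / (n + 1) for c, e in zip(cluster.centroid, embedding)
        ]
        cluster.member_ids.append(content_id)
        return best_index

    clusters.append(Cluster(centroid=list(embedding), member_ids=[content_id]))
    return len(clusters) - 1
```

Because only the centroids are compared, each new item costs one pass over the existing clusters rather than a full re-clustering, which is the point of step 6 above.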

📊 Dashboard Views

1. Timeline View 📅

A chronological stream of all incoming links, filterable by platform, date range, and keyword.

2. Topic/Cluster View 🗂️

Content is grouped by topic with:

3. Summary Feed 📝

Daily/weekly summaries:

4. Search & Discovery 🔍

🚀 Implementation Phases

Phase | Duration | Deliverables | Priority
Phase 1: MVP | 2-3 weeks | WhatsApp bot + basic extraction + simple dashboard | HIGH
Phase 2: Clustering | 2 weeks | Embedding + clustering + topic labeling | HIGH
Phase 3: AI Summary | 1-2 weeks | Auto-summary + insights + sentiment | MEDIUM
Phase 4: Advanced | 2 weeks | Search, notifications, PWA, analytics | LOW

💰 Cost Estimation

Infrastructure (VPS)

API Costs (per 1000 links)

💡 Cost Optimization Tips:
  • Use Qwen for summaries (1 million tokens free!)
  • Run a local Whisper model (higher upfront cost, free usage)
  • Cache embeddings (do not regenerate them)
  • Lazy-load media
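
The embedding-cache tip can be sketched as a small memoizing wrapper. This is a hedged illustration: `EmbeddingCache` is a hypothetical name, `embed_fn` stands in for whatever embedding call is used, and the in-memory dictionary would likely be swapped for Redis or a database table in a real deployment.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by a hash of the whitespace-normalized input text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # hypothetical callable: str -> list[float]
        self._store = {}
        self.misses = 0  # number of times embed_fn was actually called

    @staticmethod
    def _key(text):
        # Collapse whitespace so trivially different copies share one entry.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Counting `misses` gives a rough view of how many paid API calls the cache is saving.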

⚠️ Challenges & Solutions

Challenge | Solution
Rate limiting (Instagram, TikTok) | Rotating proxies, exponential backoff, respect robots.txt
Private/deleted content | Graceful degradation, cache important metadata immediately
Multi-language content | Language detection, translate to English for embedding
Slow video processing | Async queue, parallel processing, cache transcripts
Spam/low-quality content | Quality scoring, minimum engagement threshold
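
For the rate-limiting row, exponential backoff could be sketched as below. The function name and parameters are assumptions, and the broad `except Exception` is a simplification; a real scraper would probably retry only on rate-limit or transient network errors.

```python
import random
import time


def retry_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: pick a random wait up to the cap so that parallel
            # workers do not all retry at the same moment.
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injectable so the behavior can be checked without actually waiting, and so a queue worker could substitute its own delay mechanism.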

🎬 Quick Start (MVP)

# 1. Clone & Setup
git clone https://github.com/yourname/content-aggregator.git
cd content-aggregator
cp .env.example .env
# Edit .env with your API keys

# 2. Start with Docker
docker-compose up -d

# 3. Setup WhatsApp
npm run whatsapp:pair
# Scan QR code

# 4. Access dashboard
open http://localhost:3000

Minimum Viable Product Features:

Start simple, iterate fast! 🚀