Content Aggregation System - Implementation Plan

📋 Executive Summary

This system is an intelligent content aggregation platform that collects links from WhatsApp groups, extracts metadata and transcripts, groups content by topic (semantic clustering), and generates automatic summaries.

🎯 Core Value: Turn WhatsApp noise (dozens of links per day) into structured insights that can be consumed in 5 minutes.

Key Features:

📥 Auto-Ingestion: Real-time link monitoring from WhatsApp
🔍 Metadata Extraction: Caption, description, thumbnail, engagement stats
🎙️ Speech-to-Text: Automatic audio/video transcription
🧠 Semantic Clustering: Cross-platform topic modeling
📝 Auto-Summary: AI-generated summary per topic
📊 Structured Output: HTML dashboard, timeline view, summary feed

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                   CONTENT AGGREGATION SYSTEM                    │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  DATA SOURCES   │───▶│    INGESTION     │───▶│    PROCESSING    │
│                 │    │      LAYER       │    │      LAYER       │
├─────────────────┤    ├──────────────────┤    ├──────────────────┤
│ • WhatsApp API  │    │ • WhatsApp Bot   │    │ • Metadata       │
│ • Webhooks      │    │ • Link Parser    │    │   Extractor      │
│ • Manual Input  │    │ • Queue Manager  │    │ • S2T (Whisper)  │
│ • RSS Feeds     │    │ • Rate Limiter   │    │ • OCR (Tesseract)│
└─────────────────┘    └──────────────────┘    └──────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  OUTPUT LAYER   │◀───│    CLUSTERING    │◀───│     STORAGE      │
│                 │    │      & AI        │    │      LAYER       │
├─────────────────┤    ├──────────────────┤    ├──────────────────┤
│ • HTML Dashboard│    │ • Embedding      │    │ • PostgreSQL     │
│ • JSON API      │    │   (OpenAI)       │    │ • Redis Cache    │
│ • RSS Feed      │    │ • Topic Model    │    │ • S3/Object      │
│ • Notifications │    │ • LLM Summary    │    │ • Vector DB      │
└─────────────────┘    └──────────────────┘    └──────────────────┘

Data Flow:

Step 1: Ingestion 🔄

WhatsApp bot monitors the group → Capture link → Validate URL → Enqueue (Redis/RabbitMQ)
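The capture and validation steps can be sketched in a few lines. This is a minimal illustration, not the bot's actual code: the function name `extract_links` and the rule that a host must contain a dot are assumptions made for the example, and the queue push is left out.

```python
import re
from urllib.parse import urlparse

# Matches http(s) URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_links(message_text: str) -> list[str]:
    """Return the valid, de-duplicated http(s) URLs found in a chat message."""
    links: list[str] = []
    for raw in URL_PATTERN.findall(message_text):
        # Trim punctuation that commonly trails a URL in running text.
        candidate = raw.rstrip(".,;:!?)\"'")
        parsed = urlparse(candidate)
        # Assumption: a usable URL has an http(s) scheme and a dotted host.
        if parsed.scheme in ("http", "https") and "." in parsed.netloc:
            if candidate not in links:
                links.append(candidate)
    return links
```

Each returned URL would then be pushed onto the queue as one job, so a message containing several links fans out into several extraction tasks.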

Step 2: Extraction 🔍

Worker processes the queue → Scrape metadata (Open Graph, JSON-LD) → Download media → Speech-to-text for video → OCR for images
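For the Open Graph part of the scrape, the standard library is enough for a first pass. The sketch below assumes the page HTML has already been downloaded; `OpenGraphParser` and `extract_open_graph` are hypothetical names, and JSON-LD parsing is not covered.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> tags from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Strip the "og:" prefix and keep the first value seen per property.
            self.metadata.setdefault(prop[3:], content)


def extract_open_graph(html: str) -> dict[str, str]:
    """Return a dict such as {"title": ..., "image": ...} from raw HTML."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.metadata
```

The resulting dict maps naturally onto the `title`, `description`, and `thumbnail_url` columns, with the full dict stored in `raw_metadata`.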

Step 3: Embedding & Clustering 🧠

Generate text embedding (OpenAI/Local) → Semantic similarity → Hierarchical clustering → Automatic topic labeling

Step 4: Summarization 📝

LLM (Qwen/GPT-4) reads the cluster → Generate summary → Extract key insights → Sentiment analysis
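One way to feed a cluster to the LLM is to assemble a single prompt that asks for all three outputs at once. The sketch below only builds the prompt text and calls no model; the function name, the prompt wording, and the 500-character truncation limit are assumptions chosen for illustration.

```python
def build_summary_prompt(topic: str, items: list[dict], max_chars_per_item: int = 500) -> str:
    """Assemble one prompt asking for a summary, key insights, and sentiment."""
    lines = [
        f"Topic: {topic}",
        "Summarize the items below in 3-5 sentences, list the key insights,",
        "and classify the overall sentiment as positive, negative, or neutral.",
        "",
    ]
    for index, item in enumerate(items, start=1):
        # Prefer the transcript; fall back to the description when there is none.
        text = item.get("transcript") or item.get("description") or ""
        # Truncate long texts so one item cannot dominate the context window.
        title = item.get("title", "Untitled")
        lines.append(f"[{index}] {title}: {text[:max_chars_per_item]}")
    return "\n".join(lines)
```

Requesting the sentiment as one of three fixed labels keeps the response easy to map onto the `sentiment` column of the `summaries` table.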

Step 5: Presentation 📊

Update the HTML dashboard → Generate RSS → Send notifications (Telegram/Email) → Archive to the database
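The RSS step needs no extra dependency. The following sketch renders topic summaries as a minimal RSS 2.0 document with `xml.etree.ElementTree`; the function name and the `topic` / `summary_text` dict keys are assumptions for the example.

```python
import xml.etree.ElementTree as ET


def build_rss(feed_title: str, feed_link: str, summaries: list[dict]) -> str:
    """Render topic summaries as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Topic summaries"
    for summary in summaries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = summary["topic"]
        ET.SubElement(item, "description").text = summary["summary_text"]
    # ElementTree escapes &, <, and > in text nodes automatically.
    return ET.tostring(rss, encoding="unicode")
```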

🛠️ Technology Stack

🎭 Frontend

Next.js 14 (App Router)
Tailwind CSS + shadcn/ui
React Query (data fetching)
D3.js / Recharts (visualization)
PWA support (mobile)

⚙️ Backend

Node.js + Express / Fastify
Python + FastAPI (ML services)
Bull Queue (Redis) for jobs
WebSocket (real-time updates)
Bull Board (queue monitoring)

🗄️ Database & Storage

PostgreSQL (relational data)
Redis (cache & queue)
Pinecone/Chroma (vector DB)
MinIO/S3 (media storage)
Elasticsearch (full-text search)

🤖 AI / ML

OpenAI API (embeddings, summary)
Whisper API (speech-to-text)
Qwen (local/cheap summarization)
Hugging Face (topic modeling)
yt-dlp, gallery-dl (scrapers)

📱 WhatsApp Integration

whatsmeow (Go library)
Baileys (Node.js - alt)
WhatsApp Business API (official)
Webhook server (Express)
QR code pairing

🚀 DevOps

Docker + Docker Compose
Nginx (reverse proxy)
PM2 (process manager)
Cron jobs (scheduled tasks)
Grafana + Prometheus (monitoring)

📐 Database Schema

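-- Required for the vector(1536) column in the clusters table (pgvector)
CREATE EXTENSION IF NOT EXISTS vector;

-- Links Table (Raw content)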
CREATE TABLE links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url TEXT NOT NULL UNIQUE,
    platform VARCHAR(20) NOT NULL, -- instagram, twitter, tiktok, facebook
    source_chat_id VARCHAR(50),
    source_message_id VARCHAR(100),
    shared_by VARCHAR(100),
    shared_at TIMESTAMP DEFAULT NOW(),
    status VARCHAR(20) DEFAULT 'pending', -- pending, processing, completed, failed
    created_at TIMESTAMP DEFAULT NOW()
);

-- Content Table (Extracted metadata)
CREATE TABLE contents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    link_id UUID REFERENCES links(id),
    title TEXT,
    description TEXT,
    author VARCHAR(100),
    published_at TIMESTAMP,
    media_type VARCHAR(20), -- video, image, article
    media_url TEXT,
    thumbnail_url TEXT,
    transcript TEXT, -- S2T result
    raw_metadata JSONB,
    extracted_at TIMESTAMP DEFAULT NOW()
);

-- Clusters Table (Topic groups)
CREATE TABLE clusters (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(200) NOT NULL,
    description TEXT,
    topic_vector vector(1536), -- OpenAI embedding
    content_count INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Cluster-Content Junction
CREATE TABLE cluster_contents (
    cluster_id UUID REFERENCES clusters(id),
    content_id UUID REFERENCES contents(id),
    similarity_score FLOAT,
    PRIMARY KEY (cluster_id, content_id)
);

-- Summaries Table (AI-generated)
CREATE TABLE summaries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    cluster_id UUID REFERENCES clusters(id),
    summary_text TEXT NOT NULL,
    key_insights JSONB,
    sentiment VARCHAR(20), -- positive, negative, neutral
    generated_by VARCHAR(50), -- model name
    generated_at TIMESTAMP DEFAULT NOW()
);

-- Feed/Dashboard Cache
CREATE TABLE feed_cache (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    feed_type VARCHAR(50), -- timeline, topic, trending
    content JSONB NOT NULL,
    generated_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP
);
            

🔌 Platform-Specific Extraction

Platform	Library	Data Extracted	S2T/OCR
📸 Instagram	instaloader, gallery-dl	Caption, hashtags, likes, comments	Video: Whisper
🐦 Twitter/X	ntscraper, snscrape	Tweet text, media, engagement	Video: Whisper
🎵 TikTok	yt-dlp (TikTok extractor), TikTok-Api	Caption, hashtags, music, stats	Video: Whisper (required)
📘 Facebook	facebook-scraper	Post text, reactions, shares	Video: Whisper
▶️ YouTube	yt-dlp, youtube-transcript-api	Title, description, captions	Caption/Whisper
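Before a worker can pick the right library from the table above, the link has to be classified by platform. A minimal sketch, assuming classification by hostname is sufficient; the host list is illustrative rather than exhaustive, and `detect_platform` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Domain -> platform label (the value stored in links.platform).
# Assumption: this host list is illustrative, not exhaustive.
PLATFORM_HOSTS = {
    "instagram.com": "instagram",
    "twitter.com": "twitter",
    "x.com": "twitter",
    "tiktok.com": "tiktok",
    "facebook.com": "facebook",
    "fb.watch": "facebook",
    "youtube.com": "youtube",
    "youtu.be": "youtube",
}


def detect_platform(url: str) -> str:
    """Map a URL to a platform label, or "unknown" if no host matches."""
    host = (urlparse(url).hostname or "").lower()
    for domain, platform in PLATFORM_HOSTS.items():
        # Match the domain itself or any subdomain (www., m., vm., ...).
        if host == domain or host.endswith("." + domain):
            return platform
    return "unknown"
```

Matching on `"." + domain` rather than a bare suffix avoids false positives such as `notx.com` being classified as Twitter/X.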

🧠 Semantic Clustering Algorithm

Approach: Hierarchical Clustering with Embeddings

Text Preprocessing: Clean transcript + caption + description
Embedding Generation: OpenAI text-embedding-3-small (cheap) or a local model
Similarity Matrix: Cosine similarity between content items
Hierarchical Clustering: Agglomerative clustering with a dynamic threshold
Topic Labeling: LLM reads the cluster → generates a topic name automatically
Re-clustering: Incremental clustering for new content (no full recompute)

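// Clustering sketch. Assumes `openai` is an OpenAI SDK client, `db` is a
// node-postgres Pool, and the pgvector extension is installed.
async function clusterContent(newContent) {
    // 1. Generate the embedding (the vector lives at data[0].embedding)
    const response = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: newContent.combinedText
    });
    const embedding = response.data[0].embedding;

    // 2. Find candidate clusters. <=> is pgvector's cosine distance,
    //    so 1 - distance is the cosine similarity.
    const { rows: similarClusters } = await db.query(`
        SELECT id AS cluster_id, 1 - (topic_vector <=> $1::vector) AS similarity
        FROM clusters
        WHERE 1 - (topic_vector <=> $1::vector) > 0.85
        ORDER BY similarity DESC
        LIMIT 5
    `, [JSON.stringify(embedding)]); // pgvector accepts the '[1,2,3]' text form

    // 3. Assign to the best match only above the stricter 0.90 threshold;
    //    otherwise create a new cluster
    if (similarClusters.length > 0 && similarClusters[0].similarity > 0.90) {
        await assignToCluster(newContent.id, similarClusters[0].cluster_id);
    } else {
        await createNewCluster(newContent, embedding);
    }
}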
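The assign-or-create logic can also be tried without a database. The Python sketch below keeps clusters in memory and updates each centroid as a running mean, which is one simple way to get the incremental behaviour described above without recomputing everything. The dict layout (`centroid`, `count`) and the function names are assumptions; this is nearest-centroid assignment, not full agglomerative clustering.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_incrementally(clusters: list[dict], embedding: list[float],
                         threshold: float = 0.90) -> int:
    """Add `embedding` to the most similar cluster, or open a new one.

    Each cluster is {"centroid": [...], "count": n}. Returns the cluster index.
    """
    best_index, best_score = -1, -1.0
    for index, cluster in enumerate(clusters):
        score = cosine_similarity(cluster["centroid"], embedding)
        if score > best_score:
            best_index, best_score = index, score
    if best_index >= 0 and best_score > threshold:
        cluster = clusters[best_index]
        n = cluster["count"]
        # Running mean: shift the centroid toward the new member
        # without revisiting the earlier ones.
        cluster["centroid"] = [
            (c * n + e) / (n + 1) for c, e in zip(cluster["centroid"], embedding)
        ]
        cluster["count"] = n + 1
        return best_index
    clusters.append({"centroid": list(embedding), "count": 1})
    return len(clusters) - 1
```

Because centroids drift as members are added, a periodic full re-clustering pass may still be worthwhile to merge clusters that have converged.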
            

📊 Dashboard Views

1. Timeline View 📅

Chronological stream of all incoming links, filterable by platform, date range, and keyword.

2. Topic/Cluster View 🗂️

Content grouped by topic, with:

Topic name (AI-generated)
Short summary
Content count
Platform breakdown (pie chart)
Timeline within the topic

3. Summary Feed 📝

Daily/weekly summary:

"Top 3 Topics Today"
Trending hashtags
Key insights per topic
Highlights of important content
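The first two feed items reduce to counting and sorting. A minimal sketch, assuming captions are available as plain strings and clusters as dicts carrying the `name` and `content_count` fields from the schema; both function names are hypothetical.

```python
import re
from collections import Counter

HASHTAG_PATTERN = re.compile(r"#(\w+)")


def trending_hashtags(captions: list[str], top_n: int = 3) -> list[tuple[str, int]]:
    """Count hashtags case-insensitively and return the most frequent ones."""
    counts = Counter(
        tag.lower()
        for caption in captions
        for tag in HASHTAG_PATTERN.findall(caption)
    )
    return counts.most_common(top_n)


def top_topics(clusters: list[dict], top_n: int = 3) -> list[str]:
    """Return the names of the clusters with the highest content_count."""
    ranked = sorted(clusters, key=lambda c: c["content_count"], reverse=True)
    return [c["name"] for c in ranked[:top_n]]
```

For a true "today" ranking, the counts would need to be restricted to content ingested within the day rather than the cluster's lifetime total.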

4. Search & Discovery 🔍

Full-text search (Elasticsearch)
Semantic search (vector similarity)
Filter: platform, date, sentiment, engagement

🚀 Implementation Phases

Phase	Duration	Deliverables	Priority
Phase 1: MVP	2-3 weeks	WhatsApp bot + basic extraction + simple dashboard	HIGH
Phase 2: Clustering	2 weeks	Embedding + clustering + topic labeling	HIGH
Phase 3: AI Summary	1-2 weeks	Auto-summary + insights + sentiment	MEDIUM
Phase 4: Advanced	2 weeks	Search, notifications, PWA, analytics	LOW

💰 Cost Estimation

Infrastructure (VPS)

Server: 4 CPU, 8GB RAM, 100GB SSD = ~$20-40/month
Database: PostgreSQL + Redis (self-hosted)
Storage: MinIO local / S3-compatible ($5-10/month)

API Costs (per 1000 links)

OpenAI Embeddings: ~$0.10 (text-embedding-3-small)
Whisper S2T: ~$0.36 per hour of audio ($0.006/min)
LLM Summary: ~$0.50-2.00 (Qwen is cheaper)
Total per 1000 links: ~$2-5
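The total depends mostly on how much audio gets transcribed, so it helps to make that explicit. The sketch below combines the per-unit figures listed above; the share of links that are videos and the average video length are assumptions the caller must supply, and `estimate_api_cost` is a hypothetical helper.

```python
def estimate_api_cost(
    links: int,
    video_share: float,
    avg_video_minutes: float,
    embedding_cost_per_1000: float = 0.10,
    whisper_cost_per_minute: float = 0.006,
    summary_cost_per_1000: float = 0.50,
) -> float:
    """Estimate API spend in USD for a batch of links."""
    batches = links / 1000
    # Only video links incur Whisper cost, billed per minute of audio.
    audio_minutes = links * video_share * avg_video_minutes
    total = (
        batches * embedding_cost_per_1000
        + audio_minutes * whisper_cost_per_minute
        + batches * summary_cost_per_1000
    )
    return round(total, 2)
```

With 1000 links, 30% of them videos averaging one minute, this gives $2.40 at the low summary price and $3.90 at the high one (`summary_cost_per_1000=2.00`), both inside the ~$2-5 range quoted above.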

💡 Cost Optimization Tips:
Use Qwen for summaries (1 million free tokens)
Run a local Whisper model (higher upfront cost, free usage)
Cache embeddings (never regenerate them)
Lazy loading for media

            

⚠️ Challenges & Solutions

Challenge	Solution
Rate limiting (Instagram, TikTok)	Rotating proxies, exponential backoff, respect robots.txt
Private/deleted content	Graceful degradation, cache important metadata immediately
Multi-language content	Language detection, translate to English before embedding
Slow video processing	Async queue, parallel processing, cache transcripts
Spam/low-quality content	Quality scoring, minimum engagement threshold
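Of the solutions in the table, exponential backoff is the easiest to get subtly wrong, so here is a minimal sketch with full jitter. `fetch_with_backoff` and its parameters are hypothetical; the `sleep` argument is injectable only so the function can be tested without waiting, and a real worker would catch the scraper's rate-limit error rather than every exception.

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts: int = 5, base_delay: float = 1.0,
                       max_delay: float = 60.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds, doubling the wait cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:  # Assumption: narrow this to rate-limit errors in practice.
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Cap grows 1s, 2s, 4s, ... up to max_delay. Full jitter picks a
            # random wait below the cap so parallel workers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```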

🎬 Quick Start (MVP)

# 1. Clone & Setup
git clone https://github.com/yourname/content-aggregator.git
cd content-aggregator
cp .env.example .env
# Edit .env and add your API keys

# 2. Start with Docker
docker-compose up -d

# 3. Setup WhatsApp
npm run whatsapp:pair
# Scan QR code

# 4. Access dashboard
open http://localhost:3000
            

Minimum Viable Product Features:

✅ WhatsApp bot connected
✅ Link detection & basic metadata
✅ Simple table view (URL, title, platform, timestamp)
✅ Search by keyword

Start simple, iterate fast! 🚀
