Executive Summary
This system is an intelligent content aggregation platform that collects links from WhatsApp groups, extracts metadata and transcripts, groups the content by topic (semantic clustering), and generates automatic summaries.
Core Value: Turn WhatsApp noise (dozens of links per day) into structured insights that can be consumed in 5 minutes.
Key Features:
- Auto-Ingestion: Real-time link monitoring from WhatsApp
- Metadata Extraction: Caption, description, thumbnail, engagement stats
- Speech-to-Text: Automatic audio/video transcription
- Semantic Clustering: Cross-platform topic modeling
- Auto-Summary: AI-generated summary per topic
- Structured Output: HTML dashboard, timeline view, summary feed
System Architecture
┌─────────────────────────────────────────────────────────────────┐
│                   CONTENT AGGREGATION SYSTEM                    │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  DATA SOURCES   │───▶│    INGESTION     │───▶│    PROCESSING    │
│                 │    │      LAYER       │    │      LAYER       │
├─────────────────┤    ├──────────────────┤    ├──────────────────┤
│ • WhatsApp API  │    │ • WhatsApp Bot   │    │ • Metadata       │
│ • Webhooks      │    │ • Link Parser    │    │   Extractor      │
│ • Manual Input  │    │ • Queue Manager  │    │ • S2T (Whisper)  │
│ • RSS Feeds     │    │ • Rate Limiter   │    │ • OCR (Tesseract)│
└─────────────────┘    └──────────────────┘    └──────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  OUTPUT LAYER   │◀───│    CLUSTERING    │◀───│     STORAGE      │
│                 │    │       & AI       │    │      LAYER       │
├─────────────────┤    ├──────────────────┤    ├──────────────────┤
│ • HTML Dashboard│    │ • Embedding      │    │ • PostgreSQL     │
│ • JSON API      │    │   (OpenAI)       │    │ • Redis Cache    │
│ • RSS Feed      │    │ • Topic Model    │    │ • S3/Object      │
│ • Notifications │    │ • LLM Summary    │    │ • Vector DB      │
└─────────────────┘    └──────────────────┘    └──────────────────┘
Data Flow:
Step 1: Ingestion
WhatsApp bot monitors groups → Capture link → Validate URL → Enqueue (Redis/RabbitMQ)
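The capture and validation part of this step can be sketched as a pure function that runs before anything is enqueued. This is a minimal sketch, not the bot's actual code: `extractLinks`, `detectPlatform`, and the `PLATFORM_HOSTS` map are hypothetical names, and the host list is an assumption based on the platforms named in the schema.

```javascript
// Hypothetical host-to-platform map; extend it as new sources appear.
const PLATFORM_HOSTS = {
  "instagram.com": "instagram",
  "twitter.com": "twitter",
  "x.com": "twitter",
  "tiktok.com": "tiktok",
  "facebook.com": "facebook",
  "fb.watch": "facebook",
};

function detectPlatform(hostname) {
  // Match the exact domain or any subdomain of it (www., m., vm., ...).
  const host = hostname.toLowerCase();
  for (const [domain, platform] of Object.entries(PLATFORM_HOSTS)) {
    if (host === domain || host.endsWith("." + domain)) return platform;
  }
  return "other";
}

function extractLinks(messageText) {
  const candidates = messageText.match(/https?:\/\/[^\s<>"']+/gi) ?? [];
  const seen = new Set();
  const links = [];
  for (const raw of candidates) {
    // Trailing punctuation usually belongs to the sentence, not the URL.
    const trimmed = raw.replace(/[.,!?;:)\]]+$/, "");
    let parsed;
    try {
      parsed = new URL(trimmed); // throws on malformed input
    } catch {
      continue;
    }
    if (seen.has(parsed.href)) continue; // skip repeats within one message
    seen.add(parsed.href);
    links.push({ url: parsed.href, platform: detectPlatform(parsed.hostname) });
  }
  return links;
}
```

Each returned object could then be pushed onto the queue as one job. Stripping trailing punctuation is a heuristic and will misread URLs that legitimately end in `)`.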
Step 2: Extraction
Worker processes the queue → Scrape metadata (Open Graph, JSON-LD) → Download media → Speech-to-text for video → OCR for images
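For the Open Graph part of the scrape, a rough sketch of reading `og:*` tags from an already-downloaded HTML string is shown here. `parseOpenGraph` is a hypothetical helper, and the regex approach is an assumption chosen to keep the example dependency-free; a production worker would more likely use a real HTML parser and would also decode HTML entities.

```javascript
// Collects og:* <meta> tags into a plain object, e.g. { title, description }.
function parseOpenGraph(html) {
  const metadata = {};
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of metaTags) {
    // Attribute order varies between sites, so match each one separately.
    const property = /\bproperty\s*=\s*["']og:([^"']+)["']/i.exec(tag);
    const content = /\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
    if (!property || !content) continue;
    const key = property[1].toLowerCase();
    // Keep the first occurrence; pages sometimes repeat tags such as og:image.
    if (!(key in metadata)) metadata[key] = content[1] ?? content[2];
  }
  return metadata;
}
```

The result maps naturally onto the `title`, `description`, and `thumbnail_url` columns, with the full object kept in `raw_metadata`.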
Step 3: Embedding & Clustering
Generate text embedding (OpenAI/local) → Semantic similarity → Hierarchical clustering → Automatic topic labeling
Step 4: Summarization
LLM (Qwen/GPT-4) reads the cluster → Generate summary → Extract key insights → Sentiment analysis
Step 5: Presentation
Update HTML dashboard → Generate RSS → Send notifications (Telegram/Email) → Archive to database
Database Schema
-- Requires the pgvector extension (provides the vector type used by clusters.topic_vector)
CREATE EXTENSION IF NOT EXISTS vector;
-- Links Table (Raw content)
CREATE TABLE links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url TEXT NOT NULL UNIQUE,
    platform VARCHAR(20) NOT NULL, -- instagram, twitter, tiktok, facebook
    source_chat_id VARCHAR(50),
    source_message_id VARCHAR(100),
    shared_by VARCHAR(100),
    shared_at TIMESTAMP DEFAULT NOW(),
    status VARCHAR(20) DEFAULT 'pending', -- pending, processing, completed, failed
    created_at TIMESTAMP DEFAULT NOW()
);
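Because `url` is declared UNIQUE, the same post shared twice with different tracking parameters would be stored as two rows unless URLs are canonicalized before insert. A minimal sketch of such a step follows; `normalizeUrl` and the `TRACKING_PARAMS` list are assumptions for illustration, and stripping `www.` or trailing slashes may not be safe for every site.

```javascript
// Tracking parameters to drop; this list is an assumption, not exhaustive.
const TRACKING_PARAMS = new Set(["fbclid", "igshid", "gclid"]);

function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = ""; // fragments never reach the server
  url.hostname = url.hostname.replace(/^www\./, "");
  // Copy the keys first so deleting does not disturb the iteration.
  for (const name of [...url.searchParams.keys()]) {
    if (name.startsWith("utm_") || TRACKING_PARAMS.has(name)) {
      url.searchParams.delete(name);
    }
  }
  url.searchParams.sort(); // make parameter order irrelevant
  // Treat "/p/abc/" and "/p/abc" as the same resource.
  if (url.pathname.length > 1) url.pathname = url.pathname.replace(/\/+$/, "");
  return url.toString();
}
```

Storing the normalized form in `url` lets the UNIQUE constraint act as the deduplication check.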
-- Content Table (Extracted metadata)
CREATE TABLE contents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    link_id UUID REFERENCES links(id),
    title TEXT,
    description TEXT,
    author VARCHAR(100),
    published_at TIMESTAMP,
    media_type VARCHAR(20), -- video, image, article
    media_url TEXT,
    thumbnail_url TEXT,
    transcript TEXT, -- S2T result
    raw_metadata JSONB,
    extracted_at TIMESTAMP DEFAULT NOW()
);
-- Clusters Table (Topic groups)
CREATE TABLE clusters (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(200) NOT NULL,
    description TEXT,
    topic_vector vector(1536), -- OpenAI embedding
    content_count INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
-- Cluster-Content Junction
CREATE TABLE cluster_contents (
    cluster_id UUID REFERENCES clusters(id),
    content_id UUID REFERENCES contents(id),
    similarity_score FLOAT,
    PRIMARY KEY (cluster_id, content_id)
);
-- Summaries Table (AI-generated)
CREATE TABLE summaries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    cluster_id UUID REFERENCES clusters(id),
    summary_text TEXT NOT NULL,
    key_insights JSONB,
    sentiment VARCHAR(20), -- positive, negative, neutral
    generated_by VARCHAR(50), -- model name
    generated_at TIMESTAMP DEFAULT NOW()
);
-- Feed/Dashboard Cache
CREATE TABLE feed_cache (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    feed_type VARCHAR(50), -- timeline, topic, trending
    content JSONB NOT NULL,
    generated_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP
);
Semantic Clustering Algorithm
Approach: Hierarchical Clustering with Embeddings
- Text Preprocessing: Clean and combine transcript, caption, and description
- Embedding Generation: OpenAI text-embedding-3-small (low cost) or a local model
- Similarity Matrix: Cosine similarity between content items
- Hierarchical Clustering: Agglomerative clustering with a dynamic threshold
- Topic Labeling: The LLM reads each cluster → generates a topic name automatically
- Re-clustering: Incremental clustering for new content (no full recomputation)
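To make the similarity and clustering steps concrete, here is a small in-memory sketch of agglomerative clustering over embedding vectors. It assumes average linkage and a fixed similarity threshold passed in by the caller (the dynamic threshold mentioned above is not modeled), and the function names are hypothetical. The naive pair search is cubic in the number of items, so it only suits small batches.

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Average-linkage similarity between two clusters of item indices.
function linkage(clusterA, clusterB, vectors) {
  let total = 0;
  for (const i of clusterA) {
    for (const j of clusterB) total += cosineSimilarity(vectors[i], vectors[j]);
  }
  return total / (clusterA.length * clusterB.length);
}

function agglomerativeCluster(vectors, threshold) {
  // Start with every item in its own cluster.
  let clusters = vectors.map((_, index) => [index]);
  while (clusters.length > 1) {
    // Find the most similar pair of clusters.
    let best = { similarity: -Infinity, a: -1, b: -1 };
    for (let a = 0; a < clusters.length; a++) {
      for (let b = a + 1; b < clusters.length; b++) {
        const similarity = linkage(clusters[a], clusters[b], vectors);
        if (similarity > best.similarity) best = { similarity, a, b };
      }
    }
    // Stop once the closest pair is no longer similar enough to merge.
    if (best.similarity < threshold) break;
    const merged = [...clusters[best.a], ...clusters[best.b]];
    clusters = clusters.filter((_, i) => i !== best.a && i !== best.b);
    clusters.push(merged);
  }
  return clusters; // arrays of indices into `vectors`
}
```

For example, `agglomerativeCluster(embeddings, 0.85)` returns groups of indices that the topic-labeling step could hand to the LLM one cluster at a time.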
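// Sketch of incremental clustering. Assumes `openai` is an OpenAI SDK client,
// `db` is a node-postgres style client (query() resolves to { rows }), and
// assignToCluster / createNewCluster are application helpers defined elsewhere.
async function clusterContent(newContent) {
  // 1. Generate embedding (the vector is nested inside the API response)
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: newContent.combinedText
  });
  const embedding = response.data[0].embedding;

  // 2. Find similar clusters (<=> is pgvector's cosine distance operator)
  const { rows: similarClusters } = await db.query(`
    SELECT id AS cluster_id, 1 - (topic_vector <=> $1::vector) AS similarity
    FROM clusters
    WHERE 1 - (topic_vector <=> $1::vector) > 0.85
    ORDER BY similarity DESC
    LIMIT 5
  `, [JSON.stringify(embedding)]); // pgvector accepts the '[x,y,...]' text form

  // 3. Assign to the best match above 0.90, otherwise start a new cluster
  if (similarClusters.length > 0 && similarClusters[0].similarity > 0.90) {
    await assignToCluster(newContent.id, similarClusters[0].cluster_id);
  } else {
    await createNewCluster(newContent, embedding);
  }
}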
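The function above assigns content to a cluster but does not show how `topic_vector` stays representative as the cluster grows. One option, offered here as an assumption rather than the system's actual method, is to keep `topic_vector` as the running mean of member embeddings, which can be updated from `content_count` without re-reading earlier members. `updateCentroid` is a hypothetical helper.

```javascript
// Running mean: next = old + (x - old) / (n + 1), applied per dimension.
function updateCentroid(centroid, count, embedding) {
  const nextCount = count + 1;
  const next = centroid.map(
    (value, i) => value + (embedding[i] - value) / nextCount
  );
  return { centroid: next, count: nextCount };
}
```

An `assignToCluster` implementation could call this with the stored `topic_vector` and `content_count`, then write both values back in the same transaction as the `cluster_contents` insert.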
Dashboard Views
1. Timeline View
Chronological stream of all incoming links, filterable by platform, date range, and keyword.
2. Topic/Cluster View
Content grouped by topic, showing:
- Topic name (AI-generated)
- Short summary
- Content count
- Platform breakdown (pie chart)
- Timeline within the topic
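The platform breakdown for a topic is a simple count per platform. A minimal sketch of the data a pie chart would need, assuming each item in the cluster carries the `platform` value from the `links` table (`platformBreakdown` is a hypothetical name):

```javascript
// Returns one entry per platform, largest first, with its share of the total.
function platformBreakdown(items) {
  const counts = {};
  for (const item of items) {
    counts[item.platform] = (counts[item.platform] ?? 0) + 1;
  }
  const total = items.length;
  return Object.entries(counts)
    .map(([platform, count]) => ({ platform, count, share: count / total }))
    .sort((a, b) => b.count - a.count);
}
```

In practice the same result could come from a `GROUP BY platform` query; the in-memory version is shown only to make the output shape explicit.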
3. Summary Feed
Daily/weekly digest:
- "Top 3 Topics Today"
- Trending hashtags
- Key insights per topic
- Highlights of important content
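A "Top 3 Topics Today" list needs some ranking rule. The sketch below assumes the simplest one: rank topics by how many links were shared in the last 24 hours. Both the rule and the in-memory shape (`{ name, sharedAt: Date[] }`) are assumptions for illustration, and `topTopics` is a hypothetical name.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// `clusters` is a hypothetical in-memory shape: { name, sharedAt: Date[] }.
function topTopics(clusters, now, limit = 3) {
  const cutoff = now.getTime() - DAY_MS;
  return clusters
    .map((cluster) => ({
      name: cluster.name,
      recentCount: cluster.sharedAt.filter((t) => t.getTime() >= cutoff).length,
    }))
    .filter((topic) => topic.recentCount > 0) // drop topics with no activity
    .sort((a, b) => b.recentCount - a.recentCount)
    .slice(0, limit);
}
```

Passing `now` explicitly keeps the function deterministic, which makes the feed easy to regenerate for a past date.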
4. Search & Discovery
- Full-text search (Elasticsearch)
- Semantic search (vector similarity)
- Filter: platform, date, sentiment, engagement
Quick Start (MVP)
# 1. Clone & Setup
git clone https://github.com/yourname/content-aggregator.git
cd content-aggregator
cp .env.example .env
# Edit .env with your API keys
# 2. Start with Docker
docker-compose up -d
# 3. Setup WhatsApp
npm run whatsapp:pair
# Scan the QR code
# 4. Access dashboard (macOS; on Linux use xdg-open)
open http://localhost:3000
Minimum Viable Product Features:
- WhatsApp bot connected
- Link detection & basic metadata
- Simple table view (URL, title, platform, timestamp)
- Search by keyword
Start simple, iterate fast!