# Idea Scraper Workflow — Feature Spec

**Status:** Proposed
**Date:** 2026-04-22
**Requester:** Matt
**Owner:** Socrates (Backend) + Wadsworth (Coordination)
**Priority:** P2 (after v2 content pipeline stable)

---

## Goal

Transform raw logs (terminal history, debugging sessions, agent conversations) into structured content briefs automatically. The system scrapes for struggle patterns, technical decisions, and household moments, then surfaces them as draft briefs for Matt to approve.

---

## Data Sources

| Source | Location | Content Type | Scrape Frequency |
|--------|----------|--------------|------------------|
| **Command logs** | `~/.bash_history`, `~/.zsh_history` | Terminal commands, errors | Daily |
| **OpenClaw memory** | `memory/YYYY-MM-DD.md` | Agent conversations, decisions | Real-time (on write) |
| **Agent workspaces** | `workspace-*/memory/*.md` | Technical work, debugging | Hourly |
| **Shared project docs** | `shared/project-docs/*` | Specs, post-mortems | On commit |
| **Git commits** | Repo history | Code changes with messages | On push |
| **Heartbeat state** | `memory/heartbeat-state.json` | System events, failures | On alert |

---

## Scrape Targets

### Pattern: The Struggle

```
TRIGGER PHRASES:
- "It was [time] when..."
- "I thought [X] would work..."
- "I tried [Y] but..."
- "[N] hours later..."
- "Aundrea said..."
- "The [system] went down..."

TECHNICAL SIGNALS:
- Commands with exit codes ≠ 0
- Multiple sequential similar commands (trial/error)
- SSH sessions lasting >1 hour
- sudo commands at odd hours
- service restart loops
```

### Pattern: The Realization

```
- "Then I realized..."
- "The problem was..."
- "Turns out..."
- "What I missed..."
- Comments in code explaining "why this workaround"
```

### Pattern: The Cost

```
- Timestamps spanning hours
- References to family members
- Sleep schedule disruption
- "I should have..." (hindsight)
- "Next time..." (lessons)
```

---

## Workflow

```
┌─────────────────────────────────────────────────────────────┐
│                    IDEA SCRAPER PIPELINE                    │
└─────────────────────────────────────────────────────────────┘

┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│  Source  │  │  Source  │  │  Source  │  │  Source  │
│   Logs   │  │  Memory  │  │   Git    │  │ Commands │
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │             │
     └─────────────┴──────┬──────┴─────────────┘
                          │
                          ▼
                 ┌──────────────────┐
                 │ Pattern Matcher  │  (Local LLM: phi4:14b)
                 │  - Struggle      │
                 │  - Realization   │
                 │  - Cost          │
                 └────────┬─────────┘
                          │
                          ▼
                 ┌──────────────────┐
                 │  Idea Candidate  │
                 │  - Extracted     │
                 │  - Structured    │
                 │  - Scored        │
                 └────────┬─────────┘
                          │
                          ▼
                 ┌──────────────────┐
                 │   Matt Review    │
                 │  Telegram DM:    │
                 │  "New idea: [X]  │
                 │   Turn into      │
                 │   brief? [Yes]   │
                 │   [No] [Edit]"   │
                 └────────┬─────────┘
                          │
              ┌───────────┴───────────┐
              │                       │
              ▼                       ▼
        ┌─────────┐            ┌────────────┐
        │ Reject  │            │  Approve   │
        │  (log)  │            └─────┬──────┘
        └─────────┘                  │
                                     ▼
                          ┌──────────────────┐
                          │   Auto-create    │
                          │   Brief Draft    │
                          │  - Fill hook     │
                          │  - Fill struggle │
                          │  - Leave gaps    │
                          │    for Matt      │
                          └────────┬─────────┘
                                   │
                                   ▼
                          ┌──────────────────┐
                          │     Content      │
                          │   Pipeline v2    │
                          │  (normal flow)   │
                          └──────────────────┘
```

---

## Integration Points

### With Content Pipeline v2

The scraper feeds directly into the v2 system:

1. **Idea → Brief:** Scraper creates a `content_briefs_v2` row with `status='idea'`
2. **Matt review:** Telegram DM for approval
3. **Brief completion:** Matt fills gaps (the moment, the fix, reflection)
4. **Normal flow:** Submit → Approve → Generate → Publish

### With Daily Heartbeat

The scraper runs as part of `heartbeat:daily`:

- Scans the previous 24h of logs
- Generates candidate ideas
- Batches them into a single digest DM
- "3 story ideas from yesterday's chaos. Review?"
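The trigger phrases and technical signals can be prototyped with plain regexes plus a small repetition heuristic before any LLM scoring is involved. A minimal Python sketch, with abbreviated phrase lists; the function names are illustrative, not a final `patterns.py` API:

```python
import re
from collections import Counter

# Abbreviated trigger-phrase lists from the "Struggle" and "Realization"
# patterns; the real lists would live in blog/scraper/patterns.py
# and be tuned over time.
STRUGGLE_TRIGGERS = [
    r"i thought .+ would work",
    r"i tried .+ but",
    r"\d+ hours? later",
    r"the \w+ went down",
]

REALIZATION_TRIGGERS = [
    r"then i realized",
    r"the problem was",
    r"turns out",
]


def match_patterns(text: str) -> dict:
    """Return which narrative patterns an excerpt triggers."""
    lowered = text.lower()
    return {
        "struggle": [p for p in STRUGGLE_TRIGGERS if re.search(p, lowered)],
        "realization": [p for p in REALIZATION_TRIGGERS if re.search(p, lowered)],
    }


def trial_and_error_score(commands: list[str]) -> int:
    """Count repeated identical commands: a crude proxy for the
    'multiple sequential similar commands' technical signal."""
    counts = Counter(commands)
    return sum(n - 1 for n in counts.values() if n > 1)
```

Repeated `pihole -g` runs at 2 AM would raise `trial_and_error_score` even when no trigger phrase appears, which is why the pipeline keeps heuristics alongside phrase matching.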
---

## Technical Requirements

### Socrates Tasks

| Component | File | Purpose |
|-----------|------|---------|
| Log parser | `blog/scraper/parser.py` | Extract structured events from logs |
| Pattern matcher | `blog/scraper/patterns.py` | Regex + heuristic detection |
| LLM scorer | `blog/scraper/scorer.py` | Local phi4:14b rates "story potential" |
| Idea formatter | `blog/scraper/formatter.py` | Convert to brief structure |
| Telegram digest | `blog/scraper/digest.py` | Batch and notify |

### Data Storage

```sql
CREATE TABLE content_ideas (
    id TEXT PRIMARY KEY,
    source_type TEXT CHECK(source_type IN ('log', 'memory', 'git', 'command')),
    source_ref TEXT,               -- file path or git hash
    extracted_at TIMESTAMP,
    raw_excerpt TEXT,

    -- Structured fields
    struggle_indicator TEXT,
    cost_mentions JSON,
    technical_context TEXT,

    -- Scoring
    story_potential REAL,          -- 0-100 from LLM
    auto_score REAL,               -- heuristic

    -- Workflow
    status TEXT CHECK(status IN ('pending', 'approved', 'rejected', 'converted')),
    brief_id TEXT REFERENCES content_briefs_v2(id),
    reviewed_by TEXT,
    reviewed_at TIMESTAMP,
    notes TEXT
);
```

---

## Example Output

### From Log Entry

```bash
# ~/.bash_history excerpt
02:17 sudo systemctl restart pihole-FTL
02:18 tail -f /var/log/pihole/pihole.log
02:23 nano /etc/pihole/adlists.list
02:24 sudo pihole -g
02:45 sudo pihole -g
02:46 curl -I facebook.com
02:47 nano /etc/pihole/adlists.list   # comment out line
02:48 sudo pihole -g
02:49 echo "I am the monster I was hunting"
```

### Scraped Idea

```json
{
  "id": "idea-20260422-0247",
  "source_type": "command",
  "source_ref": "~/.bash_history",
  "extracted_at": "2026-04-22T06:00:00Z",
  "raw_excerpt": "02:17 restart... 02:47 comment out line... 02:49 echo 'I am the monster...'",
  "struggle_indicator": "Multiple restart attempts, late hour, self-aware admission",
  "cost_mentions": ["02:17", "02:47", "hunting"],
  "technical_context": "Pi-hole DNS, adlist configuration, service restart",
  "story_potential": 87,
  "auto_score": 85,
  "suggested_title": "The Night I Broke DNS and Found the Monster",
  "suggested_category": "Home Lab Growing Pains"
}
```

### Telegram Digest

> **3 Story Ideas from Yesterday's Chaos**
>
> 1. **"The Night I Broke DNS"** (87/100) — *Pi-hole, 2 AM, self-aware monster quote*
>    [Turn into brief] [Skip]
>
> 2. **"Git History Archaeology"** (62/100) — *Recovering from a force-push*
>    [Turn into brief] [Skip]
>
> 3. **"The Email Pipeline that Wouldn't"** (71/100) — *IMAP quirks, retry loops*
>    [Turn into brief] [Skip]
>
> [Review all] [Dismiss today's batch]

---

## Success Metrics

| Metric | Target | How Measured |
|--------|--------|--------------|
| Ideas surfaced / week | 5-10 | DB count, filtered by quality |
| Conversion to briefs | 30% | ideas → briefs created |
| False positive rate | <20% | Matt rejection rate |
| Time to publish | <48h | idea → published post |

---

## Open Questions

1. **Privacy scope:** Include Aundrea's terminal logs? Kids' commands? (Recommend: Matt-only initially)
2. **Sensitive data:** Auto-redact passwords, tokens, IP addresses? (Recommend: yes, regex patterns)
3. **Frequency:** Real-time (every log write) or batched (daily digest)? (Recommend: daily digest to avoid spam)
4. **Veto power:** Can Matt permanently blacklist patterns? (Recommend: yes, `~/.config/idea-scraper/ignore-patterns.txt`)

---

## Implementation Phases

### Phase 0: Foundation (Socrates)
- SQLite schema for `content_ideas`
- Log parser for bash history
- Pattern matcher (regex only)
- Basic Telegram notification

### Phase 1: Intelligence (Socrates)
- Local LLM scorer (phi4:14b)
- Memory file parsing
- Git commit scanning
- Daily digest format

### Phase 2: Refinement (Both)
- Matt feedback loop
- False positive tuning
- Category auto-suggestion
- Style matching to existing posts

### Phase 3: Polish (Wadsworth)
- Heartbeat integration
- Weekly "best of" summaries
- Archive/reject pattern learning
- Export to other formats (Twitter threads, etc.)

---

**Next Step:** Socrates reviews feasibility after completing Content Pipeline v2 Phase 1.5

---

*Requested by: Matt*
*Documented by: Wadsworth 📋*
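
---

*Appendix (illustrative):* Open Question 2 recommends regex-based redaction of passwords, tokens, and IP addresses before an excerpt is stored as an idea candidate. A minimal sketch of such a pass; the pattern list and function name are assumptions for illustration, not an exhaustive secret-detection scheme:

```python
import re

# Example redaction patterns, per Open Question 2. A real deployment
# would extend this list (SSH keys, JWTs, internal hostnames, ...).
REDACTIONS = [
    # IPv4 addresses
    (re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b"), "[REDACTED-IP]"),
    # key=value / key: value style credentials
    (re.compile(r"(?i)(password|passwd|token|secret|api[_-]?key)\s*[:=]\s*\S+"),
     r"\1=[REDACTED]"),
]


def redact(text: str) -> str:
    """Replace sensitive substrings before an excerpt is persisted."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Running every `raw_excerpt` through `redact()` at extraction time means a leaked database row contains placeholders rather than credentials.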