Idea Scraper Workflow — Feature Spec
Status: Proposed
Date: 2026-04-22
Requester: Matt
Owner: Socrates (Backend) + Wadsworth (Coordination)
Priority: P2 (after v2 content pipeline stable)
Goal
Transform raw logs (terminal history, debugging sessions, agent conversations) into structured content briefs automatically. The system scrapes for struggle patterns, technical decisions, and household moments, then surfaces them as draft briefs for Matt to approve.
Data Sources
| Source | Location | Content Type | Scrape Frequency |
|---|---|---|---|
| Command logs | `~/.bash_history`, `~/.zsh_history` | Terminal commands, errors | Daily |
| OpenClaw memory | `memory/YYYY-MM-DD.md` | Agent conversations, decisions | Real-time (on write) |
| Agent workspaces | `workspace-*/memory/*.md` | Technical work, debugging | Hourly |
| Shared project docs | `shared/project-docs/*` | Specs, post-mortems | On commit |
| Git commits | Repo history | Code changes with messages | On push |
| Heartbeat state | `memory/heartbeat-state.json` | System events, failures | On alert |
Scrape Targets
Pattern: The Struggle
TRIGGER PHRASES:
- "It was [time] when..."
- "I thought [X] would work..."
- "I tried [Y] but..."
- "[N] hours later..."
- "Aundrea said..."
- "The [system] went down..."
TECHNICAL SIGNALS:
- Commands with exit codes ≠ 0
- Multiple sequential similar commands (trial/error)
- SSH sessions lasting >1 hour
- sudo commands at odd hours
- service restart loops
Pattern: The Realization
- "Then I realized..."
- "The problem was..."
- "Turns out..."
- "What I missed..."
- Comments in code explaining "why this workaround"
Pattern: The Cost
- Timestamps spanning hours
- References to family members
- Sleep schedule disruption
- "I should have..." (hindsight)
- "Next time..." (lessons)
Workflow
```
┌─────────────────────────────────────────────────────────────┐
│ IDEA SCRAPER PIPELINE │
└─────────────────────────────────────────────────────────────┘
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Source │ │ Source │ │ Source │ │ Source │
│ Logs │ │ Memory │ │ Git │ │ Commands │
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │ │
└──────────────┴──────────────┴──────────────┘
│
▼
┌──────────────────┐
│ Pattern Matcher │ (Local LLM: phi4:14b)
│ - Struggle │
│ - Realization │
│ - Cost │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Idea Candidate │
│ - Extracted │
│ - Structured │
│ - Scored │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Matt Review │
│ Telegram DM: │
│ "New idea: [X] │
│ Turn into │
│ brief? [Yes] │
│ [No] [Edit]" │
└────────┬─────────┘
│
┌───────────┴───────────┐
│ │
▼ ▼
┌─────────┐ ┌────────────┐
│ Reject │ │ Approve │
│ (log) │ │ │
└─────────┘ └─────┬──────┘
│
▼
┌──────────────────┐
│ Auto-create │
│ Brief Draft │
│ - Fill hook │
│ - Fill struggle │
│ - Leave gaps │
│ for Matt │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Content │
│ Pipeline v2 │
│ (normal flow) │
         └──────────────────┘
```
Integration Points
With Content Pipeline v2
The scraper feeds directly into the v2 system:
- Idea → Brief: Scraper creates a `content_briefs_v2` row with `status='idea'`
- Matt review: Telegram DM for approval
- Brief completion: Matt fills gaps (the moment, the fix, reflection)
- Normal flow: Submit → Approve → Generate → Publish
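A hedged sketch of that handoff: the `content_briefs_v2` columns used here (`id`, `title`, `status`) and the brief id scheme are assumptions about the v2 schema, while the `content_ideas` side matches the Data Storage schema below.

```python
import sqlite3
import uuid

def convert_idea_to_brief(db: sqlite3.Connection, idea_id: str) -> str:
    """Create a status='idea' brief from an approved idea and link the rows."""
    (title,) = db.execute(
        "SELECT suggested_title FROM content_ideas WHERE id = ?", (idea_id,)
    ).fetchone()
    brief_id = f"brief-{uuid.uuid4().hex[:8]}"  # hypothetical id scheme
    db.execute(
        "INSERT INTO content_briefs_v2 (id, title, status) VALUES (?, ?, 'idea')",
        (brief_id, title),
    )
    db.execute(
        "UPDATE content_ideas SET status = 'converted', brief_id = ? WHERE id = ?",
        (brief_id, idea_id),
    )
    db.commit()
    return brief_id
```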
With Daily Heartbeat
The scraper runs as part of heartbeat:daily:
- Scans previous 24h of logs
- Generates candidate ideas
- Batches them into single digest DM
- "3 story ideas from yesterday's chaos. Review?"
Technical Requirements
Socrates Tasks
| Component | File | Purpose |
|---|---|---|
| Log parser | `blog/scraper/parser.py` | Extract structured events from logs |
| Pattern matcher | `blog/scraper/patterns.py` | Regex + heuristic detection |
| LLM scorer | `blog/scraper/scorer.py` | Local phi4:14b rates "story potential" |
| Idea formatter | `blog/scraper/formatter.py` | Convert to brief structure |
| Telegram digest | `blog/scraper/digest.py` | Batch and notify |
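For the scorer, a sketch assuming phi4:14b is served locally by Ollama: the `/api/generate` endpoint and payload shape are Ollama's, while the prompt and reply parsing are illustrative.

```python
import requests

def score_story_potential(excerpt: str) -> float:
    """Ask the local model for a 0-100 'story potential' rating."""
    prompt = (
        "Rate the blog-story potential of this log excerpt from 0 to 100. "
        "Reply with only the number.\n\n" + excerpt
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi4:14b", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["response"].strip()
    try:
        return max(0.0, min(100.0, float(text)))
    except ValueError:
        return 0.0  # unparseable reply: treat as no story potential
```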
Data Storage
```sql
CREATE TABLE content_ideas (
  id TEXT PRIMARY KEY,
  source_type TEXT CHECK(source_type IN ('log', 'memory', 'git', 'command')),
  source_ref TEXT,            -- file path or git hash
  extracted_at TIMESTAMP,
  raw_excerpt TEXT,
  -- Structured fields
  struggle_indicator TEXT,
  cost_mentions JSON,
  technical_context TEXT,
  suggested_title TEXT,       -- used by the example output below
  suggested_category TEXT,
  -- Scoring
  story_potential REAL,       -- 0-100 from LLM
  auto_score REAL,            -- heuristic
  -- Workflow
  status TEXT CHECK(status IN ('pending', 'approved', 'rejected', 'converted')),
  brief_id TEXT REFERENCES content_briefs_v2(id),
  reviewed_by TEXT,
  reviewed_at TIMESTAMP,
  notes TEXT
);
```
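Persisting a candidate is then a straight insert. A sketch, with `cost_mentions` serialized to text (SQLite's JSON "type" is just text affinity) and the dict shape matching the Scraped Idea example below:

```python
import json
import sqlite3

def save_idea(db: sqlite3.Connection, idea: dict) -> None:
    """Store a scraped candidate as a pending row in content_ideas."""
    db.execute(
        """INSERT INTO content_ideas
             (id, source_type, source_ref, extracted_at, raw_excerpt,
              struggle_indicator, cost_mentions, technical_context,
              suggested_title, suggested_category,
              story_potential, auto_score, status)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 'pending')""",
        (
            idea["id"], idea["source_type"], idea["source_ref"],
            idea["extracted_at"], idea["raw_excerpt"],
            idea["struggle_indicator"], json.dumps(idea["cost_mentions"]),
            idea["technical_context"], idea["suggested_title"],
            idea["suggested_category"], idea["story_potential"],
            idea["auto_score"],
        ),
    )
    db.commit()
```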
Example Output
From Log Entry
```bash
# ~/.bash_history excerpt
02:17 sudo systemctl restart pihole-FTL
02:18 tail -f /var/log/pihole/pihole.log
02:23 nano /etc/pihole/adlists.list
02:24 sudo pihole -g
02:45 sudo pihole -g
02:46 curl -I facebook.com
02:47 nano /etc/pihole/adlists.list # comment out line
02:48 sudo pihole -g
02:49 echo "I am the monster I was hunting"
```
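A parser sketch for this excerpt format (`HH:MM command`). Note that a stock `~/.bash_history` only carries timestamps when `HISTTIMEFORMAT` is set, so the real parser will need a fallback for untimestamped lines; the heuristics are the spec's odd-hour-sudo and trial/error signals.

```python
import re
from collections import Counter

LINE_RE = re.compile(r"^(\d{2}:\d{2})\s+(.*)$")

def parse_history(lines: list[str]) -> list[tuple[str, str]]:
    """Extract (timestamp, command) pairs, skipping untimestamped lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            events.append((m.group(1), m.group(2)))
    return events

def struggle_signals(events: list[tuple[str, str]]) -> list[str]:
    """Technical signals from the spec, as crude heuristics."""
    signals = []
    # Zero-padded HH:MM compares correctly as strings.
    if any(t < "06:00" and cmd.startswith("sudo") for t, cmd in events):
        signals.append("sudo at odd hours")
    repeats = Counter(cmd for _, cmd in events)
    if any(n >= 2 for n in repeats.values()):
        signals.append("repeated command (trial/error)")
    return signals
```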
Scraped Idea
```json
{
  "id": "idea-20260422-0247",
  "source_type": "command",
  "source_ref": "~/.bash_history",
  "extracted_at": "2026-04-22T06:00:00Z",
  "raw_excerpt": "02:17 restart... 02:47 comment out line... 02:49 echo 'I am the monster...'",
  "struggle_indicator": "Multiple restart attempts, late hour, self-aware admission",
  "cost_mentions": ["02:17", "02:47", "hunting"],
  "technical_context": "Pi-hole DNS, adlist configuration, service restart",
  "story_potential": 87,
  "auto_score": 85,
  "suggested_title": "The Night I Broke DNS and Found the Monster",
  "suggested_category": "Home Lab Growing Pains"
}
```
Telegram Digest
```
3 Story Ideas from Yesterday's Chaos

"The Night I Broke DNS" (87/100) — Pi-hole, 2 AM, self-aware monster quote
[Turn into brief] [Skip]

"Git History Archaeology" (62/100) — Recovering from a force-push
[Turn into brief] [Skip]

"The Email Pipeline that Wouldn't" (71/100) — IMAP quirks, retry loops
[Turn into brief] [Skip]

[Review all] [Dismiss today's batch]
```
Success Metrics
| Metric | Target | How Measured |
|---|---|---|
| Ideas surfaced / week | 5-10 | DB count, filtered by quality |
| Conversion to briefs | 30% | ideas → briefs created |
| False positive rate | <20% | Matt rejection rate |
| Time to publish | <48h | idea → published post |
Open Questions
- Privacy scope: Include Aundrea's terminal logs? Kids' commands? (Recommend: Matt-only initially)
- Sensitive data: Auto-redact passwords, tokens, and IP addresses? (Recommend: yes, via regex patterns; see the sketch after this list)
- Frequency: Real-time (every log write) or batched (daily digest)? (Recommend: daily digest to avoid spam)
- Veto power: Can Matt permanently blacklist patterns? (Recommend: yes, via `~/.config/idea-scraper/ignore-patterns.txt`)
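For the sensitive-data question, a redaction sketch that scrubs excerpts before anything is stored. The patterns are illustrative and deliberately over-eager; they would need tuning against real logs.

```python
import re

REDACTIONS = [
    # key=value / key: value style secrets
    (re.compile(r"(?i)(password|passwd|token|secret|api[_-]?key)\s*[=:]\s*\S+"),
     r"\1=<REDACTED>"),
    # IPv4 addresses
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    # GitHub-style token prefixes (ghp_, gho_, ghu_, ghs_, ghr_)
    (re.compile(r"\bgh[pousr]_[A-Za-z0-9]{20,}\b"), "<GITHUB_TOKEN>"),
]

def redact(text: str) -> str:
    """Apply every redaction pattern; safe to run before save_idea()."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```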
Implementation Phases
Phase 0: Foundation (Socrates)
- SQLite schema for `content_ideas`
- Log parser for bash history
- Pattern matcher (regex only)
- Basic Telegram notification
Phase 1: Intelligence (Socrates)
- Local LLM scorer (phi4:14b)
- Memory file parsing
- Git commit scanning
- Daily digest format
Phase 2: Refinement (Both)
- Matt feedback loop
- False positive tuning
- Category auto-suggestion
- Style matching to existing posts
Phase 3: Polish (Wadsworth)
- Heartbeat integration
- Weekly "best of" summaries
- Archive/reject pattern learning
- Export to other formats (Twitter threads, etc.)
Next Step: Socrates reviews feasibility after completing Content Pipeline v2 Phase 1.5
Requested by: Matt
Documented by: Wadsworth 📋