Idea Scraper Workflow — Feature Spec
Status: Proposed
Date: 2026-04-22
Requester: Matt
Owner: Socrates (Backend) + Wadsworth (Coordination)
Priority: P2 (after v2 content pipeline stable)
Goal
Transform raw logs (terminal history, debugging sessions, agent conversations) into structured content briefs automatically. The system scrapes for struggle patterns, technical decisions, and household moments, then surfaces them as draft briefs for Matt to approve.
Data Sources
| Source | Location | Content Type | Scrape Frequency |
|---|---|---|---|
| Command logs | `~/.bash_history`, `~/.zsh_history` | Terminal commands, errors | Daily |
| OpenClaw memory | `memory/YYYY-MM-DD.md` | Agent conversations, decisions | Real-time (on write) |
| Agent workspaces | `workspace-*/memory/*.md` | Technical work, debugging | Hourly |
| Shared project docs | `shared/project-docs/*` | Specs, post-mortems | On commit |
| Git commits | Repo history | Code changes with messages | On push |
| Heartbeat state | `memory/heartbeat-state.json` | System events, failures | On alert |
Scrape Targets
Pattern: The Struggle
TRIGGER PHRASES:
- "It was [time] when..."
- "I thought [X] would work..."
- "I tried [Y] but..."
- "[N] hours later..."
- "Aundrea said..."
- "The [system] went down..."
TECHNICAL SIGNALS:
- Commands with exit codes ≠ 0
- Multiple sequential similar commands (trial/error)
- SSH sessions lasting >1 hour
- sudo commands at odd hours
- service restart loops
Pattern: The Realization
- "Then I realized..."
- "The problem was..."
- "Turns out..."
- "What I missed..."
- Comments in code explaining "why this workaround"
Pattern: The Cost
- Timestamps spanning hours
- References to family members
- Sleep schedule disruption
- "I should have..." (hindsight)
- "Next time..." (lessons)
Workflow
```
┌─────────────────────────────────────────────────────────────┐
│ IDEA SCRAPER PIPELINE │
└─────────────────────────────────────────────────────────────┘
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Source │ │ Source │ │ Source │ │ Source │
│ Logs │ │ Memory │ │ Git │ │ Commands │
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │ │
└──────────────┴──────────────┴──────────────┘
│
▼
┌──────────────────┐
│ Pattern Matcher │ (Local LLM: phi4:14b)
│ - Struggle │
│ - Realization │
│ - Cost │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Idea Candidate │
│ - Extracted │
│ - Structured │
│ - Scored │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Matt Review │
│ Telegram DM: │
│ "New idea: [X] │
│ Turn into │
│ brief? [Yes] │
│ [No] [Edit]" │
└────────┬─────────┘
│
┌───────────┴───────────┐
│ │
▼ ▼
┌─────────┐ ┌────────────┐
│ Reject │ │ Approve │
│ (log) │ │ │
└─────────┘ └─────┬──────┘
│
▼
┌──────────────────┐
│ Auto-create │
│ Brief Draft │
│ - Fill hook │
│ - Fill struggle │
│ - Leave gaps │
│ for Matt │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Content │
│ Pipeline v2 │
│ (normal flow) │
         └──────────────────┘
```
Integration Points
With Content Pipeline v2
The scraper feeds directly into the v2 system:
- Idea → Brief: Scraper creates a `content_briefs_v2` row with `status='idea'`
- Matt review: Telegram DM for approval
- Brief completion: Matt fills gaps (the moment, the fix, reflection)
- Normal flow: Submit → Approve → Generate → Publish
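A hedged sketch of that handoff: the `content_briefs_v2` columns used here (`id`, `title`, `status`) and the brief id scheme are assumptions about the v2 schema, while the `content_ideas` side matches the Data Storage schema below.

```python
import sqlite3
import uuid

def convert_idea_to_brief(db: sqlite3.Connection, idea_id: str) -> str:
    """Create a status='idea' brief from an approved idea and link the rows."""
    (title,) = db.execute(
        "SELECT suggested_title FROM content_ideas WHERE id = ?", (idea_id,)
    ).fetchone()
    brief_id = f"brief-{uuid.uuid4().hex[:8]}"  # hypothetical id scheme
    db.execute(
        "INSERT INTO content_briefs_v2 (id, title, status) VALUES (?, ?, 'idea')",
        (brief_id, title),
    )
    db.execute(
        "UPDATE content_ideas SET status = 'converted', brief_id = ? WHERE id = ?",
        (brief_id, idea_id),
    )
    db.commit()
    return brief_id
```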
With Daily Heartbeat
The scraper runs as part of heartbeat:daily:
- Scans previous 24h of logs
- Generates candidate ideas
- Batches them into single digest DM
- "3 story ideas from yesterday's chaos. Review?"
Technical Requirements
Socrates Tasks
| Component | File | Purpose |
|---|---|---|
| Log parser | `blog/scraper/parser.py` | Extract structured events from logs |
| Pattern matcher | `blog/scraper/patterns.py` | Regex + heuristic detection |
| LLM scorer | `blog/scraper/scorer.py` | Local phi4:14b rates "story potential" |
| Idea formatter | `blog/scraper/formatter.py` | Convert to brief structure |
| Telegram digest | `blog/scraper/digest.py` | Batch and notify |
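For the scorer, a sketch assuming phi4:14b is served locally by Ollama: the `/api/generate` endpoint and payload shape are Ollama's, while the prompt and reply parsing are illustrative.

```python
import requests

def score_story_potential(excerpt: str) -> float:
    """Ask the local model for a 0-100 'story potential' rating."""
    prompt = (
        "Rate the blog-story potential of this log excerpt from 0 to 100. "
        "Reply with only the number.\n\n" + excerpt
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi4:14b", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["response"].strip()
    try:
        return max(0.0, min(100.0, float(text)))
    except ValueError:
        return 0.0  # unparseable reply: treat as no story potential
```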
Data Storage
```sql
CREATE TABLE content_ideas (
  id TEXT PRIMARY KEY,
  source_type TEXT CHECK(source_type IN ('log', 'memory', 'git', 'command')),
  source_ref TEXT,            -- file path or git hash
  extracted_at TIMESTAMP,
  raw_excerpt TEXT,
  -- Structured fields
  struggle_indicator TEXT,
  cost_mentions JSON,
  technical_context TEXT,
  suggested_title TEXT,       -- used by the example output below
  suggested_category TEXT,
  -- Scoring
  story_potential REAL,       -- 0-100 from LLM
  auto_score REAL,            -- heuristic
  -- Workflow
  status TEXT CHECK(status IN ('pending', 'approved', 'rejected', 'converted')),
  brief_id TEXT REFERENCES content_briefs_v2(id),
  reviewed_by TEXT,
  reviewed_at TIMESTAMP,
  notes TEXT
);
```
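Persisting a candidate is then a straight insert. A sketch, with `cost_mentions` serialized to text (SQLite's JSON "type" is just text affinity) and the dict shape matching the Scraped Idea example below:

```python
import json
import sqlite3

def save_idea(db: sqlite3.Connection, idea: dict) -> None:
    """Store a scraped candidate as a pending row in content_ideas."""
    db.execute(
        """INSERT INTO content_ideas
             (id, source_type, source_ref, extracted_at, raw_excerpt,
              struggle_indicator, cost_mentions, technical_context,
              suggested_title, suggested_category,
              story_potential, auto_score, status)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 'pending')""",
        (
            idea["id"], idea["source_type"], idea["source_ref"],
            idea["extracted_at"], idea["raw_excerpt"],
            idea["struggle_indicator"], json.dumps(idea["cost_mentions"]),
            idea["technical_context"], idea["suggested_title"],
            idea["suggested_category"], idea["story_potential"],
            idea["auto_score"],
        ),
    )
    db.commit()
```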
Example Output
From Log Entry
```bash
# ~/.bash_history excerpt
02:17 sudo systemctl restart pihole-FTL
02:18 tail -f /var/log/pihole/pihole.log
02:23 nano /etc/pihole/adlists.list
02:24 sudo pihole -g
02:45 sudo pihole -g
02:46 curl -I facebook.com
02:47 nano /etc/pihole/adlists.list # comment out line
02:48 sudo pihole -g
02:49 echo "I am the monster I was hunting"
```
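A parser sketch for this excerpt format (`HH:MM command`). Note that a stock `~/.bash_history` only carries timestamps when `HISTTIMEFORMAT` is set, so the real parser will need a fallback for untimestamped lines; the heuristics are the spec's odd-hour-sudo and trial/error signals.

```python
import re
from collections import Counter

LINE_RE = re.compile(r"^(\d{2}:\d{2})\s+(.*)$")

def parse_history(lines: list[str]) -> list[tuple[str, str]]:
    """Extract (timestamp, command) pairs, skipping untimestamped lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            events.append((m.group(1), m.group(2)))
    return events

def struggle_signals(events: list[tuple[str, str]]) -> list[str]:
    """Technical signals from the spec, as crude heuristics."""
    signals = []
    # Zero-padded HH:MM compares correctly as strings.
    if any(t < "06:00" and cmd.startswith("sudo") for t, cmd in events):
        signals.append("sudo at odd hours")
    repeats = Counter(cmd for _, cmd in events)
    if any(n >= 2 for n in repeats.values()):
        signals.append("repeated command (trial/error)")
    return signals
```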
Scraped Idea
```json
{
  "id": "idea-20260422-0247",
  "source_type": "command",
  "source_ref": "~/.bash_history",
  "extracted_at": "2026-04-22T06:00:00Z",
  "raw_excerpt": "02:17 restart... 02:47 comment out line... 02:49 echo 'I am the monster...'",
  "struggle_indicator": "Multiple restart attempts, late hour, self-aware admission",
  "cost_mentions": ["02:17", "02:47", "hunting"],
  "technical_context": "Pi-hole DNS, adlist configuration, service restart",
  "story_potential": 87,
  "auto_score": 85,
  "suggested_title": "The Night I Broke DNS and Found the Monster",
  "suggested_category": "Home Lab Growing Pains"
}
```
Telegram Digest
```
3 Story Ideas from Yesterday's Chaos

"The Night I Broke DNS" (87/100) — Pi-hole, 2 AM, self-aware monster quote
[Turn into brief] [Skip]

"Git History Archaeology" (62/100) — Recovering from a force-push
[Turn into brief] [Skip]

"The Email Pipeline that Wouldn't" (71/100) — IMAP quirks, retry loops
[Turn into brief] [Skip]

[Review all] [Dismiss today's batch]
```
Success Metrics
| Metric | Target | How Measured |
|---|---|---|
| Ideas surfaced / week | 5-10 | DB count, filtered by quality |
| Conversion to briefs | 30% | ideas → briefs created |
| False positive rate | <20% | Matt rejection rate |
| Time to publish | <48h | idea → published post |
Open Questions
- Privacy scope: Include Aundrea's terminal logs? Kids' commands? (Recommend: Matt-only initially)
- Sensitive data: Auto-redact passwords, tokens, and IP addresses? (Recommend: yes, via regex patterns; see the sketch after this list)
- Frequency: Real-time (every log write) or batched (daily digest)? (Recommend: daily digest to avoid spam)
- Veto power: Can Matt permanently blacklist patterns? (Recommend: yes, via `~/.config/idea-scraper/ignore-patterns.txt`)
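For the sensitive-data question, a redaction sketch that scrubs excerpts before anything is stored. The patterns are illustrative and deliberately over-eager; they would need tuning against real logs.

```python
import re

REDACTIONS = [
    # key=value / key: value style secrets
    (re.compile(r"(?i)(password|passwd|token|secret|api[_-]?key)\s*[=:]\s*\S+"),
     r"\1=<REDACTED>"),
    # IPv4 addresses
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    # GitHub-style token prefixes (ghp_, gho_, ghu_, ghs_, ghr_)
    (re.compile(r"\bgh[pousr]_[A-Za-z0-9]{20,}\b"), "<GITHUB_TOKEN>"),
]

def redact(text: str) -> str:
    """Apply every redaction pattern; safe to run before save_idea()."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```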
Implementation Phases
Phase 0: Foundation (Socrates)
- SQLite schema for `content_ideas`
- Log parser for bash history
- Pattern matcher (regex only)
- Basic Telegram notification
Phase 1: Intelligence (Socrates)
- Local LLM scorer (phi4:14b)
- Memory file parsing
- Git commit scanning
- Daily digest format
Phase 2: Refinement (Both)
- Matt feedback loop
- False positive tuning
- Category auto-suggestion
- Style matching to existing posts
Phase 3: Polish (Wadsworth)
- Heartbeat integration
- Weekly "best of" summaries
- Archive/reject pattern learning
- Export to other formats (Twitter threads, etc.)
Next Step: Socrates reviews feasibility after completing Content Pipeline v2 Phase 1.5
Requested by: Matt
Documented by: Wadsworth 📋