
Idea Scraper Workflow — Feature Spec

Status: Proposed
Date: 2026-04-22
Requester: Matt
Owner: Socrates (Backend) + Wadsworth (Coordination)
Priority: P2 (after v2 content pipeline stable)


Goal

Transform raw logs (terminal history, debugging sessions, agent conversations) into structured content briefs automatically. The system scrapes for struggle patterns, technical decisions, and household moments, then surfaces them as draft briefs for Matt to approve.


Data Sources

| Source | Location | Content Type | Scrape Frequency |
|---|---|---|---|
| Command logs | `~/.bash_history`, `~/.zsh_history` | Terminal commands, errors | Daily |
| OpenClaw memory | `memory/YYYY-MM-DD.md` | Agent conversations, decisions | Real-time (on write) |
| Agent workspaces | `workspace-*/memory/*.md` | Technical work, debugging | Hourly |
| Shared project docs | `shared/project-docs/*` | Specs, post-mortems | On commit |
| Git commits | Repo history | Code changes with messages | On push |
| Heartbeat state | `memory/heartbeat-state.json` | System events, failures | On alert |

Scrape Targets

Pattern: The Struggle

TRIGGER PHRASES:
- "It was [time] when..."
- "I thought [X] would work..."
- "I tried [Y] but..."
- "[N] hours later..."
- "Aundrea said..."
- "The [system] went down..."

TECHNICAL SIGNALS:
- Commands with non-zero exit codes
- Multiple sequential similar commands (trial/error)
- SSH sessions lasting >1 hour
- sudo commands at odd hours
- service restart loops
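
The technical signals above lend themselves to simple heuristics. A minimal sketch, assuming helper names of my choosing (the real logic would live in `blog/scraper/patterns.py` per the task table below):

```python
from collections import Counter
from datetime import time

def is_odd_hour(t: time) -> bool:
    """Treat 00:00-05:59 as 'odd hours' (threshold is an assumption)."""
    return t.hour < 6

def detect_retry_loops(commands: list[str], threshold: int = 3) -> list[str]:
    """Flag commands repeated >= threshold times -- the trial/error signal."""
    counts = Counter(cmd.strip() for cmd in commands)
    return [cmd for cmd, n in counts.items() if n >= threshold]
```

For example, a history containing `sudo pihole -g` three times in a row would surface it as a retry loop.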

Pattern: The Realization

- "Then I realized..."
- "The problem was..."
- "Turns out..."
- "What I missed..."
- Comments in code explaining "why this workaround"

Pattern: The Cost

- Timestamps spanning hours
- References to family members
- Sleep schedule disruption
- "I should have..." (hindsight)
- "Next time..." (lessons)
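
The three pattern families could be matched with plain regexes before any LLM pass. A sketch using an illustrative subset of the trigger phrases above (the full list would live in `blog/scraper/patterns.py`):

```python
import re

# Illustrative subset of the trigger phrases; not the complete production set.
TRIGGERS = {
    "struggle": [r"\bI thought .+ would work\b", r"\bI tried .+ but\b",
                 r"\b\d+ hours later\b"],
    "realization": [r"\bThen I realized\b", r"\bThe problem was\b",
                    r"\bTurns out\b"],
    "cost": [r"\bI should have\b", r"\bNext time\b"],
}

def match_patterns(text: str) -> set[str]:
    """Return which narrative pattern families an excerpt triggers."""
    hits = set()
    for label, patterns in TRIGGERS.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            hits.add(label)
    return hits
```

An excerpt like "Turns out I should have checked the adlist first" would match both the realization and cost families.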

Workflow

┌─────────────────────────────────────────────────────────────┐
│                    IDEA SCRAPER PIPELINE                    │
└─────────────────────────────────────────────────────────────┘

  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │  Source  │   │  Source  │   │  Source  │   │  Source  │
  │   Logs   │   │  Memory  │   │   Git    │   │ Commands │
  └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘
       │              │              │              │
       └──────────────┴──────────────┴──────────────┘
                      │
                      ▼
            ┌──────────────────┐
            │ Pattern Matcher  │  (Local LLM: phi4:14b)
            │ - Struggle       │
            │ - Realization    │
            │ - Cost           │
            └────────┬─────────┘
                     │
                     ▼
            ┌──────────────────┐
            │ Idea Candidate   │
            │ - Extracted      │
            │ - Structured     │
            │ - Scored         │
            └────────┬─────────┘
                     │
                     ▼
            ┌──────────────────┐
            │ Matt Review      │
            │ Telegram DM:     │
            │ "New idea: [X]   │
            │  Turn into       │
            │  brief? [Yes]    │
            │  [No] [Edit]"    │
            └────────┬─────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
         ▼                       ▼
    ┌─────────┐          ┌────────────┐
    │ Reject  │          │  Approve   │
    │  (log)  │          │            │
    └─────────┘          └─────┬──────┘
                               │
                               ▼
                    ┌──────────────────┐
                    │ Auto-create      │
                    │ Brief Draft      │
                    │ - Fill hook      │
                    │ - Fill struggle  │
                    │ - Leave gaps     │
                    │   for Matt       │
                    └────────┬─────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │ Content          │
                    │ Pipeline v2      │
                    │ (normal flow)    │
                    └──────────────────┘

Integration Points

With Content Pipeline v2

The scraper feeds directly into the v2 system:

  1. Idea → Brief: Scraper creates content_briefs_v2 row with status='idea'
  2. Matt review: Telegram DM for approval
  3. Brief completion: Matt fills gaps (the moment, the fix, reflection)
  4. Normal flow: Submit → Approve → Generate → Publish
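
Step 1 could be a single transaction. A minimal sketch, assuming `content_briefs_v2` has at least `(id, title, status)` columns and using `raw_excerpt` as a placeholder title until Matt edits it:

```python
import sqlite3
import uuid

def convert_idea(db: sqlite3.Connection, idea_id: str) -> str:
    """Create a content_briefs_v2 row with status='idea' and link it back."""
    (excerpt,) = db.execute(
        "SELECT raw_excerpt FROM content_ideas WHERE id = ?", (idea_id,)
    ).fetchone()
    brief_id = f"brief-{uuid.uuid4().hex[:8]}"
    db.execute(
        "INSERT INTO content_briefs_v2 (id, title, status) VALUES (?, ?, 'idea')",
        (brief_id, excerpt),
    )
    # Mark the idea as converted so it drops out of future digests.
    db.execute(
        "UPDATE content_ideas SET status = 'converted', brief_id = ? WHERE id = ?",
        (brief_id, idea_id),
    )
    return brief_id
```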

With Daily Heartbeat

The scraper runs as part of heartbeat:daily:
- Scans previous 24h of logs
- Generates candidate ideas
- Batches them into single digest DM
- "3 story ideas from yesterday's chaos. Review?"


Technical Requirements

Socrates Tasks

| Component | File | Purpose |
|---|---|---|
| Log parser | `blog/scraper/parser.py` | Extract structured events from logs |
| Pattern matcher | `blog/scraper/patterns.py` | Regex + heuristic detection |
| LLM scorer | `blog/scraper/scorer.py` | Local phi4:14b rates "story potential" |
| Idea formatter | `blog/scraper/formatter.py` | Convert to brief structure |
| Telegram digest | `blog/scraper/digest.py` | Batch and notify |

Data Storage

CREATE TABLE content_ideas (
    id TEXT PRIMARY KEY,
    source_type TEXT CHECK(source_type IN ('log', 'memory', 'git', 'command')),
    source_ref TEXT,  -- file path or git hash
    extracted_at TIMESTAMP,
    raw_excerpt TEXT,
    -- Structured fields
    struggle_indicator TEXT,
    cost_mentions JSON,
    technical_context TEXT,
    -- Scoring
    story_potential REAL,  -- 0-100 from LLM
    auto_score REAL,       -- heuristic
    -- Workflow
    status TEXT CHECK(status IN ('pending', 'approved', 'rejected', 'converted')),
    brief_id TEXT REFERENCES content_briefs_v2(id),
    reviewed_by TEXT,
    reviewed_at TIMESTAMP,
    notes TEXT
);
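
The CHECK constraints make bad rows fail fast. An abridged, standalone demonstration (the FK to `content_briefs_v2` is dropped so the snippet runs on its own):

```python
import json
import sqlite3

# Abridged version of the schema above, enough to exercise the constraints.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE content_ideas (
        id TEXT PRIMARY KEY,
        source_type TEXT CHECK(source_type IN ('log','memory','git','command')),
        raw_excerpt TEXT,
        cost_mentions JSON,
        status TEXT CHECK(status IN ('pending','approved','rejected','converted'))
    )
""")
db.execute(
    "INSERT INTO content_ideas VALUES (?, ?, ?, ?, ?)",
    ("idea-1", "command", "02:49 echo ...", json.dumps(["02:17", "02:47"]), "pending"),
)
try:
    # 'tweet' is not a valid source_type, so this insert is rejected.
    db.execute("INSERT INTO content_ideas VALUES ('idea-2', 'tweet', '', '[]', 'pending')")
except sqlite3.IntegrityError:
    print("CHECK constraint rejected unknown source_type")
```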

Example Output

From Log Entry

# ~/.bash_history excerpt
02:17  sudo systemctl restart pihole-FTL
02:18  tail -f /var/log/pihole/pihole.log
02:23  nano /etc/pihole/adlists.list
02:24  sudo pihole -g
02:45  sudo pihole -g
02:46  curl -I facebook.com
02:47  nano /etc/pihole/adlists.list  # comment out line
02:48  sudo pihole -g
02:49  echo "I am the monster I was hunting"
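
A parser for this format could be a couple of small functions. A sketch (note: stock bash history has no timestamps unless `HISTTIMEFORMAT` is set, so the `HH:MM` prefix is an assumption carried over from the excerpt):

```python
import re
from datetime import datetime

LINE_RE = re.compile(r"^(\d{2}:\d{2})\s+(.*)$")

def parse_history(lines: list[str]) -> list[tuple[str, str]]:
    """Parse 'HH:MM  command' lines into (time, command) events."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            events.append((m.group(1), m.group(2)))
    return events

def session_minutes(events: list[tuple[str, str]]) -> int:
    """Span between first and last event in minutes (same-day assumption)."""
    fmt = "%H:%M"
    first = datetime.strptime(events[0][0], fmt)
    last = datetime.strptime(events[-1][0], fmt)
    return int((last - first).total_seconds() // 60)
```

Run against the excerpt above, this yields a 32-minute late-night session, which is exactly the kind of span the struggle heuristics key on.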

Scraped Idea

{
  "id": "idea-20260422-0247",
  "source_type": "command",
  "source_ref": "~/.bash_history",
  "extracted_at": "2026-04-22T06:00:00Z",
  "raw_excerpt": "02:17 restart... 02:47 comment out line... 02:49 echo 'I am the monster...'",
  "struggle_indicator": "Multiple restart attempts, late hour, self-aware admission",
  "cost_mentions": ["02:17", "02:47", "hunting"],
  "technical_context": "Pi-hole DNS, adlist configuration, service restart",
  "story_potential": 87,
  "auto_score": 85,
  "suggested_title": "The Night I Broke DNS and Found the Monster",
  "suggested_category": "Home Lab Growing Pains"
}
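
The `auto_score` field could come from a cheap heuristic run before the LLM pass. A sketch with entirely hypothetical weights (the spec does not fix a formula):

```python
def auto_score(events: list[tuple[str, str]], matched_patterns: set[str]) -> float:
    """Cheap heuristic score (0-100) computed before the phi4:14b pass.

    Weights below are illustrative assumptions, not the production values.
    """
    score = 0.0
    hours = {int(t.split(":")[0]) for t, _ in events}
    if any(h < 6 for h in hours):           # odd-hour work
        score += 30
    cmds = [c for _, c in events]
    if len(cmds) - len(set(cmds)) >= 2:     # repeated commands (trial/error)
        score += 30
    score += 10 * len(matched_patterns)     # narrative trigger families hit
    return min(score, 100.0)
```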

Telegram Digest

3 Story Ideas from Yesterday's Chaos

  1. "The Night I Broke DNS" (87/100) — Pi-hole, 2 AM, self-aware monster quote
    [Turn into brief] [Skip]

  2. "Git History Archaeology" (62/100) — Recovering from a force-push
    [Turn into brief] [Skip]

  3. "The Email Pipeline that Wouldn't" (71/100) — IMAP quirks, retry loops
    [Turn into brief] [Skip]

[Review all] [Dismiss today's batch]
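
Rendering the digest is a pure formatting step in `blog/scraper/digest.py`. A minimal sketch that produces the shape above, sorted by score (field names follow the scraped-idea JSON; the one-line summaries are omitted for brevity):

```python
def format_digest(ideas: list[dict]) -> str:
    """Render scraped ideas as a daily Telegram digest."""
    lines = [f"{len(ideas)} Story Ideas from Yesterday's Chaos", ""]
    ranked = sorted(ideas, key=lambda d: -d["story_potential"])
    for i, idea in enumerate(ranked, 1):
        lines.append(f'{i}. "{idea["suggested_title"]}" ({idea["story_potential"]}/100)')
        lines.append("   [Turn into brief] [Skip]")
    lines.append("")
    lines.append("[Review all] [Dismiss today's batch]")
    return "\n".join(lines)
```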


Success Metrics

| Metric | Target | How Measured |
|---|---|---|
| Ideas surfaced / week | 5-10 | DB count, filtered by quality |
| Conversion to briefs | 30% | ideas → briefs created |
| False positive rate | <20% | Matt rejection rate |
| Time to publish | <48h | idea → published post |

Open Questions

  1. Privacy scope: Include Aundrea's terminal logs? Kids' commands? (Recommend: Matt-only initially)
  2. Sensitive data: Auto-redact passwords, tokens, IP addresses? (Recommend: yes, regex patterns)
  3. Frequency: Real-time (every log write) or batched (daily digest)? (Recommend: daily digest to avoid spam)
  4. Veto power: Can Matt permanently blacklist patterns? (Recommend: yes, ~/.config/idea-scraper/ignore-patterns.txt)
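
If question 2 is answered "yes", the redaction pass could run on every excerpt before storage. A sketch with illustrative, deliberately non-exhaustive patterns:

```python
import re

# Hypothetical redaction rules; a production list would cover far more cases.
REDACTIONS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[IP]"),
    (re.compile(r"(?i)\b(password|passwd|token|secret)=\S+"), r"\1=[REDACTED]"),
    (re.compile(r"\bgh[pousr]_[A-Za-z0-9]{20,}\b"), "[GH_TOKEN]"),
]

def redact(text: str) -> str:
    """Apply each redaction rule in order; unmatched text passes through."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```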

Implementation Phases

Phase 0: Foundation (Socrates)

  • SQLite schema for content_ideas
  • Log parser for bash history
  • Pattern matcher (regex only)
  • Basic Telegram notification

Phase 1: Intelligence (Socrates)

  • Local LLM scorer (phi4:14b)
  • Memory file parsing
  • Git commit scanning
  • Daily digest format

Phase 2: Refinement (Both)

  • Matt feedback loop
  • False positive tuning
  • Category auto-suggestion
  • Style matching to existing posts

Phase 3: Polish (Wadsworth)

  • Heartbeat integration
  • Weekly "best of" summaries
  • Archive/reject pattern learning
  • Export to other formats (Twitter threads, etc.)

Next Step: Socrates reviews feasibility after completing Content Pipeline v2 Phase 1.5


Requested by: Matt
Documented by: Wadsworth 📋