Content Generation Pipeline - Technical Specification
Author: Daedalus
Date: 2026-04-21 (revised)
Status: APPROVED - Ready for Implementation
Assignee: Socrates (Backend/Infrastructure)
Revision Notes
2026-04-21: Addressed Wadsworth review feedback:
- Reordered stages: Structure now validates angle before Draft (catch problems early)
- Updated latency: ~2.4 hours realistic runtime (not 12 minutes)
- Added overnight/batch processing pattern
- Added Gaming PC availability check + wake-on-LAN support
Executive Summary
A tiered model routing system for HoffDesk blog content generation. Cloud models (GLM 5.1) handle strategy and polish. Local models (Gaming PC 3080 Ti) handle volume draft generation. Target: 10 posts/month at ~$3 cloud cost vs ~$30 all-cloud.
Architecture Overview
                            CONTENT PIPELINE

  STAGE 1: STRATEGY       STAGE 2: STRUCTURE      STAGE 3: DRAFT
  +----------------+      +----------------+      +----------------+
  |    GLM 5.1     |      |    GLM 5.1     |      |   Local 32B    |
  |    (Cloud)     |      |    (Cloud)     |      |  (Gaming PC)   |
  |   ~2K tokens   |      |   ~3K tokens   |      |  ~80K tokens   |
  +--------+-------+      +--------+-------+      +--------+-------+
           |                       |                       |
           v                       v                       v
  +----------------+      +----------------+      +----------------+
  |    Brief +     |----->|   Validated    |----->|     Draft      |
  |    Outline     |      |   Structure    |      |    (local)     |
  +----------------+      +----------------+      +--------+-------+
                                                           |
  STAGE 5: POLISH         STAGE 4: REVISION                |
  +----------------+      +----------------+               |
  |    GLM 5.1     |      |   Local 32B    |<--------------+
  |    (Cloud)     |<-----|  (Gaming PC)   |
  |   ~2K tokens   |      |  ~40K tokens   |
  +--------+-------+      +----------------+
           |
           v
  +----------------+
  |   Final Post   |
  +----------------+
Stage Order Rationale: Structure comes before Draft to validate the angle with cheap cloud inference before burning 90 minutes of GPU time on a draft that might be off-target. Catch direction problems early.
Hardware Requirements
Gaming PC (Local Inference Target)
| Component | Requirement | Notes |
|---|---|---|
| GPU | RTX 3080 Ti (10GB VRAM) | Already available |
| RAM | 32GB+ system RAM | For model offloading if needed |
| Storage | 50GB free SSD | Model weights + cache |
| Network | Tailscale connected | Route from titanium-butler |
| OS | Windows 11 or Linux | Socrates preference |
Software Stack
Option A: LocalAI (Recommended)
LocalAI provides OpenAI-compatible API, runs on CPU/GPU, supports multiple backends.
# local-ai-config.yaml
localai:
  api:
    bind: "0.0.0.0:8080"
    cors: true
  models:
    # Primary draft model - Qwen2.5 32B 4-bit
    - name: "draft-generator"
      backend: "llama-cpp"
      model: "/models/qwen2.5-32b-instruct-q4_k_m.gguf"
      context_size: 8192
      threads: 8
      f16: true
      gpu_layers: 35  # Max for 10GB VRAM
    # Fallback for shorter tasks
    - name: "fast-revision"
      backend: "llama-cpp"
      model: "/models/llama-3.1-8b-instruct-q4_k_m.gguf"
      context_size: 4096
      threads: 8
      gpu_layers: 35
Option B: vLLM (Higher throughput, needs more VRAM)
If Qwen2.5-32B-AWQ fits in 10GB VRAM (doubtful: vLLM's paging manages the KV cache, not the weights, and the AWQ weights alone are roughly 18GB):
# vllm serve command
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-32B-Instruct-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--port 8000
Option C: Ollama (Simplest, less efficient)
# ollama setup
ollama pull qwen2.5:32b
ollama serve
Recommendation: Start with Ollama for rapid testing, migrate to LocalAI for production pipeline.
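Because all three options expose an OpenAI-compatible endpoint, the pipeline code should not need to change when migrating from Ollama to LocalAI; only the base URL and model name do. A minimal sketch using the openai Python package (port and model name are placeholders that depend on which backend is running):

# Works against Ollama (default port 11434) or LocalAI (8080) - only base_url/model change
from openai import OpenAI

client = OpenAI(
    base_url="http://gaming-pc.tailXXXX.ts.net:11434/v1",  # Ollama; use :8080/v1 for LocalAI
    api_key="sk-local-no-key-required",  # local backends ignore the key
)

response = client.chat.completions.create(
    model="qwen2.5:32b",  # Ollama tag; "draft-generator" under the LocalAI config above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)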
Tailscale Integration
Networking Setup
titanium-butler (Beelink)                     Gaming PC (3080 Ti)
          |                                            |
          |       Tailscale network (100.x.x.x)        |
          +--------------------------------------------+
                               |
                               v
                     +-------------------+
                     |   LocalAI API     |  <- http://gaming-pc:8080/v1
                     |   (or vLLM)       |
                     +-------------------+
Service Discovery
# pipeline/config.py
import os

LOCAL_LLM_HOST = "gaming-pc.tailXXXX.ts.net"  # Tailscale hostname
LOCAL_LLM_PORT = 8080
LOCAL_LLM_API_KEY = "sk-local-no-key-required"  # LocalAI default
CLOUD_LLM_API_KEY = os.getenv("GLM_API_KEY")
CLOUD_LLM_BASE_URL = "https://api.glm.ai/v1"
Pipeline Implementation
# Stage Orchestrator
def stage_strategy(topic, content_type, memory_query): ...
def stage_structure(strategy_output): ... # NEW: Validate before Draft
def stage_draft(validated_brief, content_type): ...
def stage_revision(draft_output, edit_notes): ...
def stage_polish(revision_output, content_type): ...
Stage Orchestrator
# content/pipeline.py
from typing import Literal, Optional

async def run_pipeline(
    topic: str,
    content_type: Literal["roundup", "how_i_solved", "build_log", "essay"],
    memory_query: Optional[str] = None
) -> dict:
    """
    Execute the 5-stage pipeline for content generation.

    REVISED ORDER: Structure is validated before the expensive Draft stage.
    Typical runtime: 2-3 hours (designed for overnight/batch processing).
    Returns the final post with metadata and cost breakdown.
    """
    # Stage 1: Strategy (Cloud - GLM 5.1) - ~1 min
    strategy = stage_strategy(topic, content_type, memory_query)

    # Stage 2: Structure (Cloud - GLM 5.1) - ~1 min
    # Validate the angle BEFORE burning GPU time
    structure = stage_structure(strategy.output_text)

    # Stage 3: Draft (Local - Gaming PC) - ~90 min
    draft = stage_draft(structure.validated_brief, content_type)

    # Stage 4: Revision (Local - Gaming PC) - ~45 min
    revision = stage_revision(draft.output_text, structure.edit_notes)

    # Stage 5: Polish (Cloud - GLM 5.1) - ~1 min
    final = stage_polish(revision.output_text, content_type)

    return {
        "final_post": final.output_text,
        # ... rest unchanged
    }
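One detail the orchestrator above leaves implicit is the fail-fast behavior from the Stage Order Rationale: if Stage 2 rejects the angle, the run should stop before any GPU time is spent. A minimal sketch of that guard, assuming a hypothetical angle_approved field on the Structure result and a PipelineAborted exception (neither is defined elsewhere in this spec):

# Hypothetical fail-fast guard between Stage 2 and Stage 3 (names are illustrative)
class PipelineAborted(Exception):
    """Raised when the Structure stage rejects the angle before GPU time is spent."""

def guard_structure(structure) -> None:
    # angle_approved is an assumed field on the Stage 2 result object
    if not getattr(structure, "angle_approved", True):
        raise PipelineAborted(f"Angle rejected at Structure stage: {structure.edit_notes}")

# In run_pipeline, this would sit between stage_structure() and stage_draft():
#     structure = stage_structure(strategy.output_text)
#     guard_structure(structure)   # abort here instead of ~90 min into the Draft
#     draft = stage_draft(structure.validated_brief, content_type)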
Local LLM Client
# content/local_client.py
import logging
import subprocess
from typing import Iterator, Optional

import httpx

logger = logging.getLogger(__name__)


class LocalLLMClient:
    """
    Client for local inference on the Gaming PC via Tailscale.
    Handles connection retries, timeouts, error fallback, and wake-on-LAN.
    """

    def __init__(
        self,
        base_url: str = "http://gaming-pc.tailXXXX.ts.net:8080",
        model: str = "draft-generator",
        timeout: int = 300,  # 5 min for long generations
        tailscale_hostname: str = "gaming-pc"
    ):
        # Async client: all request methods below are awaited
        self.client = httpx.AsyncClient(base_url=base_url, timeout=timeout)
        self.model = model
        self.tailscale_hostname = tailscale_hostname

    async def check_availability(self) -> dict:
        """
        Check if the Gaming PC is online and the model is loaded.

        Returns: {"available": bool, "status": str, "wake_lan_possible": bool}
        """
        try:
            response = await self.client.get("/ready", timeout=5.0)
            if response.status_code == 200:
                return {
                    "available": True,
                    "status": "ready",
                    "wake_lan_possible": False
                }
        except (httpx.ConnectError, httpx.TimeoutException):
            # Service unreachable - check if the host itself answers a ping
            ping_result = subprocess.run(
                ["ping", "-c", "1", "-W", "2", self.tailscale_hostname],
                capture_output=True
            )
            if ping_result.returncode != 0:
                return {
                    "available": False,
                    "status": "offline",
                    "wake_lan_possible": True  # Could try WoL
                }
        # Host is up but the inference service is not responding
        return {
            "available": False,
            "status": "online_but_service_down",
            "wake_lan_possible": False
        }

    async def wake_if_needed(self, mac_address: Optional[str] = None) -> bool:
        """
        Attempt wake-on-LAN if the Gaming PC is offline.
        Returns True if a wake packet was sent (doesn't guarantee boot).
        """
        if not mac_address:
            return False
        # Send WoL magic packet
        subprocess.run(["wakeonlan", mac_address], check=False)
        logger.info(f"Wake-on-LAN sent to {mac_address}")
        return True

    async def generate_with_availability_check(
        self,
        prompt: str,
        max_tokens: int = 2048,
        temperature: float = 0.7,
        auto_wake: bool = False,
        mac_address: Optional[str] = None
    ) -> str:
        """
        Generate with a pre-flight availability check. Optionally wake the PC if offline.
        """
        status = await self.check_availability()
        if not status["available"]:
            if auto_wake and status["wake_lan_possible"] and mac_address:
                await self.wake_if_needed(mac_address)
                raise LocalLLMUnavailable(
                    "Gaming PC offline. Wake-on-LAN sent. Retry in 2 minutes."
                )
            else:
                logger.warning("Local LLM unavailable, falling back to cloud")
                return await cloud_client.generate(prompt, max_tokens, temperature)
        return await self.generate(prompt, max_tokens, temperature, stream=False)


class LocalLLMUnavailable(Exception):
    """Raised when the local LLM is unreachable and fallback is disabled."""
    pass
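A usage sketch for the client above, as it might be called from a draft stage. The MAC address and the requeue policy in the except branch are assumptions, not part of the client:

# Example call site - illustrative only; the MAC address and retry policy are assumptions
import asyncio

async def draft_section(prompt: str) -> str:
    client = LocalLLMClient(model="draft-generator")
    try:
        return await client.generate_with_availability_check(
            prompt,
            max_tokens=2048,
            auto_wake=True,
            mac_address="AA:BB:CC:DD:EE:FF",  # placeholder; see the wake-on-LAN open question
        )
    except LocalLLMUnavailable:
        # Caller decides what to do: here we let the overnight scheduler requeue the job
        raise

# asyncio.run(draft_section("Write the opening section of the post..."))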
Model Configuration
Recommended Models for 10GB VRAM
| Model | Size | Quantization | VRAM | Speed | Use Case |
|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 32B | Q4_K_M | ~9GB | ~15 tok/s | Primary draft gen |
| Llama-3.3-70B | 70B | Q4_K_M | ~40GB | N/A | Won't fit, skip |
| Llama-3.1-8B | 8B | Q4_K_M | ~5GB | ~40 tok/s | Fast revisions |
| DeepSeek-Coder-V2 | 16B | Q4_K_M | ~10GB | ~20 tok/s | Code-heavy posts |
Download commands:
# Qwen2.5-32B (primary)
huggingface-cli download bartowski/Qwen2.5-32B-Instruct-GGUF \
--include "*Q4_K_M.gguf" --local-dir ./models
# Llama-3.1-8B (fast fallback)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--include "*Q4_K_M.gguf" --local-dir ./models
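Back-of-envelope check on the table above (assuming Q4_K_M averages roughly 4.8 bits per weight, which is an approximation): a 32B model lands around 19GB on disk, so it cannot sit entirely in 10GB of VRAM. That is why the LocalAI config caps gpu_layers at 35 and why the 32GB system RAM requirement matters for offload.

# Rough size estimate for a Q4_K_M quantized 32B model (~4.8 bits/weight is an assumption)
params = 32e9
bits_per_weight = 4.8
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB of weights")  # ~19 GB -> partial GPU offload on a 10 GB card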
Prompt Templates
Stage 1: Strategy (Cloud)
{# templates/prompts/stage_strategy.txt #}
You are a content strategist for a technical blog about home infrastructure,
AI agents, and the messy reality of building things.
TOPIC: {{ topic }}
CONTENT TYPE: {{ content_type }}
RECENT CONTEXT:
{{ memory_context }}
Your task: Create a brief and outline for a blog post.
OUTPUT FORMAT:
- Headline options (3)
- Angle: One sentence on why this matters now
- Key moment: The human trigger (e.g., "7 PM, Aundrea asks if internet is down")
- Structure: 5-7 bullet outline
- Technical details needed: What specific errors/configs/logs to include
- Key takeaway: What the reader learns
Be specific. Generic angles get rejected.
Stage 2: Structure (Cloud)
{# templates/prompts/stage_structure.txt #}
You are a developmental editor. Review this brief against good blog structure.
BRIEF:
{{ strategy_output }}
Provide:
1. Structural validation: Does this angle work? (yes/no + why)
2. Outline critique: Specific improvements to the 5-7 bullet structure
3. Red flags: Is the human moment clear? Is the takeaway actionable?
If the angle is weak, say so. Better to fail fast here than after 90 min of GPU time.
Stage 3: Draft (Local)
{# templates/prompts/stage_draft.txt #}
Write a blog post following this exact structure. Match the voice:
- Start with the human moment, not the solution
- Admit wrong turns before revealing the fix
- Use specific versions, error messages, timestamps
- End with actionable takeaways
BRIEF:
{{ strategy_output }}
TEMPLATE STRUCTURE:
{{ content_template }}
VOICE EXAMPLES:
{{ voice_samples }}
Write 800-1200 words. Do not use AI-speak. Do not say "leverage" or "unlock".
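The {{ }} / {# #} syntax above is Jinja-style. A minimal rendering sketch, assuming the templates live under templates/prompts/ as named in their headers; the module name and function are illustrative, the variable names come from the templates:

# content/prompts.py - illustrative Jinja2 rendering of the stage templates
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates/prompts"))

def render_strategy_prompt(topic: str, content_type: str, memory_context: str = "") -> str:
    template = env.get_template("stage_strategy.txt")
    return template.render(
        topic=topic,
        content_type=content_type,
        memory_context=memory_context,
    )

# prompt = render_strategy_prompt("DNS incident April 2026", "how_i_solved")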
Batch Processing & Overnight Pattern
Given the 2-3 hour runtime, design for queue-based generation:
# content/scheduler.py
import asyncio
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


class OvernightPipelineScheduler:
    """
    Queue posts to generate during low-activity hours.
    Designed to run overnight while the Gaming PC is idle.
    """

    def __init__(self, start_hour=22, end_hour=6):
        self.start_hour = start_hour  # 10 PM
        self.end_hour = end_hour      # 6 AM
        self.queue = []

    def add_to_queue(self, topic, content_type):
        self.queue.append({
            "topic": topic,
            "content_type": content_type,
            "added_at": datetime.now(),
            "status": "pending"
        })

    async def run_nightly_batch(self):
        """
        Check PC availability, wake it if needed, then process the queue.
        """
        status = await local_client.check_availability()
        if not status["available"] and status["wake_lan_possible"]:
            await local_client.wake_if_needed(mac_address=GAMING_PC_MAC)
            logger.info("Gaming PC woken for overnight batch. Waiting 2 min...")
            await asyncio.sleep(120)  # Wait for boot

        for job in self.queue:
            if job["status"] == "pending":
                try:
                    result = await run_pipeline(
                        topic=job["topic"],
                        content_type=job["content_type"]
                    )
                    job["status"] = "completed"
                    job["result"] = result
                    job["completed_at"] = datetime.now()
                except Exception as e:
                    job["status"] = "failed"
                    job["error"] = str(e)
Usage pattern:
# Queue posts throughout the day
python -m content.queue --add "DNS incident April 2026" --type how_i_solved
python -m content.queue --add "OpenClaw tutorial: custom skills" --type tutorial
# Run overnight (cron job at 10 PM)
python -m content.queue --run-overnight
Error Handling & Fallbacks
| Failure Mode | Detection | Response |
|---|---|---|
| Gaming PC offline | Connection timeout | Fallback to cloud, queue for retry |
| Local model OOM | CUDA out of memory | Switch to smaller model (8B), retry |
| Nonsense output | Repetition detector | Regenerate with higher temp, flag for review |
| Stage timeout | 5min+ no response | Cancel, fallback to cloud, alert |
| Cloud API rate limit | 429 response | Exponential backoff, queue for later |
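For the cloud rate-limit row, a minimal exponential backoff sketch. The retry count and base delay are arbitrary placeholders, and it assumes the cloud client surfaces non-2xx responses as httpx.HTTPStatusError (cloud_client is the same assumed client referenced in local_client.py):

# Illustrative backoff wrapper for 429s from the cloud API (parameters are placeholders)
import asyncio
import httpx

async def generate_with_backoff(prompt: str, max_retries: int = 5, base_delay: float = 2.0) -> str:
    for attempt in range(max_retries):
        try:
            return await cloud_client.generate(prompt)
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code != 429:
                raise
            delay = base_delay * (2 ** attempt)  # 2s, 4s, 8s, 16s, 32s
            await asyncio.sleep(delay)
    raise RuntimeError("Cloud API still rate-limited after retries; queue for later")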
Monitoring & Metrics
Latency Estimates (Revised)
Reality check: 80K tokens of local inference at ~15 tok/s works out to ~90 minutes for the Draft stage alone.
| Stage | Model | Tokens | Speed | Time |
|---|---|---|---|---|
| 1. Strategy | GLM 5.1 (cloud) | 2K | fast | ~1 min |
| 2. Structure | GLM 5.1 (cloud) | 3K | fast | ~1 min |
| 3. Draft | Qwen2.5-32B (local) | 80K | ~15 tok/s | ~89 min |
| 4. Revision | Qwen2.5-32B (local) | 40K | ~15 tok/s | ~44 min |
| 5. Polish | GLM 5.1 (cloud) | 2K | fast | ~1 min |
| Total | | | | ~2.4 hours |
Design implication: This is an overnight/batch processing pipeline, not an on-demand tool. Queue posts to generate while you sleep.
Deliverables Checklist
- [ ] LocalAI or Ollama installed on Gaming PC
- [ ] Qwen2.5-32B model downloaded and tested
- [ ] Tailscale hostname configured and reachable from Beelink
- [ ] content/ module created in hoffdesk-api
- [ ] Local LLM client with fallback logic
- [ ] Cloud LLM client (GLM 5.1)
- [ ] Pipeline orchestrator with 5 stages
- [ ] Prompt templates for all content types
- [ ] Error handling and retry logic
- [ ] Metrics collection
- [ ] CLI command: python -m content.generate --topic "DNS incident" --type how_i_solved
Open Questions
- GPU availability: Is Gaming PC always on, or does it sleep? Wake-on-LAN support?
- Windows vs Linux: Socrates preference for local inference stack?
- Model download: OK to pull 18GB model from HuggingFace, or prefer torrent/manual?
- Queue persistence: Should failed jobs retry automatically or wait for manual trigger?
Document: shared/project-docs/blog/content-generation-pipeline-spec.md
Author: Daedalus
For: Socrates (Implementation)
Director: Matt (Approval)