# Content Generation Pipeline — Technical Specification

**Author:** Daedalus 🎨  
**Date:** 2026-04-21 (revised)  
**Status:** **APPROVED** — Ready for Implementation  
**Assignee:** Socrates 🧠 (Backend/Infrastructure)

---

## Revision Notes

**2026-04-21:** Addressed Wadsworth review feedback:
- ✅ Reordered stages: Structure now validates angle before Draft (catch problems early)
- ✅ Updated latency: ~2.4 hours realistic runtime (not 12 minutes)
- ✅ Added overnight/batch processing pattern
- ✅ Added Gaming PC availability check + wake-on-LAN support

---

## Executive Summary

A tiered model routing system for HoffDesk blog content generation. Cloud models (GLM 5.1) handle strategy and polish. Local models (Gaming PC 3080 Ti) handle volume draft generation. Target: 10 posts/month at ~$3 cloud cost vs ~$30 all-cloud.

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                         CONTENT PIPELINE                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  STAGE 1: STRATEGY       STAGE 2: STRUCTURE      STAGE 3: DRAFT       │
│  ┌─────────────┐         ┌─────────────┐        ┌─────────────┐       │
│  │  GLM 5.1    │         │  GLM 5.1    │        │  Local 32B  │       │
│  │  (Cloud)    │         │  (Cloud)    │        │  (Gaming PC)│       │
│  │  ~2K tokens │         │  ~3K tokens │        │  ~80K tokens│       │
│  └──────┬──────┘         └──────┬──────┘        └──────┬──────┘       │
│         │                       │                    │               │
│         ▼                       ▼                    ▼               │
│  ┌─────────────┐         ┌─────────────┐        ┌─────────────┐       │
│  │  Brief +    │         │  Validated  │        │  Draft      │       │
│  │  Outline    │────────▶│  Structure  │───────▶│  (local)    │       │
│  └─────────────┘         └─────────────┘        └──────┬──────┘       │
│                                                        │               │
│         STAGE 5: POLISH       STAGE 4: REVISION        │               │
│         ┌─────────────┐       ┌─────────────┐          │               │
│         │  GLM 5.1    │       │  Local 32B  │          │               │
│         │  (Cloud)    │       │  (Gaming PC)│◀─────────┘               │
│         │  ~2K tokens │       │  ~40K tokens│                          │
│         └──────┬──────┘       └─────────────┘                          │
│                │                                                     │
│                ▼                                                     │
│         ┌─────────────┐                                              │
│         │  Final      │                                              │
│         │  Post       │                                              │
│         └─────────────┘                                              │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

**Stage Order Rationale:** Structure comes before Draft to validate the angle with cheap cloud inference *before* burning 90 minutes of GPU time on a draft that might be off-target. Catch direction problems early.

---

## Hardware Requirements

### Gaming PC (Local Inference Target)

| Component | Requirement | Notes |
|-----------|-------------|-------|
| **GPU** | RTX 3080 Ti (10GB VRAM) | ✅ Already available |
| **RAM** | 32GB+ system RAM | For model offloading if needed |
| **Storage** | 50GB free SSD | Model weights + cache |
| **Network** | Tailscale connected | Route from titanium-butler |
| **OS** | Windows 11 or Linux | Socrates preference |

---

## Software Stack

### Option A: LocalAI (Recommended)

LocalAI provides OpenAI-compatible API, runs on CPU/GPU, supports multiple backends.

```yaml
# local-ai-config.yaml
localai:
  api:
    bind: "0.0.0.0:8080"
    cors: true
  
  models:
    # Primary draft model - Qwen2.5 32B 4-bit
    - name: "draft-generator"
      backend: "llama-cpp"
      model: "/models/qwen2.5-32b-instruct-q4_k_m.gguf"
      context_size: 8192
      threads: 8
      f16: true
      gpu_layers: 35  # Max for 10GB VRAM
      
    # Fallback for shorter tasks
    - name: "fast-revision"
      backend: "llama-cpp"
      model: "/models/llama-3.1-8b-instruct-q4_k_m.gguf"
      context_size: 4096
      threads: 8
      gpu_layers: 35
```

### Option B: vLLM (Higher throughput, more VRAM)

If Qwen2.5-32B fits in 10GB with vLLM's efficient paging:

```python
# vllm serve command
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000
```

### Option C: Ollama (Simplest, less efficient)

```bash
# ollama setup
ollama pull qwen2.5:32b
ollama serve
```

**Recommendation:** Start with Ollama for rapid testing, migrate to LocalAI for production pipeline.

---

## Tailscale Integration

### Networking Setup

```
titanium-butler (Beelink)          Gaming PC (3080 Ti)
    │                                    │
    │    Tailscale network (100.x.x.x)   │
    └────────────────────────────────────┘
              │
              ▼
    ┌─────────────────┐
    │  LocalAI API    │  ← http://gaming-pc:8080/v1
    │  (or vLLM)      │
    └─────────────────┘
```

### Service Discovery

```python
# pipeline/config.py
LOCAL_LLM_HOST = "gaming-pc.tailXXXX.ts.net"  # Tailscale hostname
LOCAL_LLM_PORT = 8080
LOCAL_LLM_API_KEY = "sk-local-no-key-required"  # LocalAI default

CLOUD_LLM_API_KEY = os.getenv("GLM_API_KEY")
CLOUD_LLM_BASE_URL = "https://api.glm.ai/v1"
```

---

## Pipeline Implementation

```python
# Stage Orchestrator
def stage_strategy(...): ...
def stage_structure(strategy_output): ...      # NEW: Validate before Draft
def stage_draft(validated_brief, content_type): ...
def stage_revision(draft_output, edit_notes): ...
def stage_polish(revision_output, content_type): ...
```

### Stage Orchestrator

```python
# content/pipeline.py
def run_pipeline(
    topic: str,
    content_type: Literal["roundup", "how_i_solved", "build_log", "essay"],
    memory_query: Optional[str] = None
) -> dict:
    """
    Execute 5-stage pipeline for content generation.
    REVISED ORDER: Structure validated before expensive Draft stage.
    
    Typical runtime: 2-3 hours (designed for overnight/batch processing).
    Returns final post with metadata and cost breakdown.
    """
    
    # Stage 1: Strategy (Cloud - GLM 5.1) - ~1 min
    strategy = stage_strategy(topic, content_type, memory_query)
    
    # Stage 2: Structure (Cloud - GLM 5.1) - ~1 min
    # Validate angle BEFORE burning GPU time
    structure = stage_structure(strategy.output_text)
    
    # Stage 3: Draft (Local - Gaming PC) - ~90 min
    draft = stage_draft(structure.validated_brief, content_type)
    
    # Stage 4: Revision (Local - Gaming PC) - ~45 min
    revision = stage_revision(draft.output_text, structure.edit_notes)
    
    # Stage 5: Polish (Cloud - GLM 5.1) - ~1 min
    final = stage_polish(revision.output_text, content_type)
    
    return {
        "final_post": final.output_text,
        # ... rest unchanged
    }
```

### Local LLM Client

```python
# content/local_client.py
import httpx
import subprocess
from typing import Iterator, Optional

class LocalLLMClient:
    """
    Client for local inference on Gaming PC via Tailscale.
    Handles connection retries, timeout, error fallback, and wake-on-LAN.
    """
    
    def __init__(
        self,
        base_url: str = "http://gaming-pc.tailXXXX.ts.net:8080",
        model: str = "draft-generator",
        timeout: int = 300,  # 5 min for long generation
        tailscale_hostname: str = "gaming-pc"
    ):
        self.client = httpx.Client(base_url=base_url, timeout=timeout)
        self.model = model
        self.tailscale_hostname = tailscale_hostname
        
    async def check_availability(self) -> dict:
        """
        Check if Gaming PC is online and model is loaded.
        Returns: {"available": bool, "status": str, "wake_lan_possible": bool}
        """
        try:
            response = await self.client.get("/ready", timeout=5.0)
            if response.status_code == 200:
                return {
                    "available": True,
                    "status": "ready",
                    "wake_lan_possible": False
                }
        except (httpx.ConnectError, httpx.TimeoutException):
            # Check if host is reachable via ping
            ping_result = subprocess.run(
                ["ping", "-c", "1", "-W", "2", self.tailscale_hostname],
                capture_output=True
            )
            if ping_result.returncode != 0:
                return {
                    "available": False,
                    "status": "offline",
                    "wake_lan_possible": True  # Could try WoL
                }
            return {
                "available": False,
                "status": "online_but_service_down",
                "wake_lan_possible": False
            }
            
    async def wake_if_needed(self, mac_address: Optional[str] = None) -> bool:
        """
        Attempt wake-on-LAN if Gaming PC is offline.
        Returns True if wake packet sent (doesn't guarantee boot).
        """
        if not mac_address:
            return False
            
        # Send WoL magic packet
        subprocess.run([
            "wakeonlan", mac_address
        ], check=False)
        
        logger.info(f"Wake-on-LAN sent to {mac_address}")
        return True
        
    async def generate_with_availability_check(
        self,
        prompt: str,
        max_tokens: int = 2048,
        temperature: float = 0.7,
        auto_wake: bool = False,
        mac_address: Optional[str] = None
    ) -> str:
        """
        Generate with pre-check. Optionally wake PC if offline.
        """
        status = await self.check_availability()
        
        if not status["available"]:
            if auto_wake and status["wake_lan_possible"] and mac_address:
                await self.wake_if_needed(mac_address)
                raise LocalLLMUnavailable(
                    "Gaming PC offline. Wake-on-LAN sent. Retry in 2 minutes."
                )
            else:
                logger.warning("Local LLM unavailable, falling back to cloud")
                return await cloud_client.generate(prompt, max_tokens, temperature)
        
        return await self.generate(prompt, max_tokens, temperature, stream=False)

class LocalLLMUnavailable(Exception):
    """Raised when local LLM is unreachable and fallback is disabled."""
    pass
```

---

## Model Configuration

### Recommended Models for 10GB VRAM

| Model | Size | Quantization | VRAM | Speed | Use Case |
|-------|------|--------------|------|-------|----------|
| **Qwen2.5-32B-Instruct** | 32B | Q4_K_M | ~9GB | ~15 tok/s | Primary draft gen |
| **Llama-3.3-70B** | 70B | Q4_K_M | ~40GB | N/A | Won't fit, skip |
| **Llama-3.1-8B** | 8B | Q4_K_M | ~5GB | ~40 tok/s | Fast revisions |
| **DeepSeek-Coder-V2** | 16B | Q4_K_M | ~10GB | ~20 tok/s | Code-heavy posts |

**Download commands:**
```bash
# Qwen2.5-32B (primary)
huggingface-cli download bartowski/Qwen2.5-32B-Instruct-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./models

# Llama-3.1-8B (fast fallback)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./models
```

---

## Prompt Templates

### Stage 1: Strategy (Cloud)

```jinja2
{# templates/prompts/stage_strategy.txt #}
You are a content strategist for a technical blog about home infrastructure,
AI agents, and the messy reality of building things.

TOPIC: {{ topic }}
CONTENT TYPE: {{ content_type }}

RECENT CONTEXT:
{{ memory_context }}

Your task: Create a brief and outline for a blog post.

OUTPUT FORMAT:
- Headline options (3)
- Angle: One sentence on why this matters now
- Key moment: The human trigger (e.g., "7 PM, Aundrea asks if internet is down")
- Structure: 5-7 bullet outline
- Technical details needed: What specific errors/configs/logs to include
- Key takeaway: What the reader learns

Be specific. Generic angles get rejected.
```

### Stage 3: Draft (Local)

```jinja2
{# templates/prompts/stage_draft.txt #}
Write a blog post following this exact structure. Match the voice:
- Start with the human moment, not the solution
- Admit wrong turns before revealing the fix
- Use specific versions, error messages, timestamps
- End with actionable takeaways

BRIEF:
{{ strategy_output }}

TEMPLATE STRUCTURE:
{{ content_template }}

VOICE EXAMPLES:
{{ voice_samples }}

Write 800-1200 words. Do not use AI-speak. Do not say "leverage" or "unlock".
```

### Stage 2: Structure (Cloud)

```jinja2
{# templates/prompts/stage_structure.txt #}
You are a developmental editor. Review this brief against good blog structure.

BRIEF:
{{ strategy_output }}

Provide:
1. Structural validation: Does this angle work? (yes/no + why)
2. Outline critique: Specific improvements to the 5-7 bullet structure
3. Red flags: Is the human moment clear? Is the takeaway actionable?

If the angle is weak, say so. Better to fail fast here than after 90 min of GPU time.
```

---

## Batch Processing & Overnight Pattern

Given the 2-3 hour runtime, design for queue-based generation:

```python
# content/scheduler.py
class OvernightPipelineScheduler:
    """
    Queue posts to generate during low-activity hours.
    Designed to run overnight while Gaming PC is idle.
    """
    
    def __init__(self, start_hour=22, end_hour=6):
        self.start_hour = start_hour  # 10 PM
        self.end_hour = end_hour       # 6 AM
        self.queue = []
        
    def add_to_queue(self, topic, content_type):
        self.queue.append({
            "topic": topic,
            "content_type": content_type,
            "added_at": datetime.now(),
            "status": "pending"
        })
        
    async def run_nightly_batch(self):
        """
        Check PC availability, wake if needed, process queue.
        """
        status = await local_client.check_availability()
        
        if not status["available"] and status["wake_lan_possible"]:
            await local_client.wake_if_needed(mac_address=GAMING_PC_MAC)
            logger.info("Gaming PC woken for overnight batch. Waiting 2 min...")
            await asyncio.sleep(120)  # Wait for boot
        
        for job in self.queue:
            if job["status"] == "pending":
                try:
                    result = await run_pipeline(
                        topic=job["topic"],
                        content_type=job["content_type"]
                    )
                    job["status"] = "completed"
                    job["result"] = result
                    job["completed_at"] = datetime.now()
                except Exception as e:
                    job["status"] = "failed"
                    job["error"] = str(e)
```

**Usage pattern:**
```bash
# Queue posts throughout the day
python -m content.queue --add "DNS incident April 2026" --type how_i_solved
python -m content.queue --add "OpenClaw tutorial: custom skills" --type tutorial

# Run overnight (cron job at 10 PM)
python -m content.queue --run-overnight
```

---

## Error Handling & Fallbacks

| Failure Mode | Detection | Response |
|--------------|-----------|----------|
| Gaming PC offline | Connection timeout | Fallback to cloud, queue for retry |
| Local model OOM | CUDA out of memory | Switch to smaller model (8B), retry |
| Nonsense output | Repetition detector | Regenerate with higher temp, flag for review |
| Stage timeout | 5min+ no response | Cancel, fallback to cloud, alert |
| Cloud API rate limit | 429 response | Exponential backoff, queue for later |

---

## Monitoring & Metrics

## Latency Estimates (Revised)

**Reality check:** Local inference at 15 tok/s × 80K tokens = ~90 minutes for Draft alone.

| Stage | Model | Tokens | Speed | Time |
|-------|-------|--------|-------|------|
| 1. Strategy | GLM 5.1 (cloud) | 2K | fast | ~1 min |
| 2. Structure | GLM 5.1 (cloud) | 3K | fast | ~1 min |
| 3. Draft | Qwen2.5-32B (local) | 80K | ~15 tok/s | **~89 min** |
| 4. Revision | Qwen2.5-32B (local) | 40K | ~15 tok/s | **~44 min** |
| 5. Polish | GLM 5.1 (cloud) | 2K | fast | ~1 min |
| **Total** | | | | **~2.4 hours** |

**Design implication:** This is an overnight/batch processing pipeline, not an on-demand tool. Queue posts to generate while you sleep.

---

## Deliverables Checklist

- [ ] LocalAI or Ollama installed on Gaming PC
- [ ] Qwen2.5-32B model downloaded and tested
- [ ] Tailscale hostname configured and reachable from Beelink
- [ ] `content/` module created in hoffdesk-api
- [ ] Local LLM client with fallback logic
- [ ] Cloud LLM client (GLM 5.1)
- [ ] Pipeline orchestrator with 5 stages
- [ ] Prompt templates for all content types
- [ ] Error handling and retry logic
- [ ] Metrics collection
- [ ] CLI command: `python -m content.generate --topic "DNS incident" --type how_i_solved`

---

## Open Questions

1. **GPU availability**: Is Gaming PC always on, or does it sleep? Wake-on-LAN support?
2. **Windows vs Linux**: Socrates preference for local inference stack?
3. **Model download**: OK to pull 18GB model from HuggingFace, or prefer torrent/manual?
4. **Queue persistence**: Should failed jobs retry automatically or wait for manual trigger?

---

*Document: `shared/project-docs/blog/content-generation-pipeline-spec.md`*  
*Author: Daedalus 🎨*  
*For: Socrates 🧠 (Implementation)*  
*Director: Matt (Approval)*