
Content Generation Pipeline β€” Technical Specification

Author: Daedalus 🎨
Date: 2026-04-21 (revised)
Status: APPROVED β€” Ready for Implementation
Assignee: Socrates 🧠 (Backend/Infrastructure)


Revision Notes

2026-04-21: Addressed Wadsworth review feedback:
- βœ… Reordered stages: Structure now validates angle before Draft (catch problems early)
- βœ… Updated latency: ~2.4 hours realistic runtime (not 12 minutes)
- βœ… Added overnight/batch processing pattern
- βœ… Added Gaming PC availability check + wake-on-LAN support


Executive Summary

A tiered model routing system for HoffDesk blog content generation. Cloud models (GLM 5.1) handle strategy, structure validation, and final polish; local models on the Gaming PC (RTX 3080 Ti) handle the high-volume draft and revision stages. Target: 10 posts/month at ~$3 cloud cost vs ~$30 all-cloud.


Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         CONTENT PIPELINE                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  STAGE 1: STRATEGY       STAGE 2: STRUCTURE      STAGE 3: DRAFT       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚  GLM 5.1    β”‚         β”‚  GLM 5.1    β”‚        β”‚  Local 32B  β”‚       β”‚
β”‚  β”‚  (Cloud)    β”‚         β”‚  (Cloud)    β”‚        β”‚  (Gaming PC)β”‚       β”‚
β”‚  β”‚  ~2K tokens β”‚         β”‚  ~3K tokens β”‚        β”‚  ~80K tokensβ”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚         β”‚                       β”‚                    β”‚               β”‚
β”‚         β–Ό                       β–Ό                    β–Ό               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚  Brief +    β”‚         β”‚  Validated  β”‚        β”‚  Draft      β”‚       β”‚
β”‚  β”‚  Outline    │────────▢│  Structure  │───────▢│  (local)    β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                                                        β”‚               β”‚
β”‚         STAGE 5: POLISH       STAGE 4: REVISION        β”‚               β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚               β”‚
β”‚         β”‚  GLM 5.1    β”‚       β”‚  Local 32B  β”‚          β”‚               β”‚
β”‚         β”‚  (Cloud)    β”‚       β”‚  (Gaming PC)β”‚β—€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚         β”‚  ~2K tokens β”‚       β”‚  ~40K tokensβ”‚                          β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                          β”‚
β”‚                β”‚                                                     β”‚
β”‚                β–Ό                                                     β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                              β”‚
β”‚         β”‚  Final      β”‚                                              β”‚
β”‚         β”‚  Post       β”‚                                              β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                              β”‚
β”‚                                                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Stage Order Rationale: Structure comes before Draft to validate the angle with cheap cloud inference before burning 90 minutes of GPU time on a draft that might be off-target. Catch direction problems early.


Hardware Requirements

Gaming PC (Local Inference Target)

Component   Requirement                  Notes
GPU         RTX 3080 Ti (10GB VRAM)      βœ… Already available
RAM         32GB+ system RAM             For model offloading if needed
Storage     50GB free SSD                Model weights + cache
Network     Tailscale connected          Route from titanium-butler
OS          Windows 11 or Linux          Socrates preference

Software Stack

Option A: LocalAI (Recommended for production)

LocalAI provides an OpenAI-compatible API, runs on CPU or GPU, and supports multiple backends.

# local-ai-config.yaml
localai:
  api:
    bind: "0.0.0.0:8080"
    cors: true

  models:
    # Primary draft model - Qwen2.5 32B 4-bit
    - name: "draft-generator"
      backend: "llama-cpp"
      model: "/models/qwen2.5-32b-instruct-q4_k_m.gguf"
      context_size: 8192
      threads: 8
      f16: true
      gpu_layers: 35  # Max for 10GB VRAM

    # Fallback for shorter tasks
    - name: "fast-revision"
      backend: "llama-cpp"
      model: "/models/llama-3.1-8b-instruct-q4_k_m.gguf"
      context_size: 4096
      threads: 8
      gpu_layers: 35
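
A quick smoke test from titanium-butler confirms the config loaded and both models are being served, using the same OpenAI-compatible API the pipeline will hit. Sketch only: the Tailscale hostname is a placeholder and /v1/models is the standard OpenAI-compatible model listing LocalAI exposes.

# smoke_test_models.py: verify LocalAI is serving both configured models
import httpx

resp = httpx.get("http://gaming-pc.tailXXXX.ts.net:8080/v1/models", timeout=10.0)
resp.raise_for_status()
served = {m["id"] for m in resp.json().get("data", [])}
for name in ("draft-generator", "fast-revision"):
    print(f"{name}: {'OK' if name in served else 'MISSING'}")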

Option B: vLLM (Higher throughput, more VRAM)

If Qwen2.5-32B fits in 10GB with vLLM's efficient paging:

# vllm serve command
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000

Option C: Ollama (Simplest, less efficient)

# ollama setup
ollama pull qwen2.5:32b
ollama serve

Recommendation: Start with Ollama for rapid testing, then migrate to LocalAI for the production pipeline.
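
Because all three options expose an OpenAI-compatible endpoint, the pipeline code can stay backend-agnostic across the Ollama-to-LocalAI migration; only the base URL and model name change. A minimal sketch (the LocalAI port matches the config above; Ollama's default port 11434 and /v1 path are assumptions to verify on the Gaming PC):

# Backend-agnostic local client: swap base_url when migrating Ollama -> LocalAI
from openai import OpenAI

BACKENDS = {
    "localai": ("http://gaming-pc.tailXXXX.ts.net:8080/v1", "draft-generator"),
    "ollama":  ("http://gaming-pc.tailXXXX.ts.net:11434/v1", "qwen2.5:32b"),
}

def local_client_for(backend: str) -> tuple[OpenAI, str]:
    base_url, model = BACKENDS[backend]
    # Local servers ignore the key, but the OpenAI SDK requires a non-empty value
    return OpenAI(base_url=base_url, api_key="sk-local-no-key-required"), model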


Tailscale Integration

Networking Setup

titanium-butler (Beelink)          Gaming PC (3080 Ti)
    β”‚                                    β”‚
    β”‚    Tailscale network (100.x.x.x)   β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚
              β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  LocalAI API    β”‚  ← http://gaming-pc:8080/v1
    β”‚  (or vLLM)      β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Service Discovery

# pipeline/config.py
import os

LOCAL_LLM_HOST = "gaming-pc.tailXXXX.ts.net"  # Tailscale hostname
LOCAL_LLM_PORT = 8080
LOCAL_LLM_API_KEY = "sk-local-no-key-required"  # LocalAI default

CLOUD_LLM_API_KEY = os.getenv("GLM_API_KEY")
CLOUD_LLM_BASE_URL = "https://api.glm.ai/v1"
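
How these values feed the tiered routing: Stages 1, 2, and 5 go to the cloud endpoint, Stages 3 and 4 to the Gaming PC. A sketch of the per-stage routing table (module path and helper name are illustrative; assumes both endpoints speak the OpenAI chat completions API):

# pipeline/routing.py (illustrative): pick an endpoint per stage from the config above
from openai import OpenAI
from pipeline import config

STAGE_ROUTING = {
    "strategy":  "cloud",   # GLM 5.1
    "structure": "cloud",   # GLM 5.1
    "draft":     "local",   # Gaming PC, 32B
    "revision":  "local",   # Gaming PC, 32B
    "polish":    "cloud",   # GLM 5.1
}

def client_for_stage(stage: str) -> OpenAI:
    if STAGE_ROUTING[stage] == "local":
        return OpenAI(
            base_url=f"http://{config.LOCAL_LLM_HOST}:{config.LOCAL_LLM_PORT}/v1",
            api_key=config.LOCAL_LLM_API_KEY,
        )
    return OpenAI(base_url=config.CLOUD_LLM_BASE_URL, api_key=config.CLOUD_LLM_API_KEY)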

Pipeline Implementation

# Stage function signatures (orchestrator below)
def stage_strategy(topic, content_type, memory_query): ...
def stage_structure(strategy_output): ...       # NEW: validate before Draft
def stage_draft(validated_brief, content_type): ...
def stage_revision(draft_output, edit_notes): ...
def stage_polish(revision_output, content_type): ...

Stage Orchestrator

# content/pipeline.py
from typing import Literal, Optional

async def run_pipeline(
    topic: str,
    content_type: Literal["roundup", "how_i_solved", "build_log", "essay"],
    memory_query: Optional[str] = None
) -> dict:
    """
    Execute 5-stage pipeline for content generation.
    REVISED ORDER: Structure validated before expensive Draft stage.

    Typical runtime: 2-3 hours (designed for overnight/batch processing).
    Returns final post with metadata and cost breakdown.
    """

    # Stage 1: Strategy (Cloud - GLM 5.1) - ~1 min
    strategy = stage_strategy(topic, content_type, memory_query)

    # Stage 2: Structure (Cloud - GLM 5.1) - ~1 min
    # Validate angle BEFORE burning GPU time
    structure = stage_structure(strategy.output_text)

    # Stage 3: Draft (Local - Gaming PC) - ~90 min
    draft = stage_draft(structure.validated_brief, content_type)

    # Stage 4: Revision (Local - Gaming PC) - ~45 min
    revision = stage_revision(draft.output_text, structure.edit_notes)

    # Stage 5: Polish (Cloud - GLM 5.1) - ~1 min
    final = stage_polish(revision.output_text, content_type)

    return {
        "final_post": final.output_text,
        # ... rest unchanged
    }
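
run_pipeline assumes each stage returns an object exposing output_text, plus validated_brief and edit_notes from Structure. A minimal sketch of the Structure stage with the fail-fast gate described in the stage-order rationale; the file path, StageResult shape, cloud_generate helper, and the crude pass/fail check are all assumptions, and render_template is the Jinja2 helper sketched under Prompt Templates below.

# content/stages.py (sketch): Structure stage result shape + fail-fast gate
from dataclasses import dataclass

@dataclass
class StageResult:
    output_text: str
    validated_brief: str = ""
    edit_notes: str = ""

class AngleRejected(Exception):
    """Structure stage judged the angle too weak to spend ~90 min of GPU time on."""

def stage_structure(strategy_output: str) -> StageResult:
    prompt = render_template("stage_structure.txt", strategy_output=strategy_output)
    review = cloud_generate(prompt, max_tokens=2048)       # GLM 5.1, ~1 min
    first_line = review.strip().splitlines()[0].lower()
    if "no" in first_line and "yes" not in first_line:     # crude validation check
        raise AngleRejected(review)                        # abort before the ~90 min draft
    return StageResult(output_text=review,
                       validated_brief=strategy_output,
                       edit_notes=review)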

Local LLM Client

# content/local_client.py
import logging
import subprocess
from typing import Optional

import httpx

logger = logging.getLogger(__name__)

class LocalLLMClient:
    """
    Client for local inference on Gaming PC via Tailscale.
    Handles connection retries, timeout, error fallback, and wake-on-LAN.
    """

    def __init__(
        self,
        base_url: str = "http://gaming-pc.tailXXXX.ts.net:8080",
        model: str = "draft-generator",
        timeout: int = 300,  # 5 min for long generation
        tailscale_hostname: str = "gaming-pc"
    ):
        # Async client: the availability/generation methods below use await
        self.client = httpx.AsyncClient(base_url=base_url, timeout=timeout)
        self.model = model
        self.tailscale_hostname = tailscale_hostname

    async def check_availability(self) -> dict:
        """
        Check if Gaming PC is online and model is loaded.
        Returns: {"available": bool, "status": str, "wake_lan_possible": bool}
        """
        try:
            response = await self.client.get("/ready", timeout=5.0)
            if response.status_code == 200:
                return {
                    "available": True,
                    "status": "ready",
                    "wake_lan_possible": False
                }
        except (httpx.ConnectError, httpx.TimeoutException):
            # Check if host is reachable via ping
            ping_result = subprocess.run(
                ["ping", "-c", "1", "-W", "2", self.tailscale_hostname],
                capture_output=True
            )
            if ping_result.returncode != 0:
                return {
                    "available": False,
                    "status": "offline",
                    "wake_lan_possible": True  # Could try WoL
                }
            return {
                "available": False,
                "status": "online_but_service_down",
                "wake_lan_possible": False
            }

        # /ready reachable but returned non-200: service up, model not ready yet
        return {
            "available": False,
            "status": "service_not_ready",
            "wake_lan_possible": False
        }

    async def wake_if_needed(self, mac_address: Optional[str] = None) -> bool:
        """
        Attempt wake-on-LAN if Gaming PC is offline.
        Returns True if wake packet sent (doesn't guarantee boot).
        """
        if not mac_address:
            return False

        # Send WoL magic packet
        subprocess.run([
            "wakeonlan", mac_address
        ], check=False)

        logger.info(f"Wake-on-LAN sent to {mac_address}")
        return True

    async def generate_with_availability_check(
        self,
        prompt: str,
        max_tokens: int = 2048,
        temperature: float = 0.7,
        auto_wake: bool = False,
        mac_address: Optional[str] = None
    ) -> str:
        """
        Generate with pre-check. Optionally wake PC if offline.
        """
        status = await self.check_availability()

        if not status["available"]:
            if auto_wake and status["wake_lan_possible"] and mac_address:
                await self.wake_if_needed(mac_address)
                raise LocalLLMUnavailable(
                    "Gaming PC offline. Wake-on-LAN sent. Retry in 2 minutes."
                )
            else:
                # cloud_client: module-level GLM 5.1 client instance (defined elsewhere)
                logger.warning("Local LLM unavailable, falling back to cloud")
                return await cloud_client.generate(prompt, max_tokens, temperature)

        return await self.generate(prompt, max_tokens, temperature, stream=False)

class LocalLLMUnavailable(Exception):
    """Raised when local LLM is unreachable and fallback is disabled."""
    pass
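
The generate() method that generate_with_availability_check calls is not shown above. A minimal non-streaming sketch against the OpenAI-compatible chat completions endpoint that LocalAI (and vLLM/Ollama) expose; treat the payload shape as the standard OpenAI one, not a verified LocalAI-specific contract.

# content/local_client.py (continued): inside LocalLLMClient
    async def generate(
        self,
        prompt: str,
        max_tokens: int = 2048,
        temperature: float = 0.7,
        stream: bool = False,   # streaming omitted in this sketch
    ) -> str:
        response = await self.client.post(
            "/v1/chat/completions",
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "temperature": temperature,
                "stream": False,
            },
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]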

Model Configuration

Model                  Size   Quantization   VRAM     Speed        Use Case
Qwen2.5-32B-Instruct   32B    Q4_K_M         ~9GB     ~15 tok/s    Primary draft gen
Llama-3.3-70B          70B    Q4_K_M         ~40GB    N/A          Won't fit, skip
Llama-3.1-8B           8B     Q4_K_M         ~5GB     ~40 tok/s    Fast revisions
DeepSeek-Coder-V2      16B    Q4_K_M         ~10GB    ~20 tok/s    Code-heavy posts

Download commands:

# Qwen2.5-32B (primary)
huggingface-cli download bartowski/Qwen2.5-32B-Instruct-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./models

# Llama-3.1-8B (fast fallback)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./models

Prompt Templates

Stage 1: Strategy (Cloud)

{# templates/prompts/stage_strategy.txt #}
You are a content strategist for a technical blog about home infrastructure,
AI agents, and the messy reality of building things.

TOPIC: {{ topic }}
CONTENT TYPE: {{ content_type }}

RECENT CONTEXT:
{{ memory_context }}

Your task: Create a brief and outline for a blog post.

OUTPUT FORMAT:
- Headline options (3)
- Angle: One sentence on why this matters now
- Key moment: The human trigger (e.g., "7 PM, Aundrea asks if internet is down")
- Structure: 5-7 bullet outline
- Technical details needed: What specific errors/configs/logs to include
- Key takeaway: What the reader learns

Be specific. Generic angles get rejected.

Stage 2: Structure (Cloud)

{# templates/prompts/stage_structure.txt #}
You are a developmental editor. Review this brief against good blog structure.

BRIEF:
{{ strategy_output }}

Provide:
1. Structural validation: Does this angle work? (yes/no + why)
2. Outline critique: Specific improvements to the 5-7 bullet structure
3. Red flags: Is the human moment clear? Is the takeaway actionable?

If the angle is weak, say so. Better to fail fast here than after 90 min of GPU time.

Stage 3: Draft (Local)

{# templates/prompts/stage_draft.txt #}
Write a blog post following this exact structure. Match the voice:
- Start with the human moment, not the solution
- Admit wrong turns before revealing the fix
- Use specific versions, error messages, timestamps
- End with actionable takeaways

BRIEF:
{{ strategy_output }}

TEMPLATE STRUCTURE:
{{ content_template }}

VOICE EXAMPLES:
{{ voice_samples }}

Write 800-1200 words. Do not use AI-speak. Do not say "leverage" or "unlock".
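
The {{ }} and {# #} syntax in the templates above is Jinja2. A small rendering helper the stages can share (helper name and module path are illustrative; the directory matches the paths in the template comments):

# content/templating.py (illustrative): render a stage prompt with Jinja2
from jinja2 import Environment, FileSystemLoader

_env = Environment(loader=FileSystemLoader("templates/prompts"), autoescape=False)

def render_template(name: str, **context) -> str:
    # e.g. render_template("stage_strategy.txt", topic=topic,
    #                      content_type=content_type, memory_context=memory_context)
    return _env.get_template(name).render(**context)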

Batch Processing & Overnight Pattern

Given the 2-3 hour runtime, design for queue-based generation:

# content/scheduler.py
import asyncio
import logging
from datetime import datetime

# Assumes module-level objects defined elsewhere in the package:
#   local_client (LocalLLMClient), GAMING_PC_MAC, run_pipeline
logger = logging.getLogger(__name__)

class OvernightPipelineScheduler:
    """
    Queue posts to generate during low-activity hours.
    Designed to run overnight while the Gaming PC is idle.
    """

    def __init__(self, start_hour=22, end_hour=6):
        self.start_hour = start_hour   # 10 PM
        self.end_hour = end_hour       # 6 AM
        self.queue = []

    def add_to_queue(self, topic, content_type):
        self.queue.append({
            "topic": topic,
            "content_type": content_type,
            "added_at": datetime.now(),
            "status": "pending"
        })

    async def run_nightly_batch(self):
        """
        Check PC availability, wake if needed, process queue.
        """
        status = await local_client.check_availability()

        if not status["available"] and status["wake_lan_possible"]:
            await local_client.wake_if_needed(mac_address=GAMING_PC_MAC)
            logger.info("Gaming PC woken for overnight batch. Waiting 2 min...")
            await asyncio.sleep(120)  # Wait for boot

        for job in self.queue:
            if job["status"] == "pending":
                try:
                    result = await run_pipeline(
                        topic=job["topic"],
                        content_type=job["content_type"]
                    )
                    job["status"] = "completed"
                    job["result"] = result
                    job["completed_at"] = datetime.now()
                except Exception as e:
                    job["status"] = "failed"
                    job["error"] = str(e)

Usage pattern:

# Queue posts throughout the day
python -m content.queue --add "DNS incident April 2026" --type how_i_solved
python -m content.queue --add "OpenClaw tutorial: custom skills" --type tutorial

# Run overnight (cron job at 10 PM)
python -m content.queue --run-overnight
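
A sketch of the content.queue entry point those commands imply. Because --add runs during the day and --run-overnight runs from cron, the queue has to survive between invocations; this sketch uses a throwaway JSON file, with real persistence and retry policy left to Open Question 4.

# content/queue.py (sketch): CLI wiring for the commands above
import argparse
import asyncio
import json
import pathlib

from content.scheduler import OvernightPipelineScheduler

QUEUE_FILE = pathlib.Path("content_queue.json")

def main():
    parser = argparse.ArgumentParser(prog="content.queue")
    parser.add_argument("--add", metavar="TOPIC")
    parser.add_argument("--type", dest="content_type",
                        choices=["roundup", "how_i_solved", "build_log", "essay"])
    parser.add_argument("--run-overnight", action="store_true")
    args = parser.parse_args()

    jobs = json.loads(QUEUE_FILE.read_text()) if QUEUE_FILE.exists() else []
    if args.add:
        jobs.append({"topic": args.add, "content_type": args.content_type,
                     "status": "pending"})
    elif args.run_overnight:
        scheduler = OvernightPipelineScheduler()
        scheduler.queue = jobs
        asyncio.run(scheduler.run_nightly_batch())
    # default=str handles the datetime fields the scheduler adds
    QUEUE_FILE.write_text(json.dumps(jobs, indent=2, default=str))

if __name__ == "__main__":
    main()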

Error Handling & Fallbacks

Failure Mode           Detection                  Response
Gaming PC offline      Connection timeout         Fall back to cloud, queue for retry
Local model OOM        CUDA out-of-memory error   Switch to smaller model (8B), retry
Nonsense output        Repetition detector        Regenerate with higher temperature, flag for review
Stage timeout          No response after 5+ min   Cancel, fall back to cloud, alert
Cloud API rate limit   HTTP 429 response          Exponential backoff, queue for later
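
For the cloud rate-limit row, a minimal exponential backoff wrapper (retry count and delays are illustrative defaults, not tuned values; assumes the wrapped call raises httpx.HTTPStatusError via response.raise_for_status()):

# Retry helper for cloud 429s: doubles the wait on each attempt
import asyncio
import httpx

async def with_backoff(call, max_retries: int = 5, base_delay: float = 2.0):
    for attempt in range(max_retries):
        try:
            return await call()
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code != 429 or attempt == max_retries - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))

# usage: await with_backoff(lambda: cloud_client.generate(prompt, 2048, 0.7))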

Monitoring & Metrics

Latency Estimates (Revised)

Reality check: Local inference at 15 tok/s Γ— 80K tokens = ~90 minutes for Draft alone.

Stage          Model                  Tokens   Speed       Time
1. Strategy    GLM 5.1 (cloud)        2K       fast        ~1 min
2. Structure   GLM 5.1 (cloud)        3K       fast        ~1 min
3. Draft       Qwen2.5-32B (local)    80K      ~15 tok/s   ~89 min
4. Revision    Qwen2.5-32B (local)    40K      ~15 tok/s   ~44 min
5. Polish      GLM 5.1 (cloud)        2K       fast        ~1 min
Total                                                      ~2.4 hours

Design implication: This is an overnight/batch processing pipeline, not an on-demand tool. Queue posts to generate while you sleep.


Deliverables Checklist

  • [ ] LocalAI or Ollama installed on Gaming PC
  • [ ] Qwen2.5-32B model downloaded and tested
  • [ ] Tailscale hostname configured and reachable from Beelink
  • [ ] content/ module created in hoffdesk-api
  • [ ] Local LLM client with fallback logic
  • [ ] Cloud LLM client (GLM 5.1)
  • [ ] Pipeline orchestrator with 5 stages
  • [ ] Prompt templates for all content types
  • [ ] Error handling and retry logic
  • [ ] Metrics collection
  • [ ] CLI command: python -m content.generate --topic "DNS incident" --type how_i_solved

Open Questions

  1. GPU availability: Is Gaming PC always on, or does it sleep? Wake-on-LAN support?
  2. Windows vs Linux: Socrates preference for local inference stack?
  3. Model download: OK to pull 18GB model from HuggingFace, or prefer torrent/manual?
  4. Queue persistence: Should failed jobs retry automatically or wait for manual trigger?

Document: shared/project-docs/blog/content-generation-pipeline-spec.md
Author: Daedalus 🎨
For: Socrates 🧠 (Implementation)
Director: Matt (Approval)