# Content Generation Pipeline β€” Technical Specification **Author:** Daedalus 🎨 **Date:** 2026-04-21 (revised) **Status:** **APPROVED** β€” Ready for Implementation **Assignee:** Socrates 🧠 (Backend/Infrastructure) --- ## Revision Notes **2026-04-21:** Addressed Wadsworth review feedback: - βœ… Reordered stages: Structure now validates angle before Draft (catch problems early) - βœ… Updated latency: ~2.4 hours realistic runtime (not 12 minutes) - βœ… Added overnight/batch processing pattern - βœ… Added Gaming PC availability check + wake-on-LAN support --- ## Executive Summary A tiered model routing system for HoffDesk blog content generation. Cloud models (GLM 5.1) handle strategy and polish. Local models (Gaming PC 3080 Ti) handle volume draft generation. Target: 10 posts/month at ~$3 cloud cost vs ~$30 all-cloud. --- ## Architecture Overview ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ CONTENT PIPELINE β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ β”‚ STAGE 1: STRATEGY STAGE 2: STRUCTURE STAGE 3: DRAFT β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ GLM 5.1 β”‚ β”‚ GLM 5.1 β”‚ β”‚ Local 32B β”‚ β”‚ β”‚ β”‚ (Cloud) β”‚ β”‚ (Cloud) β”‚ β”‚ (Gaming PC)β”‚ β”‚ β”‚ β”‚ ~2K tokens β”‚ β”‚ ~3K tokens β”‚ β”‚ ~80K tokensβ”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β–Ό β–Ό β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ Brief + β”‚ β”‚ Validated β”‚ β”‚ Draft β”‚ β”‚ β”‚ β”‚ Outline │────────▢│ Structure │───────▢│ (local) β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ STAGE 5: POLISH STAGE 4: REVISION β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ GLM 5.1 β”‚ β”‚ Local 32B β”‚ β”‚ β”‚ β”‚ β”‚ (Cloud) β”‚ β”‚ (Gaming PC)β”‚β—€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ ~2K tokens β”‚ β”‚ ~40K tokensβ”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ Final β”‚ β”‚ β”‚ β”‚ Post β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` **Stage Order Rationale:** Structure comes before Draft to validate the angle with cheap cloud inference *before* burning 90 minutes of GPU time on a draft that might be off-target. Catch direction problems early. --- ## Hardware Requirements ### Gaming PC (Local Inference Target) | Component | Requirement | Notes | |-----------|-------------|-------| | **GPU** | RTX 3080 Ti (10GB VRAM) | βœ… Already available | | **RAM** | 32GB+ system RAM | For model offloading if needed | | **Storage** | 50GB free SSD | Model weights + cache | | **Network** | Tailscale connected | Route from titanium-butler | | **OS** | Windows 11 or Linux | Socrates preference | --- ## Software Stack ### Option A: LocalAI (Recommended) LocalAI provides OpenAI-compatible API, runs on CPU/GPU, supports multiple backends. ```yaml # local-ai-config.yaml localai: api: bind: "0.0.0.0:8080" cors: true models: # Primary draft model - Qwen2.5 32B 4-bit - name: "draft-generator" backend: "llama-cpp" model: "/models/qwen2.5-32b-instruct-q4_k_m.gguf" context_size: 8192 threads: 8 f16: true gpu_layers: 35 # Max for 10GB VRAM # Fallback for shorter tasks - name: "fast-revision" backend: "llama-cpp" model: "/models/llama-3.1-8b-instruct-q4_k_m.gguf" context_size: 4096 threads: 8 gpu_layers: 35 ``` ### Option B: vLLM (Higher throughput, more VRAM) If Qwen2.5-32B fits in 10GB with vLLM's efficient paging: ```python # vllm serve command python -m vllm.entrypoints.openai.api_server \ --model Qwen/Qwen2.5-32B-Instruct-AWQ \ --quantization awq \ --tensor-parallel-size 1 \ --max-model-len 8192 \ --port 8000 ``` ### Option C: Ollama (Simplest, less efficient) ```bash # ollama setup ollama pull qwen2.5:32b ollama serve ``` **Recommendation:** Start with Ollama for rapid testing, migrate to LocalAI for production pipeline. --- ## Tailscale Integration ### Networking Setup ``` titanium-butler (Beelink) Gaming PC (3080 Ti) β”‚ β”‚ β”‚ Tailscale network (100.x.x.x) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ LocalAI API β”‚ ← http://gaming-pc:8080/v1 β”‚ (or vLLM) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` ### Service Discovery ```python # pipeline/config.py LOCAL_LLM_HOST = "gaming-pc.tailXXXX.ts.net" # Tailscale hostname LOCAL_LLM_PORT = 8080 LOCAL_LLM_API_KEY = "sk-local-no-key-required" # LocalAI default CLOUD_LLM_API_KEY = os.getenv("GLM_API_KEY") CLOUD_LLM_BASE_URL = "https://api.glm.ai/v1" ``` --- ## Pipeline Implementation ```python # Stage Orchestrator def stage_strategy(...): ... def stage_structure(strategy_output): ... # NEW: Validate before Draft def stage_draft(validated_brief, content_type): ... def stage_revision(draft_output, edit_notes): ... def stage_polish(revision_output, content_type): ... ``` ### Stage Orchestrator ```python # content/pipeline.py def run_pipeline( topic: str, content_type: Literal["roundup", "how_i_solved", "build_log", "essay"], memory_query: Optional[str] = None ) -> dict: """ Execute 5-stage pipeline for content generation. REVISED ORDER: Structure validated before expensive Draft stage. Typical runtime: 2-3 hours (designed for overnight/batch processing). Returns final post with metadata and cost breakdown. """ # Stage 1: Strategy (Cloud - GLM 5.1) - ~1 min strategy = stage_strategy(topic, content_type, memory_query) # Stage 2: Structure (Cloud - GLM 5.1) - ~1 min # Validate angle BEFORE burning GPU time structure = stage_structure(strategy.output_text) # Stage 3: Draft (Local - Gaming PC) - ~90 min draft = stage_draft(structure.validated_brief, content_type) # Stage 4: Revision (Local - Gaming PC) - ~45 min revision = stage_revision(draft.output_text, structure.edit_notes) # Stage 5: Polish (Cloud - GLM 5.1) - ~1 min final = stage_polish(revision.output_text, content_type) return { "final_post": final.output_text, # ... rest unchanged } ``` ### Local LLM Client ```python # content/local_client.py import httpx import subprocess from typing import Iterator, Optional class LocalLLMClient: """ Client for local inference on Gaming PC via Tailscale. Handles connection retries, timeout, error fallback, and wake-on-LAN. """ def __init__( self, base_url: str = "http://gaming-pc.tailXXXX.ts.net:8080", model: str = "draft-generator", timeout: int = 300, # 5 min for long generation tailscale_hostname: str = "gaming-pc" ): self.client = httpx.Client(base_url=base_url, timeout=timeout) self.model = model self.tailscale_hostname = tailscale_hostname async def check_availability(self) -> dict: """ Check if Gaming PC is online and model is loaded. Returns: {"available": bool, "status": str, "wake_lan_possible": bool} """ try: response = await self.client.get("/ready", timeout=5.0) if response.status_code == 200: return { "available": True, "status": "ready", "wake_lan_possible": False } except (httpx.ConnectError, httpx.TimeoutException): # Check if host is reachable via ping ping_result = subprocess.run( ["ping", "-c", "1", "-W", "2", self.tailscale_hostname], capture_output=True ) if ping_result.returncode != 0: return { "available": False, "status": "offline", "wake_lan_possible": True # Could try WoL } return { "available": False, "status": "online_but_service_down", "wake_lan_possible": False } async def wake_if_needed(self, mac_address: Optional[str] = None) -> bool: """ Attempt wake-on-LAN if Gaming PC is offline. Returns True if wake packet sent (doesn't guarantee boot). """ if not mac_address: return False # Send WoL magic packet subprocess.run([ "wakeonlan", mac_address ], check=False) logger.info(f"Wake-on-LAN sent to {mac_address}") return True async def generate_with_availability_check( self, prompt: str, max_tokens: int = 2048, temperature: float = 0.7, auto_wake: bool = False, mac_address: Optional[str] = None ) -> str: """ Generate with pre-check. Optionally wake PC if offline. """ status = await self.check_availability() if not status["available"]: if auto_wake and status["wake_lan_possible"] and mac_address: await self.wake_if_needed(mac_address) raise LocalLLMUnavailable( "Gaming PC offline. Wake-on-LAN sent. Retry in 2 minutes." ) else: logger.warning("Local LLM unavailable, falling back to cloud") return await cloud_client.generate(prompt, max_tokens, temperature) return await self.generate(prompt, max_tokens, temperature, stream=False) class LocalLLMUnavailable(Exception): """Raised when local LLM is unreachable and fallback is disabled.""" pass ``` --- ## Model Configuration ### Recommended Models for 10GB VRAM | Model | Size | Quantization | VRAM | Speed | Use Case | |-------|------|--------------|------|-------|----------| | **Qwen2.5-32B-Instruct** | 32B | Q4_K_M | ~9GB | ~15 tok/s | Primary draft gen | | **Llama-3.3-70B** | 70B | Q4_K_M | ~40GB | N/A | Won't fit, skip | | **Llama-3.1-8B** | 8B | Q4_K_M | ~5GB | ~40 tok/s | Fast revisions | | **DeepSeek-Coder-V2** | 16B | Q4_K_M | ~10GB | ~20 tok/s | Code-heavy posts | **Download commands:** ```bash # Qwen2.5-32B (primary) huggingface-cli download bartowski/Qwen2.5-32B-Instruct-GGUF \ --include "*Q4_K_M.gguf" --local-dir ./models # Llama-3.1-8B (fast fallback) huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \ --include "*Q4_K_M.gguf" --local-dir ./models ``` --- ## Prompt Templates ### Stage 1: Strategy (Cloud) ```jinja2 {# templates/prompts/stage_strategy.txt #} You are a content strategist for a technical blog about home infrastructure, AI agents, and the messy reality of building things. TOPIC: {{ topic }} CONTENT TYPE: {{ content_type }} RECENT CONTEXT: {{ memory_context }} Your task: Create a brief and outline for a blog post. OUTPUT FORMAT: - Headline options (3) - Angle: One sentence on why this matters now - Key moment: The human trigger (e.g., "7 PM, Aundrea asks if internet is down") - Structure: 5-7 bullet outline - Technical details needed: What specific errors/configs/logs to include - Key takeaway: What the reader learns Be specific. Generic angles get rejected. ``` ### Stage 3: Draft (Local) ```jinja2 {# templates/prompts/stage_draft.txt #} Write a blog post following this exact structure. Match the voice: - Start with the human moment, not the solution - Admit wrong turns before revealing the fix - Use specific versions, error messages, timestamps - End with actionable takeaways BRIEF: {{ strategy_output }} TEMPLATE STRUCTURE: {{ content_template }} VOICE EXAMPLES: {{ voice_samples }} Write 800-1200 words. Do not use AI-speak. Do not say "leverage" or "unlock". ``` ### Stage 2: Structure (Cloud) ```jinja2 {# templates/prompts/stage_structure.txt #} You are a developmental editor. Review this brief against good blog structure. BRIEF: {{ strategy_output }} Provide: 1. Structural validation: Does this angle work? (yes/no + why) 2. Outline critique: Specific improvements to the 5-7 bullet structure 3. Red flags: Is the human moment clear? Is the takeaway actionable? If the angle is weak, say so. Better to fail fast here than after 90 min of GPU time. ``` --- ## Batch Processing & Overnight Pattern Given the 2-3 hour runtime, design for queue-based generation: ```python # content/scheduler.py class OvernightPipelineScheduler: """ Queue posts to generate during low-activity hours. Designed to run overnight while Gaming PC is idle. """ def __init__(self, start_hour=22, end_hour=6): self.start_hour = start_hour # 10 PM self.end_hour = end_hour # 6 AM self.queue = [] def add_to_queue(self, topic, content_type): self.queue.append({ "topic": topic, "content_type": content_type, "added_at": datetime.now(), "status": "pending" }) async def run_nightly_batch(self): """ Check PC availability, wake if needed, process queue. """ status = await local_client.check_availability() if not status["available"] and status["wake_lan_possible"]: await local_client.wake_if_needed(mac_address=GAMING_PC_MAC) logger.info("Gaming PC woken for overnight batch. Waiting 2 min...") await asyncio.sleep(120) # Wait for boot for job in self.queue: if job["status"] == "pending": try: result = await run_pipeline( topic=job["topic"], content_type=job["content_type"] ) job["status"] = "completed" job["result"] = result job["completed_at"] = datetime.now() except Exception as e: job["status"] = "failed" job["error"] = str(e) ``` **Usage pattern:** ```bash # Queue posts throughout the day python -m content.queue --add "DNS incident April 2026" --type how_i_solved python -m content.queue --add "OpenClaw tutorial: custom skills" --type tutorial # Run overnight (cron job at 10 PM) python -m content.queue --run-overnight ``` --- ## Error Handling & Fallbacks | Failure Mode | Detection | Response | |--------------|-----------|----------| | Gaming PC offline | Connection timeout | Fallback to cloud, queue for retry | | Local model OOM | CUDA out of memory | Switch to smaller model (8B), retry | | Nonsense output | Repetition detector | Regenerate with higher temp, flag for review | | Stage timeout | 5min+ no response | Cancel, fallback to cloud, alert | | Cloud API rate limit | 429 response | Exponential backoff, queue for later | --- ## Monitoring & Metrics ## Latency Estimates (Revised) **Reality check:** Local inference at 15 tok/s Γ— 80K tokens = ~90 minutes for Draft alone. | Stage | Model | Tokens | Speed | Time | |-------|-------|--------|-------|------| | 1. Strategy | GLM 5.1 (cloud) | 2K | fast | ~1 min | | 2. Structure | GLM 5.1 (cloud) | 3K | fast | ~1 min | | 3. Draft | Qwen2.5-32B (local) | 80K | ~15 tok/s | **~89 min** | | 4. Revision | Qwen2.5-32B (local) | 40K | ~15 tok/s | **~44 min** | | 5. Polish | GLM 5.1 (cloud) | 2K | fast | ~1 min | | **Total** | | | | **~2.4 hours** | **Design implication:** This is an overnight/batch processing pipeline, not an on-demand tool. Queue posts to generate while you sleep. --- ## Deliverables Checklist - [ ] LocalAI or Ollama installed on Gaming PC - [ ] Qwen2.5-32B model downloaded and tested - [ ] Tailscale hostname configured and reachable from Beelink - [ ] `content/` module created in hoffdesk-api - [ ] Local LLM client with fallback logic - [ ] Cloud LLM client (GLM 5.1) - [ ] Pipeline orchestrator with 5 stages - [ ] Prompt templates for all content types - [ ] Error handling and retry logic - [ ] Metrics collection - [ ] CLI command: `python -m content.generate --topic "DNS incident" --type how_i_solved` --- ## Open Questions 1. **GPU availability**: Is Gaming PC always on, or does it sleep? Wake-on-LAN support? 2. **Windows vs Linux**: Socrates preference for local inference stack? 3. **Model download**: OK to pull 18GB model from HuggingFace, or prefer torrent/manual? 4. **Queue persistence**: Should failed jobs retry automatically or wait for manual trigger? --- *Document: `shared/project-docs/blog/content-generation-pipeline-spec.md`* *Author: Daedalus 🎨* *For: Socrates 🧠 (Implementation)* *Director: Matt (Approval)*