Content Generation Pipeline - Technical Specification
Author: Daedalus
Date: 2026-04-21 (revised)
Status: APPROVED - Ready for Implementation
Assignee: Socrates (Backend/Infrastructure)
Revision Notes
2026-04-21: Addressed Wadsworth review feedback:
- Reordered stages: Structure now validates angle before Draft (catch problems early)
- Updated latency: ~2.4 hours realistic runtime (not 12 minutes)
- Added overnight/batch processing pattern
- Added Gaming PC availability check + wake-on-LAN support
Executive Summary
A tiered model routing system for HoffDesk blog content generation. Cloud models (GLM 5.1) handle strategy and polish. Local models (Gaming PC 3080 Ti) handle volume draft generation. Target: 10 posts/month at ~$3 cloud cost vs ~$30 all-cloud.
Architecture Overview
                            CONTENT PIPELINE

  STAGE 1: STRATEGY       STAGE 2: STRUCTURE      STAGE 3: DRAFT
  +----------------+      +----------------+      +----------------+
  |    GLM 5.1     |      |    GLM 5.1     |      |   Local 32B    |
  |    (Cloud)     |      |    (Cloud)     |      |  (Gaming PC)   |
  |   ~2K tokens   |      |   ~3K tokens   |      |  ~80K tokens   |
  +--------+-------+      +--------+-------+      +--------+-------+
           |                       |                       |
           v                       v                       v
  +----------------+      +----------------+      +----------------+
  |    Brief +     |----->|   Validated    |----->|     Draft      |
  |    Outline     |      |   Structure    |      |    (local)     |
  +----------------+      +----------------+      +--------+-------+
                                                           |
  STAGE 5: POLISH         STAGE 4: REVISION                |
  +----------------+      +----------------+               |
  |    GLM 5.1     |      |   Local 32B    |<--------------+
  |    (Cloud)     |<-----|  (Gaming PC)   |
  |   ~2K tokens   |      |  ~40K tokens   |
  +--------+-------+      +----------------+
           |
           v
  +----------------+
  |   Final Post   |
  +----------------+
Stage Order Rationale: Structure comes before Draft to validate the angle with cheap cloud inference before burning 90 minutes of GPU time on a draft that might be off-target. Catch direction problems early.
Hardware Requirements
Gaming PC (Local Inference Target)
| Component | Requirement | Notes |
|---|---|---|
| GPU | RTX 3080 Ti (10GB VRAM) | Already available |
| RAM | 32GB+ system RAM | For model offloading if needed |
| Storage | 50GB free SSD | Model weights + cache |
| Network | Tailscale connected | Route from titanium-butler |
| OS | Windows 11 or Linux | Socrates preference |
Software Stack
Option A: LocalAI (Recommended)
LocalAI provides OpenAI-compatible API, runs on CPU/GPU, supports multiple backends.
# local-ai-config.yaml
localai:
  api:
    bind: "0.0.0.0:8080"
    cors: true
  models:
    # Primary draft model - Qwen2.5 32B 4-bit
    - name: "draft-generator"
      backend: "llama-cpp"
      model: "/models/qwen2.5-32b-instruct-q4_k_m.gguf"
      context_size: 8192
      threads: 8
      f16: true
      gpu_layers: 35  # Max for 10GB VRAM
    # Fallback for shorter tasks
    - name: "fast-revision"
      backend: "llama-cpp"
      model: "/models/llama-3.1-8b-instruct-q4_k_m.gguf"
      context_size: 4096
      threads: 8
      gpu_layers: 35
Option B: vLLM (Higher throughput, needs more VRAM)
If Qwen2.5-32B-AWQ fits in 10GB VRAM (doubtful: vLLM's paging manages the KV cache, not the weights, and the AWQ weights alone are roughly 18GB):
# vllm serve command
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-32B-Instruct-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--port 8000
Option C: Ollama (Simplest, less efficient)
# ollama setup
ollama pull qwen2.5:32b
ollama serve
Recommendation: Start with Ollama for rapid testing, migrate to LocalAI for production pipeline.
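Because all three options expose an OpenAI-compatible endpoint, the pipeline code should not need to change when migrating from Ollama to LocalAI; only the base URL and model name do. A minimal sketch using the openai Python package (port and model name are placeholders that depend on which backend is running):

# Works against Ollama (default port 11434) or LocalAI (8080) - only base_url/model change
from openai import OpenAI

client = OpenAI(
    base_url="http://gaming-pc.tailXXXX.ts.net:11434/v1",  # Ollama; use :8080/v1 for LocalAI
    api_key="sk-local-no-key-required",  # local backends ignore the key
)

response = client.chat.completions.create(
    model="qwen2.5:32b",  # Ollama tag; "draft-generator" under the LocalAI config above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)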
Tailscale Integration
Networking Setup
titanium-butler (Beelink)                     Gaming PC (3080 Ti)
          |                                            |
          |       Tailscale network (100.x.x.x)        |
          +--------------------------------------------+
                               |
                               v
                     +-------------------+
                     |   LocalAI API     |  <- http://gaming-pc:8080/v1
                     |   (or vLLM)       |
                     +-------------------+
Service Discovery
# pipeline/config.py
import os

LOCAL_LLM_HOST = "gaming-pc.tailXXXX.ts.net"  # Tailscale hostname
LOCAL_LLM_PORT = 8080
LOCAL_LLM_API_KEY = "sk-local-no-key-required"  # LocalAI default
CLOUD_LLM_API_KEY = os.getenv("GLM_API_KEY")
CLOUD_LLM_BASE_URL = "https://api.glm.ai/v1"
Pipeline Implementation
# Stage Orchestrator
def stage_strategy(topic, content_type, memory_query): ...
def stage_structure(strategy_output): ... # NEW: Validate before Draft
def stage_draft(validated_brief, content_type): ...
def stage_revision(draft_output, edit_notes): ...
def stage_polish(revision_output, content_type): ...
Stage Orchestrator
# content/pipeline.py
from typing import Literal, Optional

async def run_pipeline(
    topic: str,
    content_type: Literal["roundup", "how_i_solved", "build_log", "essay"],
    memory_query: Optional[str] = None
) -> dict:
    """
    Execute the 5-stage pipeline for content generation.

    REVISED ORDER: Structure is validated before the expensive Draft stage.
    Typical runtime: 2-3 hours (designed for overnight/batch processing).
    Returns the final post with metadata and cost breakdown.
    """
    # Stage 1: Strategy (Cloud - GLM 5.1) - ~1 min
    strategy = stage_strategy(topic, content_type, memory_query)

    # Stage 2: Structure (Cloud - GLM 5.1) - ~1 min
    # Validate the angle BEFORE burning GPU time
    structure = stage_structure(strategy.output_text)

    # Stage 3: Draft (Local - Gaming PC) - ~90 min
    draft = stage_draft(structure.validated_brief, content_type)

    # Stage 4: Revision (Local - Gaming PC) - ~45 min
    revision = stage_revision(draft.output_text, structure.edit_notes)

    # Stage 5: Polish (Cloud - GLM 5.1) - ~1 min
    final = stage_polish(revision.output_text, content_type)

    return {
        "final_post": final.output_text,
        # ... rest unchanged
    }
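One detail the orchestrator above leaves implicit is the fail-fast behavior from the Stage Order Rationale: if Stage 2 rejects the angle, the run should stop before any GPU time is spent. A minimal sketch of that guard, assuming a hypothetical angle_approved field on the Structure result and a PipelineAborted exception (neither is defined elsewhere in this spec):

# Hypothetical fail-fast guard between Stage 2 and Stage 3 (names are illustrative)
class PipelineAborted(Exception):
    """Raised when the Structure stage rejects the angle before GPU time is spent."""

def guard_structure(structure) -> None:
    # angle_approved is an assumed field on the Stage 2 result object
    if not getattr(structure, "angle_approved", True):
        raise PipelineAborted(f"Angle rejected at Structure stage: {structure.edit_notes}")

# In run_pipeline, this would sit between stage_structure() and stage_draft():
#     structure = stage_structure(strategy.output_text)
#     guard_structure(structure)   # abort here instead of ~90 min into the Draft
#     draft = stage_draft(structure.validated_brief, content_type)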
Local LLM Client
# content/local_client.py
import logging
import subprocess
from typing import Iterator, Optional

import httpx

logger = logging.getLogger(__name__)


class LocalLLMClient:
    """
    Client for local inference on the Gaming PC via Tailscale.
    Handles connection retries, timeouts, error fallback, and wake-on-LAN.
    """

    def __init__(
        self,
        base_url: str = "http://gaming-pc.tailXXXX.ts.net:8080",
        model: str = "draft-generator",
        timeout: int = 300,  # 5 min for long generations
        tailscale_hostname: str = "gaming-pc"
    ):
        # Async client: all request methods below are awaited
        self.client = httpx.AsyncClient(base_url=base_url, timeout=timeout)
        self.model = model
        self.tailscale_hostname = tailscale_hostname

    async def check_availability(self) -> dict:
        """
        Check if the Gaming PC is online and the model is loaded.

        Returns: {"available": bool, "status": str, "wake_lan_possible": bool}
        """
        try:
            response = await self.client.get("/ready", timeout=5.0)
            if response.status_code == 200:
                return {
                    "available": True,
                    "status": "ready",
                    "wake_lan_possible": False
                }
        except (httpx.ConnectError, httpx.TimeoutException):
            # Service unreachable - check if the host itself answers a ping
            ping_result = subprocess.run(
                ["ping", "-c", "1", "-W", "2", self.tailscale_hostname],
                capture_output=True
            )
            if ping_result.returncode != 0:
                return {
                    "available": False,
                    "status": "offline",
                    "wake_lan_possible": True  # Could try WoL
                }
        # Host is up but the inference service is not responding
        return {
            "available": False,
            "status": "online_but_service_down",
            "wake_lan_possible": False
        }

    async def wake_if_needed(self, mac_address: Optional[str] = None) -> bool:
        """
        Attempt wake-on-LAN if the Gaming PC is offline.
        Returns True if a wake packet was sent (doesn't guarantee boot).
        """
        if not mac_address:
            return False
        # Send WoL magic packet
        subprocess.run(["wakeonlan", mac_address], check=False)
        logger.info(f"Wake-on-LAN sent to {mac_address}")
        return True

    async def generate_with_availability_check(
        self,
        prompt: str,
        max_tokens: int = 2048,
        temperature: float = 0.7,
        auto_wake: bool = False,
        mac_address: Optional[str] = None
    ) -> str:
        """
        Generate with a pre-flight availability check. Optionally wake the PC if offline.
        """
        status = await self.check_availability()
        if not status["available"]:
            if auto_wake and status["wake_lan_possible"] and mac_address:
                await self.wake_if_needed(mac_address)
                raise LocalLLMUnavailable(
                    "Gaming PC offline. Wake-on-LAN sent. Retry in 2 minutes."
                )
            else:
                logger.warning("Local LLM unavailable, falling back to cloud")
                return await cloud_client.generate(prompt, max_tokens, temperature)
        return await self.generate(prompt, max_tokens, temperature, stream=False)


class LocalLLMUnavailable(Exception):
    """Raised when the local LLM is unreachable and fallback is disabled."""
    pass
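A usage sketch for the client above, as it might be called from a draft stage. The MAC address and the requeue policy in the except branch are assumptions, not part of the client:

# Example call site - illustrative only; the MAC address and retry policy are assumptions
import asyncio

async def draft_section(prompt: str) -> str:
    client = LocalLLMClient(model="draft-generator")
    try:
        return await client.generate_with_availability_check(
            prompt,
            max_tokens=2048,
            auto_wake=True,
            mac_address="AA:BB:CC:DD:EE:FF",  # placeholder; see the wake-on-LAN open question
        )
    except LocalLLMUnavailable:
        # Caller decides what to do: here we let the overnight scheduler requeue the job
        raise

# asyncio.run(draft_section("Write the opening section of the post..."))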
Model Configuration
Recommended Models for 10GB VRAM
| Model | Size | Quantization | VRAM | Speed | Use Case |
|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 32B | Q4_K_M | ~9GB | ~15 tok/s | Primary draft gen |
| Llama-3.3-70B | 70B | Q4_K_M | ~40GB | N/A | Won't fit, skip |
| Llama-3.1-8B | 8B | Q4_K_M | ~5GB | ~40 tok/s | Fast revisions |
| DeepSeek-Coder-V2 | 16B | Q4_K_M | ~10GB | ~20 tok/s | Code-heavy posts |
Download commands:
# Qwen2.5-32B (primary)
huggingface-cli download bartowski/Qwen2.5-32B-Instruct-GGUF \
--include "*Q4_K_M.gguf" --local-dir ./models
# Llama-3.1-8B (fast fallback)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--include "*Q4_K_M.gguf" --local-dir ./models
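Back-of-envelope check on the table above (assuming Q4_K_M averages roughly 4.8 bits per weight, which is an approximation): a 32B model lands around 19GB on disk, so it cannot sit entirely in 10GB of VRAM. That is why the LocalAI config caps gpu_layers at 35 and why the 32GB system RAM requirement matters for offload.

# Rough size estimate for a Q4_K_M quantized 32B model (~4.8 bits/weight is an assumption)
params = 32e9
bits_per_weight = 4.8
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB of weights")  # ~19 GB -> partial GPU offload on a 10 GB card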
Prompt Templates
Stage 1: Strategy (Cloud)
{# templates/prompts/stage_strategy.txt #}
You are a content strategist for a technical blog about home infrastructure,
AI agents, and the messy reality of building things.
TOPIC: {{ topic }}
CONTENT TYPE: {{ content_type }}
RECENT CONTEXT:
{{ memory_context }}
Your task: Create a brief and outline for a blog post.
OUTPUT FORMAT:
- Headline options (3)
- Angle: One sentence on why this matters now
- Key moment: The human trigger (e.g., "7 PM, Aundrea asks if internet is down")
- Structure: 5-7 bullet outline
- Technical details needed: What specific errors/configs/logs to include
- Key takeaway: What the reader learns
Be specific. Generic angles get rejected.
Stage 2: Structure (Cloud)
{# templates/prompts/stage_structure.txt #}
You are a developmental editor. Review this brief against good blog structure.
BRIEF:
{{ strategy_output }}
Provide:
1. Structural validation: Does this angle work? (yes/no + why)
2. Outline critique: Specific improvements to the 5-7 bullet structure
3. Red flags: Is the human moment clear? Is the takeaway actionable?
If the angle is weak, say so. Better to fail fast here than after 90 min of GPU time.
Stage 3: Draft (Local)
{# templates/prompts/stage_draft.txt #}
Write a blog post following this exact structure. Match the voice:
- Start with the human moment, not the solution
- Admit wrong turns before revealing the fix
- Use specific versions, error messages, timestamps
- End with actionable takeaways
BRIEF:
{{ strategy_output }}
TEMPLATE STRUCTURE:
{{ content_template }}
VOICE EXAMPLES:
{{ voice_samples }}
Write 800-1200 words. Do not use AI-speak. Do not say "leverage" or "unlock".
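The {{ }} / {# #} syntax above is Jinja-style. A minimal rendering sketch, assuming the templates live under templates/prompts/ as named in their headers; the module name and function are illustrative, the variable names come from the templates:

# content/prompts.py - illustrative Jinja2 rendering of the stage templates
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates/prompts"))

def render_strategy_prompt(topic: str, content_type: str, memory_context: str = "") -> str:
    template = env.get_template("stage_strategy.txt")
    return template.render(
        topic=topic,
        content_type=content_type,
        memory_context=memory_context,
    )

# prompt = render_strategy_prompt("DNS incident April 2026", "how_i_solved")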
Batch Processing & Overnight Pattern
Given the 2-3 hour runtime, design for queue-based generation:
# content/scheduler.py
import asyncio
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


class OvernightPipelineScheduler:
    """
    Queue posts to generate during low-activity hours.
    Designed to run overnight while the Gaming PC is idle.
    """

    def __init__(self, start_hour=22, end_hour=6):
        self.start_hour = start_hour  # 10 PM
        self.end_hour = end_hour      # 6 AM
        self.queue = []

    def add_to_queue(self, topic, content_type):
        self.queue.append({
            "topic": topic,
            "content_type": content_type,
            "added_at": datetime.now(),
            "status": "pending"
        })

    async def run_nightly_batch(self):
        """
        Check PC availability, wake it if needed, then process the queue.
        """
        status = await local_client.check_availability()
        if not status["available"] and status["wake_lan_possible"]:
            await local_client.wake_if_needed(mac_address=GAMING_PC_MAC)
            logger.info("Gaming PC woken for overnight batch. Waiting 2 min...")
            await asyncio.sleep(120)  # Wait for boot

        for job in self.queue:
            if job["status"] == "pending":
                try:
                    result = await run_pipeline(
                        topic=job["topic"],
                        content_type=job["content_type"]
                    )
                    job["status"] = "completed"
                    job["result"] = result
                    job["completed_at"] = datetime.now()
                except Exception as e:
                    job["status"] = "failed"
                    job["error"] = str(e)
Usage pattern:
# Queue posts throughout the day
python -m content.queue --add "DNS incident April 2026" --type how_i_solved
python -m content.queue --add "OpenClaw tutorial: custom skills" --type tutorial
# Run overnight (cron job at 10 PM)
python -m content.queue --run-overnight
Error Handling & Fallbacks
| Failure Mode | Detection | Response |
|---|---|---|
| Gaming PC offline | Connection timeout | Fallback to cloud, queue for retry |
| Local model OOM | CUDA out of memory | Switch to smaller model (8B), retry |
| Nonsense output | Repetition detector | Regenerate with higher temp, flag for review |
| Stage timeout | 5min+ no response | Cancel, fallback to cloud, alert |
| Cloud API rate limit | 429 response | Exponential backoff, queue for later |
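For the cloud rate-limit row, a minimal exponential backoff sketch. The retry count and base delay are arbitrary placeholders, and it assumes the cloud client surfaces non-2xx responses as httpx.HTTPStatusError (cloud_client is the same assumed client referenced in local_client.py):

# Illustrative backoff wrapper for 429s from the cloud API (parameters are placeholders)
import asyncio
import httpx

async def generate_with_backoff(prompt: str, max_retries: int = 5, base_delay: float = 2.0) -> str:
    for attempt in range(max_retries):
        try:
            return await cloud_client.generate(prompt)
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code != 429:
                raise
            delay = base_delay * (2 ** attempt)  # 2s, 4s, 8s, 16s, 32s
            await asyncio.sleep(delay)
    raise RuntimeError("Cloud API still rate-limited after retries; queue for later")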
Monitoring & Metrics
Latency Estimates (Revised)
Reality check: 80K tokens of local inference at ~15 tok/s works out to ~90 minutes for the Draft stage alone.
| Stage | Model | Tokens | Speed | Time |
|---|---|---|---|---|
| 1. Strategy | GLM 5.1 (cloud) | 2K | fast | ~1 min |
| 2. Structure | GLM 5.1 (cloud) | 3K | fast | ~1 min |
| 3. Draft | Qwen2.5-32B (local) | 80K | ~15 tok/s | ~89 min |
| 4. Revision | Qwen2.5-32B (local) | 40K | ~15 tok/s | ~44 min |
| 5. Polish | GLM 5.1 (cloud) | 2K | fast | ~1 min |
| Total | | | | ~2.4 hours |
Design implication: This is an overnight/batch processing pipeline, not an on-demand tool. Queue posts to generate while you sleep.
Deliverables Checklist
- [ ] LocalAI or Ollama installed on Gaming PC
- [ ] Qwen2.5-32B model downloaded and tested
- [ ] Tailscale hostname configured and reachable from Beelink
- [ ] content/ module created in hoffdesk-api
- [ ] Local LLM client with fallback logic
- [ ] Cloud LLM client (GLM 5.1)
- [ ] Pipeline orchestrator with 5 stages
- [ ] Prompt templates for all content types
- [ ] Error handling and retry logic
- [ ] Metrics collection
- [ ] CLI command: python -m content.generate --topic "DNS incident" --type how_i_solved
Open Questions
- GPU availability: Is Gaming PC always on, or does it sleep? Wake-on-LAN support?
- Windows vs Linux: Socrates preference for local inference stack?
- Model download: OK to pull 18GB model from HuggingFace, or prefer torrent/manual?
- Queue persistence: Should failed jobs retry automatically or wait for manual trigger?
Document: shared/project-docs/blog/content-generation-pipeline-spec.md
Author: Daedalus
For: Socrates (Implementation)
Director: Matt (Approval)