# HoffGraft

**Donor logic steering for local LLMs.** Run a 32B model's reasoning patterns through a 7B–14B chassis, on a single consumer GPU.

Inspired by the [Gator project](https://github.com/Mexor-dev/Gator)'s 35B graft architecture, adapted for Ollama + our existing inference stack.

## How It Works

```
┌──────────────────────────────────────────────────────┐
│ PHASE 0: One-time extraction (~1-2 hrs, CPU-only)    │
│                                                      │
│ 32B Donor (Qwen2.5 32B Instruct Q3_K_M)              │
│   → Loaded via mmap, CPU-only (~14 GB RAM)           │
│   → 5 domains × N prompts each                       │
│   → Capture token-level probability biases           │
│   → Save as compressed .npz (~1-5 MB)                │
│   → Delete donor model; it's not needed anymore      │
│                                                      │
│ PHASE 1: Runtime steering (~100ms overhead)          │
│                                                      │
│ User prompt → Domain classifier → Bias lookup        │
│   → Ollama generate(logit_bias={biases})             │
│   → 7B-14B chassis responds with donor-influenced    │
│     token selection                                  │
│                                                      │
│ Chassis model stays in GPU VRAM at all times.        │
└──────────────────────────────────────────────────────┘
```

### The Insight

When a 32B model reasons about scheduling, it consistently favours certain tokens over others at decision boundaries (e.g., "the best time is..." vs "perhaps we could..."). These preferences can be captured as a **bias vector** and injected into a 7B model's sampling loop. The small model generates the text, but the big model steers the token choices, especially at critical decision points.

This gives you big-model reasoning quality at small-model latency and VRAM cost.

## Domains

| Domain | Purpose | Example Queries |
|--------|---------|-----------------|
| `scheduling` | Calendar conflicts, time estimation, coordination | "When should we schedule the dentist?" |
| `email_triage` | Classification, priority, routing | "What do I need to action in my inbox?" |
| `coordination` | Family messages, task extraction, shopping lists | "Can someone pick up Harper at 3?" |
| `content_generation` | Summaries, briefings, writing | "Generate today's morning briefing" |
| `analysis` | Debugging, root cause, tradeoffs | "Why is the API returning 500s?" |

## Requirements

### Extraction (one-time)

- **32 GB system RAM** (the 32B Q3_K_M donor needs ~14 GB)
- `llama-cpp-python`, `numpy`, `scipy`
- 1-2 hours of CPU time
- No GPU needed for extraction

### Runtime

- **Any GPU that fits the chassis model** (10 GB VRAM for 14B Q4_K_M)
- `numpy` (fingerprints)
- `ollama` (the chassis inference server)
- No extra Python deps; Ollama API calls use `urllib` from the standard library

## Installation

```bash
cd ~/.openclaw/shared/hoffgraft

# Runtime deps (minimal)
pip install numpy

# Extraction deps (only needed on the extraction machine)
pip install llama-cpp-python numpy scipy
```

## Usage

### Step 1: Download the Donor Model

On the machine with 32 GB RAM:

```bash
# ~14 GB download
wget https://huggingface.co/bartowski/Qwen2.5-32B-Instruct-GGUF/resolve/main/Qwen2.5-32B-Instruct-Q3_K_M.gguf \
  -O models/donor.gguf
```

### Step 2: Extract Fingerprints

```bash
python hoffgraft_extract.py \
  --model models/donor.gguf \
  --output fingerprints/hoffgraft_fingerprints.npz \
  --prompts-per-domain 200 \
  --n-ctx 2048

# Takes ~1-2 hours. Progress bar shows domain-by-domain.
# Output: ~1-5 MB .npz file

# Optional: delete donor to free 14 GB disk space
rm models/donor.gguf
```
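To sanity-check the extraction output before wiring it into runtime, the archive can be opened with plain NumPy. This is a minimal sketch: the per-domain key names (`scheduling_tokens`, `scheduling_biases`) are assumptions for illustration, and the actual layout is whatever `hoffgraft_extract.py` writes.

```python
import numpy as np

# Inspect the extracted fingerprint archive. Key names like
# "scheduling_tokens" / "scheduling_biases" are assumed for illustration;
# check hoffgraft_extract.py for the layout it actually writes.
fp = np.load("fingerprints/hoffgraft_fingerprints.npz")

for key in sorted(fp.files):
    arr = fp[key]
    print(f"{key:35s} shape={arr.shape} dtype={arr.dtype}")

# Example: number of captured bias entries for one domain (if present)
if "scheduling_biases" in fp.files:
    print("scheduling bias entries:", fp["scheduling_biases"].shape[0])
```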
ollama_host="http://127.0.0.1:11434", ) result = steerer.steer("What's on my calendar tomorrow?") print(result["domain"]) # "scheduling" print(result["confidence"]) # 0.85 print(result["response"]) # The chassis-generated answer ``` Or as a CLI: ```bash python hoffgraft_steer.py \ --fingerprints fingerprints/hoffgraft_fingerprints.npz \ --chassis qwen2.5-coder:7b \ --prompt "Debug why the API is returning 500 errors" \ --verbose # Output: # [domain=analysis confidence=0.92 biases=128 time=3421ms] # The 500 errors are likely caused by... ``` ### Step 4: Integration with OpenClaw Mount as a tool or replace direct Ollama calls: ```python # In your bot/agent code steerer = HoffGraftSteerer("fingerprints/hoffgraft_fingerprints.npz") def generate_smart(prompt: str) -> str: result = steerer.steer(prompt) return result["response"] ``` ## Testing Before Extraction The steering module works without fingerprints (just falls back to unbiased generation): ```bash # Even with an empty/non-existent fingerprint file, classification works: python hoffgraft_steer.py \ --fingerprints /dev/null \ --chassis qwen2.5-coder:7b \ --prompt "Schedule a dentist appointment" \ --stats ``` ## Our Gaming PC Fit | Resource | Required | Available (3080 Ti + 32GB) | |----------|----------|----------------------------| | Extraction RAM | ~16 GB | 32 GB ✅ | | Chassis VRAM (14B) | ~9 GB | 10 GB ✅ | | Chassis VRAM (7B) | ~5 GB | 10 GB ✅ (lots of room) | | Fingerprint disk | ~5 MB | negligible ✅ | ## Files ``` hoffgraft/ ├── README.md ├── hoffgraft_extract.py # Phase 0: donor extraction └── hoffgraft_steer.py # Phase 1: runtime steering ``` ## Differences from Gator | | Gator | HoffGraft | |---|---|---| | Donor execution | C++ placeholder kernel | llama-cpp-python (real inference) | | Donor model | 35B Q4_K_M | 32B Q3_K_M (fits 32 GB RAM) | | Chassis model | 1.5B Q4_K_M | 7B–14B (your choice) | | Bias capture | Final-token only | Decision-boundary tokens | | Inference backend | Custom kernel + llama-server | Ollama API (standard) | | Integration | Standalone system | Drop-in for existing Ollama setups | | Memory system | LanceDB + SQLite | None (use your existing) | | Persona engine | 6-axis trait system | None (use your existing) | | Dashboard | HTMX FastAPI on :8080 | None (use your existing) | ## Limitations 1. **Keyword-based classification** — The domain classifier is deterministic keyword matching. Works well for our 5 domains (~90% accuracy) but won't handle edge cases. Easy to upgrade to an embedding-based classifier if needed. 2. **Extraction approximates confidence** — Current confidence measurement uses temperature perturbation (checking if the same token appears at 0.1 vs 0.7 temp). A proper logprobs-based extraction would be more precise, but requires lower-level model access. 3. **Single-domain per query** — Currently applies one domain's biases per prompt. For multi-domain queries, biases would need blending (future work). 4. **Untested — this is a prototype.** Needs a real extraction run on the gaming PC with 32 GB RAM to validate.