# HoffGraft

Donor logic steering for local LLMs. Run a 32B model's reasoning patterns through a 7B–14B chassis on a single consumer GPU.

Inspired by the Gator project's 35B graft architecture, adapted for Ollama + our existing inference stack.
## How It Works

```text
┌────────────────────────────────────────────────────┐
│ PHASE 0: One-time extraction (~1-2 hrs, CPU-only)  │
│                                                    │
│ 32B Donor (Qwen2.5 32B Instruct Q3_K_M)            │
│   → Loaded via mmap, CPU-only (~14 GB RAM)         │
│   → 5 domains × N prompts each                     │
│   → Capture token-level probability biases         │
│   → Save as compressed .npz (~1-5 MB)              │
│   → Delete donor model (no longer needed)          │
│                                                    │
│ PHASE 1: Runtime steering (~100ms overhead)        │
│                                                    │
│ User prompt → Domain classifier → Bias lookup      │
│   → Ollama generate(logit_bias={biases})           │
│   → 7B-14B chassis responds with donor-influenced  │
│     token selection                                │
│                                                    │
│ Chassis model stays in GPU VRAM at all times.      │
└────────────────────────────────────────────────────┘
```
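The only artifact shared between the two phases is the fingerprint file. One plausible layout, purely for illustration (the actual `.npz` schema written by `hoffgraft_extract.py` may differ):

```python
import numpy as np

# Hypothetical schema: per domain, parallel arrays of chassis-vocabulary
# token IDs and the additive logit bias for each. Names are illustrative.
fingerprints = {
    "scheduling_token_ids": np.array([312, 9875, 20114], dtype=np.int32),
    "scheduling_biases":    np.array([2.5, -1.0, 1.75], dtype=np.float32),
    # ...one pair of arrays per domain
}
np.savez_compressed("hoffgraft_fingerprints.npz", **fingerprints)

# Phase 1 reads it back with a plain np.load:
data = np.load("hoffgraft_fingerprints.npz")
bias = dict(zip(data["scheduling_token_ids"].tolist(),
                data["scheduling_biases"].tolist()))
```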
## The Insight
When a 32B model reasons about scheduling, it consistently favours certain tokens over others at decision boundaries (e.g., "the best time is..." vs "perhaps we could..."). These preferences can be captured as a bias vector and injected into a 7B model's sampling loop. The small model generates the text, but the big model steers the token choices — especially at critical decision points.
The goal: big-model reasoning quality at small-model latency and VRAM cost.
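Mechanically, steering is an additive shift on the chassis logits before sampling. A toy NumPy sketch of that mechanism, with a made-up vocabulary and bias values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_bias(logits, bias, temperature=0.7):
    """Add donor-derived offsets to selected token logits, then sample."""
    steered = logits.astype(np.float64)
    for token_id, offset in bias.items():
        steered[token_id] += offset        # the donor's nudge
    z = (steered - steered.max()) / temperature
    probs = np.exp(z)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy 4-token vocabulary; the donor strongly prefers token 2.
logits = np.array([1.0, 1.1, 0.9, 0.2])
print(sample_with_bias(logits, bias={2: 3.0}))  # almost always 2
```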
## Domains

| Domain | Purpose | Example Queries |
|---|---|---|
| `scheduling` | Calendar conflicts, time estimation, coordination | "When should we schedule the dentist?" |
| `email_triage` | Classification, priority, routing | "What do I need to action in my inbox?" |
| `coordination` | Family messages, task extraction, shopping lists | "Can someone pick up Harper at 3?" |
| `content_generation` | Summaries, briefings, writing | "Generate today's morning briefing" |
| `analysis` | Debugging, root cause, tradeoffs | "Why is the API returning 500s?" |
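The classifier that routes a prompt to one of these domains is plain keyword matching (see Limitations). A minimal sketch of the idea; the keyword tables and scoring below are illustrative, not the ones shipped in `hoffgraft_steer.py`:

```python
# Illustrative keyword tables; the real classifier may differ.
DOMAIN_KEYWORDS = {
    "scheduling": {"calendar", "schedule", "appointment", "when", "time"},
    "email_triage": {"inbox", "email", "reply", "action", "unread"},
    "coordination": {"pick up", "shopping", "family", "list"},
    "content_generation": {"summarize", "briefing", "write", "generate"},
    "analysis": {"debug", "why", "error", "root cause", "500"},
}

def classify(prompt: str) -> tuple[str, float]:
    """Return (domain, confidence) by counting keyword hits."""
    text = prompt.lower()
    scores = {d: sum(kw in text for kw in kws)
              for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    total = sum(scores.values()) or 1
    return best, scores[best] / total

print(classify("Why is the API returning 500 errors?"))  # ('analysis', 1.0)
```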
## Requirements

### Extraction (one-time)

- 32 GB system RAM (the 32B Q3_K_M donor needs ~14 GB)
- `llama-cpp-python`, `numpy`, `scipy`
- 1-2 hours of CPU time
- No GPU needed for extraction

### Runtime

- Any GPU that fits the chassis model (~10 GB VRAM for a 14B Q4_K_M)
- `numpy` (fingerprints)
- `ollama` (chassis inference)
- No extra Python deps: Ollama API calls go through `urllib`
## Installation

```bash
cd ~/.openclaw/shared/hoffgraft

# Runtime deps (minimal)
pip install numpy

# Extraction deps (only needed on the extraction machine)
pip install llama-cpp-python numpy scipy
```
## Usage

### Step 1: Download the Donor Model

On the machine with 32 GB RAM:

```bash
# ~14 GB download
wget https://huggingface.co/bartowski/Qwen2.5-32B-Instruct-GGUF/resolve/main/Qwen2.5-32B-Instruct-Q3_K_M.gguf \
  -O models/donor.gguf
```
### Step 2: Extract Fingerprints

```bash
python hoffgraft_extract.py \
  --model models/donor.gguf \
  --output fingerprints/hoffgraft_fingerprints.npz \
  --prompts-per-domain 200 \
  --n-ctx 2048

# Takes ~1-2 hours; the progress bar advances domain by domain.
# Output: ~1-5 MB .npz file

# Optional: delete the donor to free ~14 GB of disk space
rm models/donor.gguf
```
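Internally, the extractor needs the donor's per-token probabilities. One way to get them from `llama-cpp-python` is the OpenAI-style `logprobs` output; this is a sketch of the approach, not the actual `hoffgraft_extract.py` internals:

```python
from llama_cpp import Llama

# CPU-only load; mmap keeps resident RAM near the GGUF file size (~14 GB).
llm = Llama(model_path="models/donor.gguf", n_ctx=2048,
            n_gpu_layers=0, verbose=False)

out = llm(
    "When should we schedule the dentist?",
    max_tokens=32,
    temperature=0.0,
    logprobs=5,  # top-5 alternatives per generated token
)
lp = out["choices"][0]["logprobs"]
for token, alts in zip(lp["tokens"], lp["top_logprobs"]):
    print(token, alts)  # chosen token vs. its nearest competitors
```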
### Step 3: Use at Runtime

```python
from hoffgraft_steer import HoffGraftSteerer

steerer = HoffGraftSteerer(
    fingerprints_path="fingerprints/hoffgraft_fingerprints.npz",
    chassis_model="qwen2.5-coder:14b",  # or 7b, llama3.1:8b, etc.
    ollama_host="http://127.0.0.1:11434",
)

result = steerer.steer("What's on my calendar tomorrow?")
print(result["domain"])      # "scheduling"
print(result["confidence"])  # 0.85
print(result["response"])    # The chassis-generated answer
```
Or as a CLI:

```bash
python hoffgraft_steer.py \
  --fingerprints fingerprints/hoffgraft_fingerprints.npz \
  --chassis qwen2.5-coder:7b \
  --prompt "Debug why the API is returning 500 errors" \
  --verbose

# Output:
# [domain=analysis confidence=0.92 biases=128 time=3421ms]
# The 500 errors are likely caused by...
```
### Step 4: Integration with OpenClaw

Mount as a tool or replace direct Ollama calls:

```python
# In your bot/agent code
steerer = HoffGraftSteerer("fingerprints/hoffgraft_fingerprints.npz")

def generate_smart(prompt: str) -> str:
    result = steerer.steer(prompt)
    return result["response"]
```
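Under the hood, the steerer keeps its dependency footprint at zero beyond `numpy`: Ollama calls go through `urllib`. A sketch of what a steered generate request could look like (the `logit_bias` option mirrors the pipeline diagram above; confirm that your Ollama version honours it):

```python
import json
import urllib.request

def ollama_generate(prompt: str, model: str, logit_bias: dict[int, float],
                    host: str = "http://127.0.0.1:11434") -> str:
    """Minimal non-streaming call to Ollama's /api/generate."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Per-token additive biases, as in the pipeline diagram; whether
        # this option is applied depends on the Ollama version in use.
        "options": {"logit_bias": {str(k): v for k, v in logit_bias.items()}},
    }
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```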
## Testing Before Extraction

The steering module works without fingerprints (it just falls back to unbiased generation):

```bash
# Even with an empty/non-existent fingerprint file, classification works:
python hoffgraft_steer.py \
  --fingerprints /dev/null \
  --chassis qwen2.5-coder:7b \
  --prompt "Schedule a dentist appointment" \
  --stats
```
## Our Gaming PC Fit

| Resource | Required | Available (RTX 3080 Ti + 32 GB RAM) |
|---|---|---|
| Extraction RAM | ~16 GB | 32 GB ✅ |
| Chassis VRAM (14B) | ~9 GB | 12 GB ✅ |
| Chassis VRAM (7B) | ~5 GB | 12 GB ✅ (lots of room) |
| Fingerprint disk | ~5 MB | negligible ✅ |
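The ~9 GB figure for the 14B chassis is straightforward quantization arithmetic (the parameter count and bits/weight below are approximations):

```python
params = 14.8e9        # Qwen2.5 14B, approximate parameter count
bits_per_weight = 4.8  # Q4_K_M effective average, approximate
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"{weights_gb:.1f} GB weights")  # ~8.9 GB, plus KV cache/overhead
```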
## Files

```text
hoffgraft/
├── README.md
├── hoffgraft_extract.py   # Phase 0: donor extraction
└── hoffgraft_steer.py     # Phase 1: runtime steering
```
## Differences from Gator

| | Gator | HoffGraft |
|---|---|---|
| Donor execution | C++ placeholder kernel | `llama-cpp-python` (real inference) |
| Donor model | 35B Q4_K_M | 32B Q3_K_M (fits in 32 GB RAM) |
| Chassis model | 1.5B Q4_K_M | 7B–14B (your choice) |
| Bias capture | Final-token only | Decision-boundary tokens |
| Inference backend | Custom kernel + llama-server | Ollama API (standard) |
| Integration | Standalone system | Drop-in for existing Ollama setups |
| Memory system | LanceDB + SQLite | None (use your existing) |
| Persona engine | 6-axis trait system | None (use your existing) |
| Dashboard | HTMX FastAPI on :8080 | None (use your existing) |
## Limitations

- **Keyword-based classification.** The domain classifier is deterministic keyword matching. It works well for our 5 domains (~90% accuracy) but won't handle edge cases. It's easy to upgrade to an embedding-based classifier if needed.
- **Extraction approximates confidence.** The current confidence measurement uses temperature perturbation (checking whether the same token appears at temperature 0.1 and 0.7). A proper logprobs-based extraction would be more precise, but requires lower-level model access.
- **Single domain per query.** Only one domain's biases are applied per prompt. Multi-domain queries would need bias blending (future work; a possible scheme is sketched below).
- **Untested.** This is a prototype. It needs a real extraction run on the gaming PC with 32 GB RAM to validate.
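One possible blending scheme for the single-domain limitation above, weighting each domain's bias vector by classifier confidence (a hypothetical helper, not part of the current code):

```python
def blend_biases(domain_biases: dict[str, dict[int, float]],
                 weights: dict[str, float]) -> dict[int, float]:
    """Confidence-weighted average of per-domain bias vectors.

    domain_biases maps domain -> {token_id: bias}; weights maps
    domain -> classifier confidence, e.g. {"scheduling": 0.6,
    "coordination": 0.4}. Hypothetical sketch for future work.
    """
    total = sum(weights.values()) or 1.0
    blended: dict[int, float] = {}
    for domain, biases in domain_biases.items():
        w = weights.get(domain, 0.0) / total
        for token_id, b in biases.items():
            blended[token_id] = blended.get(token_id, 0.0) + w * b
    return blended
```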