HoffGraft

Donor logic steering for local LLMs. Run a 32B model's reasoning patterns through a 7B–14B chassis — on a single consumer GPU.

Inspired by the Gator project's 35B graft architecture, adapted for Ollama + our existing inference stack.

How It Works

┌──────────────────────────────────────────────────────┐
│  PHASE 0: One-time extraction (~1-2 hrs, CPU-only)   │
│                                                      │
│  32B Donor (Qwen2.5 32B Instruct Q3_K_M)            │
│  → Loaded via mmap, CPU-only (~14 GB RAM)            │
│  → 5 domains × N prompts each                       │
│  → Capture token-level probability biases            │
│  → Save as compressed .npz (~1-5 MB)                │
│  → Delete donor model — it's not needed anymore      │
│                                                      │
│  PHASE 1: Runtime steering (~100ms overhead)         │
│                                                      │
│  User prompt → Domain classifier → Bias lookup       │
│  → Ollama generate(logit_bias={biases})              │
│  → 7B-14B chassis responds with donor-influenced     │
│    token selection                                    │
│                                                      │
│  Chassis model stays in GPU VRAM at all times.       │
└──────────────────────────────────────────────────────┘

The Insight

When a 32B model reasons about scheduling, it consistently favours certain tokens over others at decision boundaries (e.g., "the best time is..." vs "perhaps we could..."). These preferences can be captured as a bias vector and injected into a 7B model's sampling loop. The small model generates the text, but the big model steers the token choices — especially at critical decision points.

This gives you big-model reasoning quality at small-model latency and VRAM cost.
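
Conceptually, steering is just a shift applied to the chassis model's next-token logits before sampling. The sketch below illustrates that idea in plain numpy; it is not the actual hoffgraft_steer.py code, which hands the biases to Ollama rather than sampling itself.

import numpy as np

def sample_with_donor_bias(chassis_logits: np.ndarray,
                           donor_bias: dict[int, float],
                           temperature: float = 0.7) -> int:
    """Illustrative only: shift the chassis logits by donor-derived biases,
    then sample the next token."""
    logits = chassis_logits.copy()
    for token_id, bias in donor_bias.items():
        logits[token_id] += bias              # nudge tokens the donor favours
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))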

Domains

| Domain | Purpose | Example Queries |
|---|---|---|
| scheduling | Calendar conflicts, time estimation, coordination | "When should we schedule the dentist?" |
| email_triage | Classification, priority, routing | "What do I need to action in my inbox?" |
| coordination | Family messages, task extraction, shopping lists | "Can someone pick up Harper at 3?" |
| content_generation | Summaries, briefings, writing | "Generate today's morning briefing" |
| analysis | Debugging, root cause, tradeoffs | "Why is the API returning 500s?" |
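
The classifier behind this table is deterministic keyword matching (see Limitations). The following is a hypothetical sketch of that approach; the keyword lists and scoring in the real hoffgraft_steer.py may differ.

# Hypothetical keyword lists; the real classifier in hoffgraft_steer.py may differ.
DOMAIN_KEYWORDS = {
    "scheduling": ["schedule", "calendar", "appointment", "meeting", "when"],
    "email_triage": ["inbox", "email", "reply", "unread", "triage"],
    "coordination": ["pick up", "shopping", "family", "remind", "errand"],
    "content_generation": ["summary", "briefing", "write", "draft", "generate"],
    "analysis": ["why", "debug", "error", "root cause", "tradeoff"],
}

def classify_domain(prompt: str) -> tuple[str, float]:
    """Return (domain, confidence) by counting keyword hits in the prompt."""
    text = prompt.lower()
    scores = {d: sum(kw in text for kw in kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    return best, (scores[best] / total if total else 0.0)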

Requirements

Extraction (one-time)

  • 32 GB system RAM (the 32B Q3_K_M donor is ~14 GB)
  • llama-cpp-python, numpy, scipy
  • 1-2 hours of CPU time
  • No GPU needed for extraction

Runtime

  • Any GPU that fits the chassis model (10 GB VRAM for 14B Q4_K_M)
  • numpy (fingerprints)
  • ollama (chassis inference)
  • No extra deps — uses urllib for Ollama API calls
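
For reference, a single non-streaming call to Ollama's /api/generate endpoint via urllib could look like the sketch below. The logit_bias option mirrors this project's design as described above; whether a given Ollama version honours it is an assumption, not a documented guarantee.

import json
import urllib.request

def ollama_generate(prompt: str, biases: dict[int, float],
                    model: str = "qwen2.5-coder:7b",
                    host: str = "http://127.0.0.1:11434") -> str:
    """Minimal sketch of a non-streaming Ollama /api/generate call.
    'logit_bias' follows HoffGraft's design; support for it in Ollama's
    options is an assumption, not a standard documented option."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"logit_bias": biases},   # donor-derived token biases (assumed option)
    }
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]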

Installation

cd ~/.openclaw/shared/hoffgraft

# Runtime deps (minimal)
pip install numpy

# Extraction deps (only needed on the extraction machine)
pip install llama-cpp-python numpy scipy

Usage

Step 1: Download the Donor Model

On the machine with 32 GB RAM:

# ~14 GB download
wget https://huggingface.co/bartowski/Qwen2.5-32B-Instruct-GGUF/resolve/main/Qwen2.5-32B-Instruct-Q3_K_M.gguf \
  -O models/donor.gguf

Step 2: Extract Fingerprints

python hoffgraft_extract.py \
  --model models/donor.gguf \
  --output fingerprints/hoffgraft_fingerprints.npz \
  --prompts-per-domain 200 \
  --n-ctx 2048

# Takes ~1-2 hours. Progress bar shows domain-by-domain.
# Output: ~1-5 MB .npz file

# Optional: delete donor to free 14 GB disk space
rm models/donor.gguf
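
The extraction internals are not shown in this README; the sketch below only illustrates the temperature-perturbation idea described under Limitations, using llama-cpp-python. Function names are illustrative, not the actual hoffgraft_extract.py code.

from llama_cpp import Llama

# CPU-only load; llama.cpp mmaps the GGUF weights by default.
donor = Llama(model_path="models/donor.gguf", n_ctx=2048, n_gpu_layers=0)

def next_token(prompt: str, temperature: float) -> str:
    out = donor(prompt, max_tokens=1, temperature=temperature)
    return out["choices"][0]["text"]

def stable_preference(prompt: str) -> bool:
    """Temperature perturbation: if the donor picks the same continuation at a
    low and a high temperature, record it as a confident token preference."""
    return next_token(prompt, 0.1) == next_token(prompt, 0.7)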

Step 3: Use at Runtime

from hoffgraft_steer import HoffGraftSteerer

steerer = HoffGraftSteerer(
    fingerprints_path="fingerprints/hoffgraft_fingerprints.npz",
    chassis_model="qwen2.5-coder:14b",  # or 7b, llama3.1:8b, etc.
    ollama_host="http://127.0.0.1:11434",
)

result = steerer.steer("What's on my calendar tomorrow?")
print(result["domain"])     # "scheduling"
print(result["confidence"]) # 0.85
print(result["response"])   # The chassis-generated answer

Or as a CLI:

python hoffgraft_steer.py \
  --fingerprints fingerprints/hoffgraft_fingerprints.npz \
  --chassis qwen2.5-coder:7b \
  --prompt "Debug why the API is returning 500 errors" \
  --verbose

# Output:
# [domain=analysis confidence=0.92 biases=128 time=3421ms]
# The 500 errors are likely caused by...

Step 4: Integration with OpenClaw

Mount as a tool or replace direct Ollama calls:

# In your bot/agent code
steerer = HoffGraftSteerer("fingerprints/hoffgraft_fingerprints.npz")

def generate_smart(prompt: str) -> str:
    result = steerer.steer(prompt)
    return result["response"]

Testing Before Extraction

The steering module works without fingerprints (just falls back to unbiased generation):

# Even with an empty/non-existent fingerprint file, classification works:
python hoffgraft_steer.py \
  --fingerprints /dev/null \
  --chassis qwen2.5-coder:7b \
  --prompt "Schedule a dentist appointment" \
  --stats

Our Gaming PC Fit

| Resource | Required | Available (3080 Ti + 32 GB) |
|---|---|---|
| Extraction RAM | ~16 GB | 32 GB ✅ |
| Chassis VRAM (14B) | ~9 GB | 10 GB ✅ |
| Chassis VRAM (7B) | ~5 GB | 10 GB ✅ (lots of room) |
| Fingerprint disk | ~5 MB | negligible ✅ |

Files

hoffgraft/
├── README.md
├── hoffgraft_extract.py   # Phase 0: donor extraction
└── hoffgraft_steer.py     # Phase 1: runtime steering

Differences from Gator

| | Gator | HoffGraft |
|---|---|---|
| Donor execution | C++ placeholder kernel | llama-cpp-python (real inference) |
| Donor model | 35B Q4_K_M | 32B Q3_K_M (fits 32 GB RAM) |
| Chassis model | 1.5B Q4_K_M | 7B–14B (your choice) |
| Bias capture | Final-token only | Decision-boundary tokens |
| Inference backend | Custom kernel + llama-server | Ollama API (standard) |
| Integration | Standalone system | Drop-in for existing Ollama setups |
| Memory system | LanceDB + SQLite | None (use your existing) |
| Persona engine | 6-axis trait system | None (use your existing) |
| Dashboard | HTMX FastAPI on :8080 | None (use your existing) |

Limitations

  1. Keyword-based classification — The domain classifier is deterministic keyword matching. Works well for our 5 domains (~90% accuracy) but won't handle edge cases. Easy to upgrade to an embedding-based classifier if needed.

  2. Extraction approximates confidence — Current confidence measurement uses temperature perturbation (checking if the same token appears at 0.1 vs 0.7 temp). A proper logprobs-based extraction would be more precise, but requires lower-level model access.

  3. Single-domain per query — Currently applies one domain's biases per prompt. For multi-domain queries, biases would need blending (future work; a sketch of one possible approach follows this list).

  4. Untested — This is a prototype; it needs a real extraction run on the gaming PC with 32 GB RAM to validate.
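
For item 3, one plausible blending approach is a confidence-weighted sum of per-domain bias maps. This is a hypothetical sketch, not part of the current code:

def blend_biases(domain_biases: dict[str, dict[int, float]],
                 weights: dict[str, float]) -> dict[int, float]:
    """Hypothetical multi-domain blending: weighted sum of per-domain token
    biases, with weights taken from classifier confidences (summing to 1)."""
    blended: dict[int, float] = {}
    for domain, biases in domain_biases.items():
        w = weights.get(domain, 0.0)
        for token_id, bias in biases.items():
            blended[token_id] = blended.get(token_id, 0.0) + w * bias
    return blended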