# HoffGraft

**Donor logic steering for local LLMs.** Run a 32B model's reasoning patterns through a 7B–14B chassis — on a single consumer GPU.

Inspired by the [Gator project](https://github.com/Mexor-dev/Gator)'s 35B graft architecture, adapted for Ollama + our existing inference stack.

## How It Works

```
┌──────────────────────────────────────────────────────┐
│  PHASE 0: One-time extraction (~1-2 hrs, CPU-only)   │
│                                                      │
│  32B Donor (Qwen2.5 32B Instruct Q3_K_M)            │
│  → Loaded via mmap, CPU-only (~14 GB RAM)            │
│  → 5 domains × N prompts each                       │
│  → Capture token-level probability biases            │
│  → Save as compressed .npz (~1-5 MB)                │
│  → Delete donor model — it's not needed anymore      │
│                                                      │
│  PHASE 1: Runtime steering (~100ms overhead)         │
│                                                      │
│  User prompt → Domain classifier → Bias lookup       │
│  → Ollama generate(logit_bias={biases})              │
│  → 7B-14B chassis responds with donor-influenced     │
│    token selection                                    │
│                                                      │
│  Chassis model stays in GPU VRAM at all times.       │
└──────────────────────────────────────────────────────┘
```

### The Insight

When a 32B model reasons about scheduling, it consistently favours certain tokens over others at decision boundaries (e.g., "the best time is..." vs "perhaps we could..."). These preferences can be captured as a **bias vector** and injected into a 7B model's sampling loop. The small model generates the text, but the big model steers the token choices — especially at critical decision points.

This gives you big-model reasoning quality at small-model latency and VRAM cost.

## Domains

| Domain | Purpose | Example Queries |
|--------|---------|----------------|
| `scheduling` | Calendar conflicts, time estimation, coordination | "When should we schedule the dentist?" |
| `email_triage` | Classification, priority, routing | "What do I need to action in my inbox?" |
| `coordination` | Family messages, task extraction, shopping lists | "Can someone pick up Harper at 3?" |
| `content_generation` | Summaries, briefings, writing | "Generate today's morning briefing" |
| `analysis` | Debugging, root cause, tradeoffs | "Why is the API returning 500s?" |

## Requirements

### Extraction (one-time)
- **32 GB system RAM** (for 32B Q3_K_M ~14 GB donor)
- `llama-cpp-python`, `numpy`, `scipy`
- 1-2 hours of CPU time
- No GPU needed for extraction

### Runtime
- **Any GPU that fits the chassis model** (10 GB VRAM for 14B Q4_K_M)
- `numpy` (fingerprints)
- `ollama` (chassis inference)
- No extra deps — uses urllib for Ollama API calls

## Installation

```bash
cd ~/.openclaw/shared/hoffgraft

# Runtime deps (minimal)
pip install numpy

# Extraction deps (only needed on the extraction machine)
pip install llama-cpp-python numpy scipy
```

## Usage

### Step 1: Download the Donor Model

On the machine with 32 GB RAM:

```bash
# ~14 GB download
wget https://huggingface.co/bartowski/Qwen2.5-32B-Instruct-GGUF/resolve/main/Qwen2.5-32B-Instruct-Q3_K_M.gguf \
  -O models/donor.gguf
```

### Step 2: Extract Fingerprints

```bash
python hoffgraft_extract.py \
  --model models/donor.gguf \
  --output fingerprints/hoffgraft_fingerprints.npz \
  --prompts-per-domain 200 \
  --n-ctx 2048

# Takes ~1-2 hours. Progress bar shows domain-by-domain.
# Output: ~1-5 MB .npz file

# Optional: delete donor to free 14 GB disk space
rm models/donor.gguf
```

### Step 3: Use at Runtime

```python
from hoffgraft_steer import HoffGraftSteerer

steerer = HoffGraftSteerer(
    fingerprints_path="fingerprints/hoffgraft_fingerprints.npz",
    chassis_model="qwen2.5-coder:14b",  # or 7b, llama3.1:8b, etc.
    ollama_host="http://127.0.0.1:11434",
)

result = steerer.steer("What's on my calendar tomorrow?")
print(result["domain"])     # "scheduling"
print(result["confidence"]) # 0.85
print(result["response"])   # The chassis-generated answer
```

Or as a CLI:

```bash
python hoffgraft_steer.py \
  --fingerprints fingerprints/hoffgraft_fingerprints.npz \
  --chassis qwen2.5-coder:7b \
  --prompt "Debug why the API is returning 500 errors" \
  --verbose

# Output:
# [domain=analysis confidence=0.92 biases=128 time=3421ms]
# The 500 errors are likely caused by...
```

### Step 4: Integration with OpenClaw

Mount as a tool or replace direct Ollama calls:

```python
# In your bot/agent code
steerer = HoffGraftSteerer("fingerprints/hoffgraft_fingerprints.npz")

def generate_smart(prompt: str) -> str:
    result = steerer.steer(prompt)
    return result["response"]
```

## Testing Before Extraction

The steering module works without fingerprints (just falls back to unbiased generation):

```bash
# Even with an empty/non-existent fingerprint file, classification works:
python hoffgraft_steer.py \
  --fingerprints /dev/null \
  --chassis qwen2.5-coder:7b \
  --prompt "Schedule a dentist appointment" \
  --stats
```

## Our Gaming PC Fit

| Resource | Required | Available (3080 Ti + 32GB) |
|----------|----------|----------------------------|
| Extraction RAM | ~16 GB | 32 GB ✅ |
| Chassis VRAM (14B) | ~9 GB | 10 GB ✅ |
| Chassis VRAM (7B) | ~5 GB | 10 GB ✅ (lots of room) |
| Fingerprint disk | ~5 MB | negligible ✅ |

## Files

```
hoffgraft/
├── README.md
├── hoffgraft_extract.py   # Phase 0: donor extraction
└── hoffgraft_steer.py     # Phase 1: runtime steering
```

## Differences from Gator

| | Gator | HoffGraft |
|---|---|---|
| Donor execution | C++ placeholder kernel | llama-cpp-python (real inference) |
| Donor model | 35B Q4_K_M | 32B Q3_K_M (fits 32 GB RAM) |
| Chassis model | 1.5B Q4_K_M | 7B–14B (your choice) |
| Bias capture | Final-token only | Decision-boundary tokens |
| Inference backend | Custom kernel + llama-server | Ollama API (standard) |
| Integration | Standalone system | Drop-in for existing Ollama setups |
| Memory system | LanceDB + SQLite | None (use your existing) |
| Persona engine | 6-axis trait system | None (use your existing) |
| Dashboard | HTMX FastAPI on :8080 | None (use your existing) |

## Limitations

1. **Keyword-based classification** — The domain classifier is deterministic keyword matching. Works well for our 5 domains (~90% accuracy) but won't handle edge cases. Easy to upgrade to an embedding-based classifier if needed.

2. **Extraction approximates confidence** — Current confidence measurement uses temperature perturbation (checking if the same token appears at 0.1 vs 0.7 temp). A proper logprobs-based extraction would be more precise, but requires lower-level model access.

3. **Single-domain per query** — Currently applies one domain's biases per prompt. For multi-domain queries, biases would need blending (future work).

4. **Untested — this is a prototype.** Needs a real extraction run on the gaming PC with 32 GB RAM to validate.