
Agent Health Troubleshooting Playbook

In case of emergency — when an agent sounds generic, forgets who they are, or goes off-identity.

Symptom: Agent responds generically ("I am a large language model trained by...")

Root Cause: Session Model Override Stuck

OpenClaw's auto-failover can permanently override an agent's model at the session level. If the primary model fails, the session silently switches to whatever model completed the request — and stays there. This overrides the agent's configured model and persona.

Example: Wadsworth's DM session auto-overrode from glm-5.1:cloud to gemma4:latest (Gaming PC model). Since gemma4 is a Google-trained model, it didn't adopt Wadsworth's butler persona and responded as "I am a large language model trained by Google."

Fix: Reset the Contaminated Session

Find the session key:
```bash
python3 -c "
import json
with open('/home/hoffmann_admin/.openclaw/agents/main/sessions/sessions.json') as f:
    store = json.load(f)
for k, v in store.items():
    if 'telegram:direct' in k and 'heartbeat' not in k:
        print(f'Key: {k}')
        print(f'  Model: {v.get(\"model\",\"?\")}')
        print(f'  ModelOverride: {v.get(\"modelOverride\",\"none\")}')
        print(f'  ProviderOverride: {v.get(\"providerOverride\",\"none\")}')
"
```
Check for model override contamination:
- If modelOverride is set to a model that doesn't match the agent's configured primary → contaminated
- If modelOverrideSource: "auto" → auto-failover stuck it there
Backup and delete the contaminated session:
```bash
SESSION_FILE="/home/hoffmann_admin/.openclaw/agents/main/sessions/.jsonl"
SESSION_KEY="agent:main:telegram:direct:"

# Backup first
cp "$SESSION_FILE" "${SESSION_FILE}.contaminated-backup.$(date +%Y%m%d%H%M%S)"

# Remove from sessions store
python3 -c "
import json
store_path = '/home/hoffmann_admin/.openclaw/agents/main/sessions/sessions.json'
with open(store_path) as f:
    store = json.load(f)
key = '$SESSION_KEY'
if key in store:
    del store[key]
    print(f'Removed: {key}')
with open(store_path, 'w') as f:
    json.dump(store, f, indent=2)
"

# Remove transcript
mv "$SESSION_FILE" "${SESSION_FILE}.contaminated-deleted.$(date +%Y%m%d%H%M%S)"

# Restart gateway
openclaw gateway restart
```

Test: DM the agent with "Who are you?" — should respond with correct identity.
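
The store edit in the procedure above rewrites sessions.json in place and backs up only the transcript. A slightly more defensive variant is sketched below. It is an illustration, not an OpenClaw tool: `remove_session` is a hypothetical helper, and it assumes nothing else (for example the gateway) is writing the store at the same moment.

```python
import json
import os
import shutil
import tempfile
import time


def remove_session(store_path, session_key):
    """Back up sessions.json, delete one key, and rewrite the file atomically.

    Returns True if the key was present and removed, False otherwise.
    """
    with open(store_path) as f:
        store = json.load(f)
    if session_key not in store:
        return False

    # Keep a timestamped copy of the store itself before touching it.
    backup = f"{store_path}.backup.{time.strftime('%Y%m%d%H%M%S')}"
    shutil.copy2(store_path, backup)

    del store[session_key]

    # Write to a temp file in the same directory, then rename it over the
    # original, so an interrupted write cannot leave a truncated store.
    directory = os.path.dirname(os.path.abspath(store_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(store, tmp, indent=2)
    os.replace(tmp_path, store_path)
    return True
```

Calling `remove_session(path_to_store, key)` with the key found in the first step returns `False` without modifying anything if the key is absent, which makes a typo in the key harmless. The transcript move and the gateway restart still need to be done as shown above.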

Prevention

- Monitor modelOverride and modelOverrideSource: "auto" in session stores
- If Active Memory or failover keeps overriding the model, the fallback chain may be too aggressive
- Consider whether ollama-remote (Gaming PC) models should be in the fallback chain for agents that need strong persona adherence. Remote models are slow and may trigger cascading failovers
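
The monitoring suggestion above could be scripted along these lines. This is a minimal sketch under stated assumptions: `find_stuck_overrides` is a hypothetical helper, the sample store is invented for illustration, and the comparison is an exact string match, so it assumes the override and the configured primary are written in the same form (both with or both without a provider prefix).

```python
import json
from pathlib import Path


def find_stuck_overrides(store, primary_model):
    """Return (session_key, override_model, source) for every session whose
    modelOverride is set and differs from the agent's configured primary."""
    stuck = []
    for key, entry in store.items():
        override = entry.get("modelOverride")
        if override and override != primary_model:
            source = entry.get("modelOverrideSource", "unknown")
            stuck.append((key, override, source))
    return stuck


def load_store(path):
    """Load a sessions.json store from disk (path may start with ~)."""
    return json.loads(Path(path).expanduser().read_text())


# Invented sample data; real keys and values will differ.
example_store = {
    "agent:main:telegram:direct:EXAMPLE": {
        "model": "gemma4:latest",
        "modelOverride": "gemma4:latest",
        "modelOverrideSource": "auto",
    },
    "agent:main:telegram:group:EXAMPLE": {"model": "glm-5.1:cloud"},
}

for key, override, source in find_stuck_overrides(example_store, "glm-5.1:cloud"):
    print(f"{key}: override={override} source={source}")
```

Against a real store, `find_stuck_overrides(load_store("~/.openclaw/agents/main/sessions/sessions.json"), "glm-5.1:cloud")` would list the sessions worth inspecting. Entries with source `auto` are the ones auto-failover left behind.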

Symptom: Agent takes 30-60 seconds before responding (then may respond generically)

Root Cause: Active Memory Plugin Blocking on Slow Model

Active Memory runs as a blocking pre-response hook. It searches session history and generates a recall summary before the main agent can respond. If the recall model is too slow, every message has a 30-60s delay before the agent even starts thinking.

Example: Active Memory was configured for ollama-remote/gemma4:latest (Gaming PC). 7 out of 8 calls timed out at 30s. Even switching to glm-5.1:cloud (cloud proxy) resulted in 58s completion time — still too slow for a blocking gate.

Fix: Disable Active Memory or Use a Fast Local Model

Option A — Disable Active Memory (recommended for current hardware):

"active-memory": {
  "enabled": false,
  "config": {
    "agents": ["main"],
    "model": "ollama/llama3.2:1b-instruct-q4_K_M",
    "modelFallback": "ollama/glm-5.1:cloud"
  }
}

Workspace files (SOUL.md, IDENTITY.md, MEMORY.md, AGENTS.md) provide strong identity grounding without Active Memory. The sessionMemory experimental feature (already enabled) indexes session transcripts for recall without the blocking gate.

Option B — Use a fast local model (requires sub-5s inference on Beelink CPU):
As of 2026-04-20, no tested model completes Active Memory recall in under 5s on the Beelink's CPU:
- llama3.2:3b: 8.4s, failed to recall ("None")
- llama3.2:1b: 9.9s, partial recall
- llama3.2:1b-instruct-q4_K_M: 6.4s, best option but still too slow
- qwen3.5:4b: >30s, timed out

Revisit if the Beelink gets a GPU or a significantly faster small model becomes available.
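
When a new candidate model is tried, the 5s budget above can be checked with a small timing harness. The sketch below is generic: `within_budget` is a hypothetical helper, and the callable passed to it stands in for whatever actually issues the recall request, which is not shown here.

```python
import time


def within_budget(fn, budget_s=5.0):
    """Run fn() once and return (elapsed_seconds, result, met_budget)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return elapsed, result, elapsed < budget_s


# Stand-in workload; replace the lambda with a real recall call.
elapsed, result, ok = within_budget(lambda: "recall summary")
print(f"{elapsed:.2f}s within budget: {ok}")
```

A single run is noisy, so repeating the measurement a few times and looking at the slowest result gives a more honest picture for a blocking hook.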

Active Memory Logs

Check Active Memory performance in gateway logs:

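# -h drops the "filename:" prefix grep adds when the glob matches several files; without it json.loads fails on every line
grep -h "active-memory" /tmp/openclaw/openclaw-*.log | grep -E "(start|done)" | python3 -c "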
import sys, json
for line in sys.stdin:
    try:
        d = json.loads(line.strip())
        msg = d.get('1','')
        ts = d.get('time','')
        if 'active-memory' in msg:
            print(f'{ts[:19]} | {msg}')
    except: pass
" | tail -20

Look for status=timeout or status=empty — these mean the recall failed and the agent got zero memory context.
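
To turn that inspection into a number, the messages can be tallied by status. This is a rough sketch: it assumes each message carries a `status=<value>` token as described above, and the helper names and the `status=ok` value used in the example are assumptions, not taken from the gateway logs.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r"status=(\w+)")


def summarize_statuses(messages):
    """Count status=<value> tokens across Active Memory log messages."""
    counts = Counter()
    for msg in messages:
        match = STATUS_RE.search(msg)
        if match:
            counts[match.group(1)] += 1
    return counts


def failure_rate(counts, failed=("timeout", "empty")):
    """Fraction of recalls that ended with no memory context."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[status] for status in failed) / total


# Invented messages mirroring the 7-of-8 timeout incident described earlier.
sample = ["active-memory done status=timeout"] * 7 + ["active-memory done status=ok"]
counts = summarize_statuses(sample)
print(dict(counts), failure_rate(counts))
```

A failure rate that stays high after a model change is a sign that the recall model is still too slow for a blocking hook.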

Symptom: Agent identity confusion (responds as wrong agent)

Root Cause: Contaminated Session History

If someone operated as Agent A through Agent B's bot/channel, the session history contains mixed identity context. The model then can't determine which persona to adopt.

Fix: Reset the contaminated session (same procedure as model override above)

Prevention

- Never operate as Socrates through the main/default bot: it contaminates Wadsworth's session history with wrong identity context
- Never operate as Daedalus through Wadsworth's bot: same problem
- Always use the correct agent's bot/channel for that agent's work
- Inter-agent communication: use sessions_send or sessions_spawn, never the Telegram API

Symptom: `hoffdesk-webhook` service failing

Root Cause: Template placeholders in systemd unit file

The service file at /etc/systemd/system/hoffdesk-webhook.service had __USER__, __WORKDIR__, __PREFIX__ placeholders that were never substituted.

Fix: Already applied (2026-04-20)

Replaced with actual values:
- __USER__ → hoffmann_admin
- __WORKDIR__ → /home/hoffmann_admin/.openclaw/services/family_assistant
- __PREFIX__ → /home/hoffmann_admin/.local/bin

Requires sudo to apply:

sudo systemctl daemon-reload && sudo systemctl restart hoffdesk-webhook
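
To catch this class of failure before a restart, a unit file can be scanned for leftover `__NAME__` tokens. The following is a minimal sketch: `find_placeholders` is a hypothetical helper, and the sample unit text is invented to mirror the three placeholders named above.

```python
import re

# Matches template tokens such as __USER__ or __WORKDIR__.
PLACEHOLDER_RE = re.compile(r"__[A-Z][A-Z0-9]*__")


def find_placeholders(unit_text):
    """Return (line_number, placeholder) pairs for unsubstituted tokens."""
    hits = []
    for lineno, line in enumerate(unit_text.splitlines(), start=1):
        for token in PLACEHOLDER_RE.findall(line):
            hits.append((lineno, token))
    return hits


# Invented unit text for illustration only.
sample_unit = (
    "[Service]\n"
    "User=__USER__\n"
    "WorkingDirectory=__WORKDIR__\n"
    "ExecStart=__PREFIX__/run\n"
)

for lineno, token in find_placeholders(sample_unit):
    print(f"line {lineno}: {token}")
```

Running the same function on the text of /etc/systemd/system/hoffdesk-webhook.service should return an empty list once the substitution has been applied.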

Quick Reference: Agent Config Locations

| Item | Path |
|---|---|
| Main config | `~/.openclaw/openclaw.json` |
| Wadsworth workspace | `~/.openclaw/workspace/` |
| Socrates workspace | `~/.openclaw/workspace-socrates/` |
| Daedalus workspace | `~/.openclaw/workspace-daedalus/` |
| Wadsworth sessions | `~/.openclaw/agents/main/sessions/` |
| Socrates sessions | `~/.openclaw/agents/socrates/sessions/` |
| Daedalus sessions | `~/.openclaw/agents/daedalus/sessions/` |
| Gateway logs | `/tmp/openclaw/openclaw-YYYY-MM-DD.log` |
| Webhook service | `/etc/systemd/system/hoffdesk-webhook.service` |
| Webhook env | `~/.openclaw/workspace/scripts/.env` |

Quick Reference: Session Reset Commands

# List sessions for an agent
python3 -c "
import json
with open('/home/hoffmann_admin/.openclaw/agents/<AGENT>/sessions/sessions.json') as f:
    store = json.load(f)
for k, v in store.items():
    if 'direct' in k or 'group' in k:
        print(f'{k:60s} | model={v.get(\"model\",\"?\"):20s} | override={v.get(\"modelOverride\",\"none\")}')
"

# Reset a specific session (backup first!)
# Replace SESSION_KEY and SESSION_ID with actual values

Last updated: 2026-04-20 by Socrates
Incident: Wadsworth responding as "trained by Google" due to model override + Active Memory cascade failure

FALLBACK CONTINGENCIES (In Case of Emergency)

Source of Truth: github.com:NightKnight64/hoffdesk-agents
Last Backup: 2026-04-20 — Mono-repo consolidation complete

Scenario: Complete Agent Loss (Corrupted or Deleted)

Recovery Protocol

Clone the mono-repo:
```bash
git clone git@github.com:NightKnight64/hoffdesk-agents.git ~/hoffdesk-agents-recovery
cd ~/hoffdesk-agents-recovery
```
Restore agent workspace:
```bash
# For Wadsworth
cp -r agents/wadsworth/* ~/.openclaw/workspace/

# For Socrates
cp -r agents/socrates/* ~/.openclaw/workspace-socrates/

# For Daedalus
cp -r agents/daedalus/* ~/.openclaw/workspace-daedalus/
```

Restore shared artifacts:
```bash
cp -r shared/* ~/.openclaw/shared/
```
Verify openclaw.json workspace paths:
```bash
cat ~/.openclaw/openclaw.json | python3 -c "
import json, sys
d = json.load(sys.stdin)
for agent in d.get('agents', {}).get('list', []):
    print(f'{agent[\"id\"]}: {agent[\"workspace\"]}')
"
```
- main → should point to ~/.openclaw/workspace (or ~/hoffdesk-agents/agents/wadsworth if migrated)
- socrates → should point to ~/.openclaw/workspace-socrates
- daedalus → should point to ~/.openclaw/workspace-daedalus
Restart gateway:
```bash
openclaw gateway restart
```
Test: Send "Who are you?" to each agent — should respond with correct identity.
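
The path verification step in this protocol can be automated instead of read by eye. The sketch below assumes the `agents.list[].id` / `workspace` layout that the one-liner above reads; `check_workspaces` and the `EXPECTED` table are hypothetical names introduced here. If main has been migrated to ~/hoffdesk-agents/agents/wadsworth, pass a different `expected` mapping.

```python
import os

# Expected workspace per agent id, taken from the checklist above.
EXPECTED = {
    "main": "~/.openclaw/workspace",
    "socrates": "~/.openclaw/workspace-socrates",
    "daedalus": "~/.openclaw/workspace-daedalus",
}


def _norm(path):
    """Expand ~ and drop trailing slashes so equivalent paths compare equal."""
    return os.path.normpath(os.path.expanduser(path))


def check_workspaces(config, expected=EXPECTED):
    """Return human-readable mismatches; an empty list means all paths agree."""
    agents = config.get("agents", {}).get("list", [])
    configured = {a.get("id"): a.get("workspace") for a in agents}
    problems = []
    for agent_id, want in expected.items():
        got = configured.get(agent_id)
        if got is None:
            problems.append(f"{agent_id}: missing from config")
        elif _norm(got) != _norm(want):
            problems.append(f"{agent_id}: {got} (expected {want})")
    return problems
```

Loading `~/.openclaw/openclaw.json` with `json.load` and passing the result to `check_workspaces` gives a list that should be empty before the gateway is restarted.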

Scenario: Gateway Completely Broken (Won't Start)

Recovery Protocol

Check logs for config errors:
```bash
tail -100 /tmp/openclaw/openclaw-$(date +%Y-%m-%d).log
```
Validate openclaw.json:
```bash
openclaw config validate
```
If config corrupted, restore from mono-repo:
```bash
cp ~/hoffdesk-agents/openclaw.json ~/.openclaw/openclaw.json
openclaw gateway restart
```
If still broken, reset to minimal config:
```bash
# Backup current
cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.broken.$(date +%Y%m%d%H%M%S)

# Generate minimal config
openclaw config
# Follow prompts to rebuild
```
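
If the gateway is too broken to run its own validation, a plain syntax check can at least locate a malformed config file. This sketch assumes openclaw.json is strict JSON, which matches how the other snippets in this playbook read it; if the file uses comments or trailing commas, this check will report them as errors. `check_json` is a hypothetical helper and says nothing about whether the keys are valid for OpenClaw.

```python
import json


def check_json(text):
    """Return None if text parses as JSON, else a 'line N, column M: reason' string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# A trailing comma is a common hand-editing mistake.
print(check_json('{"agents": {"list": []},}'))
```

Reading the real file with `open(os.path.expanduser("~/.openclaw/openclaw.json")).read()` and passing the text in gives the position of the first syntax error, if any.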

Scenario: Beelink Hardware Failure (Total Loss)

Recovery Protocol

Assumptions: You have access to another Linux machine or fresh Ubuntu install.

Install OpenClaw:
```bash
npm install -g openclaw
```
Clone mono-repo:
```bash
git clone git@github.com:NightKnight64/hoffdesk-agents.git ~/hoffdesk-agents
```
Recreate directory structure:
```bash
mkdir -p ~/.openclaw
cp -r ~/hoffdesk-agents/agents/wadsworth ~/.openclaw/workspace
cp -r ~/hoffdesk-agents/agents/socrates ~/.openclaw/workspace-socrates
cp -r ~/hoffdesk-agents/agents/daedalus ~/.openclaw/workspace-daedalus
cp -r ~/hoffdesk-agents/shared ~/.openclaw/shared
cp ~/hoffdesk-agents/openclaw.json ~/.openclaw/
```
Reconfigure secrets:
- Telegram bot tokens
- Cloudflare tokens
- Any API keys (not stored in git)
Start gateway:
```bash
openclaw gateway start
```
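
Before starting the gateway on the replacement machine, it can help to confirm that the directory structure recreated above is complete. The following is a small sketch: `missing_paths` and the `REQUIRED` list are hypothetical names, and the list covers only the entries copied in this protocol, not secrets or credentials.

```python
import os

# Entries the restore steps above are expected to create under ~/.openclaw.
REQUIRED = [
    "workspace",
    "workspace-socrates",
    "workspace-daedalus",
    "shared",
    "openclaw.json",
]


def missing_paths(root, required=REQUIRED):
    """Return the required entries that do not exist under root."""
    root = os.path.expanduser(root)
    return [name for name in required
            if not os.path.exists(os.path.join(root, name))]
```

`missing_paths("~/.openclaw")` should return an empty list once every copy step has succeeded. Anything it lists points at the step to repeat.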

Scenario: Single Session Corrupted (Agent works but one chat is broken)

Quick Fix

# Find the corrupted session
python3 -c "
import json
path = '/home/hoffmann_admin/.openclaw/agents/main/sessions/sessions.json'
with open(path) as f:
    store = json.load(f)
for k, v in store.items():
    if 'telegram:direct' in k:
        print(f'{k} | model={v.get(\"model\",\"?\")} | override={v.get(\"modelOverride\",\"none\")}')
"

# Reset that specific session (backup first!)
# Then DM the agent again — fresh session will spawn automatically

Critical Files Not in Git (Must Reconfigure Manually)

| File | Purpose | Backup Strategy |
|---|---|---|
| `~/.openclaw/openclaw.json` | Main config | In mono-repo |
| `~/.openclaw/credentials/` | OAuth tokens | NOT IN GIT (re-auth required) |
| Telegram bot tokens | Messaging | NOT IN GIT (reconfigure required) |
| Cloudflare API tokens | Tunnel/Pages | NOT IN GIT (reconfigure required) |
| `~/.openclaw/services/.env` | Service secrets | Symlinked to workspace, in mono-repo |

Emergency Contacts

Primary: Wadsworth (main agent) — handles coordination
Backend/Issues: Socrates 🧠
Frontend/Issues: Daedalus 🎨
Director: Matt

Last updated: 2026-04-20 by Wadsworth
Mono-repo migration: github.com:NightKnight64/hoffdesk-agents
