
Agent Health Troubleshooting Playbook

In case of emergency — when an agent sounds generic, forgets who they are, or goes off-identity.

Symptom: Agent responds generically ("I am a large language model trained by...")

Root Cause: Session Model Override Stuck

OpenClaw's auto-failover can permanently override an agent's model at the session level. If the primary model fails, the session silently switches to whatever model completed the request — and stays there. This overrides the agent's configured model and persona.

Example: Wadsworth's DM session auto-overrode from glm-5.1:cloud to gemma4:latest (Gaming PC model). Since gemma4 is a Google-trained model, it didn't adopt Wadsworth's butler persona and responded as "I am a large language model trained by Google."

Fix: Reset the Contaminated Session

  1. Find the session key:
    ```bash
    python3 -c "
    import json
    with open('/home/hoffmann_admin/.openclaw/agents/main/sessions/sessions.json') as f:
        store = json.load(f)
    for k, v in store.items():
        if 'telegram:direct' in k and 'heartbeat' not in k:
            print(f'Key: {k}')
            print(f' Model: {v.get(\"model\",\"?\")}')
            print(f' ModelOverride: {v.get(\"modelOverride\",\"none\")}')
            print(f' ProviderOverride: {v.get(\"providerOverride\",\"none\")}')
    "
    ```

  2. Check for model override contamination:
    - If modelOverride is set to a model that doesn't match the agent's configured primary → contaminated
    - If modelOverrideSource: "auto" → auto-failover stuck it there

  3. Backup and delete the contaminated session:
    ```bash
    # Both values below are incomplete: fill in the real transcript file name
    # and the full session key (from step 1) before running
    SESSION_FILE="/home/hoffmann_admin/.openclaw/agents/main/sessions/.jsonl"
    SESSION_KEY="agent:main:telegram:direct:"

    # Backup first
    cp "$SESSION_FILE" "${SESSION_FILE}.contaminated-backup.$(date +%Y%m%d%H%M%S)"

    # Remove from sessions store
    python3 -c "
    import json
    store_path = '/home/hoffmann_admin/.openclaw/agents/main/sessions/sessions.json'
    with open(store_path) as f:
        store = json.load(f)
    key = '$SESSION_KEY'
    if key in store:
        del store[key]
        print(f'Removed: {key}')
    with open(store_path, 'w') as f:
        json.dump(store, f, indent=2)
    "

    # Remove transcript
    mv "$SESSION_FILE" "${SESSION_FILE}.contaminated-deleted.$(date +%Y%m%d%H%M%S)"

    # Restart gateway
    openclaw gateway restart
    ```

  4. Test: DM the agent with "Who are you?". It should respond with the correct identity.
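
The override check in step 2 can also be automated. The following is a minimal sketch, assuming the session store is a flat JSON object keyed by session key and that `modelOverride` holds the same model string as the agent's configured primary; the function name `find_contaminated` and the sample keys are hypothetical, not part of OpenClaw.

```python
from typing import Any


def find_contaminated(
    store: dict[str, dict[str, Any]], primary_model: str
) -> list[tuple[str, str, bool]]:
    """Return (session_key, override_model, is_auto) for suspicious sessions.

    A session is flagged when modelOverride is set and differs from the
    agent's configured primary model.
    """
    flagged = []
    for key, entry in store.items():
        override = entry.get("modelOverride")
        if override and override != primary_model:
            is_auto = entry.get("modelOverrideSource") == "auto"
            flagged.append((key, override, is_auto))
    return flagged


# Hypothetical sample data shaped like sessions.json entries.
sample_store = {
    "agent:main:telegram:direct:example": {
        "model": "gemma4:latest",
        "modelOverride": "gemma4:latest",
        "modelOverrideSource": "auto",
    },
    "agent:main:telegram:group:example": {"model": "glm-5.1:cloud"},
}

for key, model, is_auto in find_contaminated(sample_store, "glm-5.1:cloud"):
    source = "auto-failover" if is_auto else "manual or unknown"
    print(f"{key}: overridden to {model} ({source})")
```

To run it against a real store, load `sessions.json` with `json.load` and pass the agent's primary model as the second argument. If the store records the override with a provider prefix, the comparison would need to account for that.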

Prevention

  • Monitor modelOverride and modelOverrideSource: "auto" in session stores
  • If Active Memory or failover keeps overriding the model, the fallback chain may be too aggressive
  • Consider whether ollama-remote (Gaming PC) models should be in the fallback chain for agents that need strong persona adherence — remote models are slow and may trigger cascading failovers

Symptom: Agent takes 30-60 seconds before responding (then may respond generically)

Root Cause: Active Memory Plugin Blocking on Slow Model

Active Memory runs as a blocking pre-response hook. It searches session history and generates a recall summary before the main agent can respond. If the recall model is too slow, every message has a 30-60s delay before the agent even starts thinking.

Example: Active Memory was configured for ollama-remote/gemma4:latest (Gaming PC). 7 out of 8 calls timed out at 30s. Even switching to glm-5.1:cloud (cloud proxy) resulted in 58s completion time — still too slow for a blocking gate.
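
To illustrate why a blocking gate is so sensitive to recall latency, here is a small sketch of a pre-response step that waits for recall up to a fixed budget. It is not OpenClaw's implementation; `recall_with_budget`, `slow_recall`, and the status strings are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def recall_with_budget(recall_fn, budget_s: float):
    """Run recall_fn in a worker thread; return (result, status, waited_s).

    status is "ok", "empty", or "timeout". The caller blocks for at most
    roughly budget_s seconds before moving on without memory context.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    future = pool.submit(recall_fn)
    try:
        result = future.result(timeout=budget_s)
        status = "ok" if result else "empty"
    except FutureTimeout:
        result, status = None, "timeout"
    finally:
        # Do not wait for a slow recall to finish before returning.
        pool.shutdown(wait=False)
    return result, status, time.perf_counter() - start


def slow_recall():
    """Stand-in for a recall model that answers too slowly."""
    time.sleep(0.3)
    return "summary"


print(recall_with_budget(lambda: "summary", budget_s=0.2)[:2])
print(recall_with_budget(slow_recall, budget_s=0.05)[:2])
```

Whatever the budget is, the main response cannot start until this step returns, so a recall model that regularly hits the timeout adds the full budget to every message and still yields no memory context.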

Fix: Disable Active Memory or Use a Fast Local Model

Option A — Disable Active Memory (recommended for current hardware):

"active-memory": {
  "enabled": false,
  "config": {
    "agents": ["main"],
    "model": "ollama/llama3.2:1b-instruct-q4_K_M",
    "modelFallback": "ollama/glm-5.1:cloud"
  }
}

Workspace files (SOUL.md, IDENTITY.md, MEMORY.md, AGENTS.md) provide strong identity grounding without Active Memory. The sessionMemory experimental feature (already enabled) indexes session transcripts for recall without the blocking gate.

Option B — Use a fast local model (requires sub-5s inference on Beelink CPU):
As of 2026-04-20, no tested model completes Active Memory recall in under 5s on the Beelink's CPU:
- llama3.2:3b: 8.4s, failed to recall ("None")
- llama3.2:1b: 9.9s, partial recall
- llama3.2:1b-instruct-q4_K_M: 6.4s, best option but still too slow
- qwen3.5:4b: >30s, timed out

Revisit if the Beelink gets a GPU or a significantly faster small model becomes available.
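
When new models are tested, the same comparison against the 5s budget can be scripted. This sketch uses only the timings listed above; the helper names `within_budget` and `fastest` are hypothetical.

```python
BUDGET_S = 5.0

# Timings copied from the list above; None marks a run that timed out (>30s).
measured_s = {
    "llama3.2:3b": 8.4,
    "llama3.2:1b": 9.9,
    "llama3.2:1b-instruct-q4_K_M": 6.4,
    "qwen3.5:4b": None,
}


def within_budget(results, budget_s):
    """Return the models whose measured time is at or under budget_s."""
    return [m for m, t in results.items() if t is not None and t <= budget_s]


def fastest(results):
    """Return (model, seconds) for the quickest completed run, or None."""
    completed = {m: t for m, t in results.items() if t is not None}
    if not completed:
        return None
    best = min(completed, key=completed.get)
    return best, completed[best]


print(within_budget(measured_s, BUDGET_S))
model, seconds = fastest(measured_s)
print(f"{model}: {seconds - BUDGET_S:.1f}s over budget")
```

Speed is only half the test: a model that fits the budget but returns "None" or a partial recall, as some of the runs above did, is still not usable.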

Active Memory Logs

Check Active Memory performance in gateway logs:

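# -h drops grep's file-name prefix (added when the glob matches several files)
# so that each line stays valid JSON
grep -h "active-memory" /tmp/openclaw/openclaw-*.log | grep -E "(start|done)" | python3 -c "
import sys, json
for line in sys.stdin:
    try:
        d = json.loads(line.strip())
        msg = d.get('1','')
        ts = d.get('time','')
        if 'active-memory' in msg:
            print(f'{ts[:19]} | {msg}')
    except (ValueError, AttributeError, TypeError):
        pass  # skip lines that are not JSON objects with string fields
" | tail -20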

Look for status=timeout or status=empty — these mean the recall failed and the agent got zero memory context.
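
To see how often recall fails rather than reading individual lines, the statuses can be tallied. This is a sketch that assumes the same log shape as the pipeline above (one JSON object per line, message text under key `'1'`) and a `status=<value>` token in the message; the sample lines, including `status=ok`, are hypothetical.

```python
import json
import re
from collections import Counter

STATUS_RE = re.compile(r"status=(\w+)")


def count_statuses(lines):
    """Count status=<value> tokens in active-memory log messages.

    Lines that are not JSON objects, or that do not mention
    active-memory, are skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue
        if not isinstance(record, dict):
            continue
        message = str(record.get("1", ""))
        if "active-memory" not in message:
            continue
        match = STATUS_RE.search(message)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines; real message wording may differ.
sample = [
    '{"time": "2026-04-20T10:00:00Z", "1": "active-memory done status=timeout"}',
    '{"time": "2026-04-20T10:01:00Z", "1": "active-memory done status=ok"}',
    '{"time": "2026-04-20T10:02:00Z", "1": "active-memory done status=timeout"}',
    "not json",
]
counts = count_statuses(sample)
failed = counts["timeout"] + counts["empty"]
print(dict(counts), f"failed={failed}")
```

For real logs, pass an open file handle (or `sys.stdin`) in place of `sample`.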


Symptom: Agent identity confusion (responds as wrong agent)

Root Cause: Contaminated Session History

If someone operated as Agent A through Agent B's bot/channel, the session history contains mixed identity context. The model then can't determine which persona to adopt.

Fix: Reset the contaminated session (same procedure as model override above)

Prevention

  • Never operate as Socrates through the main/default bot — it contaminates Wadsworth's session history with wrong identity context
  • Never operate as Daedalus through Wadsworth's bot — same problem
  • Always use the correct agent's bot/channel for that agent's work
  • Inter-agent communication: use sessions_send or sessions_spawn, never the Telegram API

Symptom: hoffdesk-webhook service failing

Root Cause: Template placeholders in systemd unit file

The service file at /etc/systemd/system/hoffdesk-webhook.service had __USER__, __WORKDIR__, __PREFIX__ placeholders that were never substituted.

Fix: Already applied (2026-04-20)

Replaced with actual values:
- __USER__ → hoffmann_admin
- __WORKDIR__ → /home/hoffmann_admin/.openclaw/services/family_assistant
- __PREFIX__ → /home/hoffmann_admin/.local/bin

Requires sudo to apply:

sudo systemctl daemon-reload && sudo systemctl restart hoffdesk-webhook
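
A check along these lines could catch leftover placeholders before a unit file is installed. It is a sketch: the `__NAME__` pattern is inferred from the three placeholders named above, the helper names are hypothetical, and the unit-file fragment is invented for illustration.

```python
import re

# Matches tokens such as __USER__ or __WORK_DIR__ (uppercase letters and digits).
PLACEHOLDER_RE = re.compile(r"__[A-Z0-9]+(?:_[A-Z0-9]+)*__")


def find_placeholders(text):
    """Return the sorted, de-duplicated __NAME__ tokens still present in text."""
    return sorted(set(PLACEHOLDER_RE.findall(text)))


def substitute(text, values):
    """Replace each __NAME__ token using values; unknown tokens are left alone."""
    return PLACEHOLDER_RE.sub(lambda m: values.get(m.group(0), m.group(0)), text)


# Hypothetical unit-file fragment; the real file's directives may differ.
template = """[Service]
User=__USER__
WorkingDirectory=__WORKDIR__
ExecStart=__PREFIX__/example-binary
"""

values = {
    "__USER__": "hoffmann_admin",
    "__WORKDIR__": "/home/hoffmann_admin/.openclaw/services/family_assistant",
    "__PREFIX__": "/home/hoffmann_admin/.local/bin",
}

print(find_placeholders(template))
rendered = substitute(template, values)
print(find_placeholders(rendered))
```

An empty list from `find_placeholders` on the final file is the condition to check before running `systemctl daemon-reload`.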

Quick Reference: Agent Config Locations

| Item | Path |
|---|---|
| Main config | ~/.openclaw/openclaw.json |
| Wadsworth workspace | ~/.openclaw/workspace/ |
| Socrates workspace | ~/.openclaw/workspace-socrates/ |
| Daedalus workspace | ~/.openclaw/workspace-daedalus/ |
| Wadsworth sessions | ~/.openclaw/agents/main/sessions/ |
| Socrates sessions | ~/.openclaw/agents/socrates/sessions/ |
| Daedalus sessions | ~/.openclaw/agents/daedalus/sessions/ |
| Gateway logs | /tmp/openclaw/openclaw-YYYY-MM-DD.log |
| Webhook service | /etc/systemd/system/hoffdesk-webhook.service |
| Webhook env | ~/.openclaw/workspace/scripts/.env |

Quick Reference: Session Reset Commands

# List sessions for an agent
python3 -c "
import json
with open('/home/hoffmann_admin/.openclaw/agents/<AGENT>/sessions/sessions.json') as f:
    store = json.load(f)
for k, v in store.items():
    if 'direct' in k or 'group' in k:
        print(f'{k:60s} | model={v.get(\"model\",\"?\"):20s} | override={v.get(\"modelOverride\",\"none\")}')
"

# Reset a specific session (backup first!)
# Replace SESSION_KEY and SESSION_ID with actual values
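
The reset noted in the comments above could be sketched in Python as follows. This is an assumption-laden outline, not an OpenClaw command: `reset_session` is a hypothetical helper that only edits the sessions store, and it leaves the transcript file and the gateway restart to the operator. The demo runs against a temporary file rather than a real store.

```python
import json
import shutil
import tempfile
import time
from pathlib import Path


def reset_session(store_path, session_key):
    """Back up sessions.json, then remove session_key from it.

    Returns the backup path, or None when the key is absent (in which
    case nothing is written).
    """
    store_path = Path(store_path)
    store = json.loads(store_path.read_text())
    if session_key not in store:
        return None
    stamp = time.strftime("%Y%m%d%H%M%S")
    backup = store_path.with_name(f"{store_path.name}.backup.{stamp}")
    shutil.copy2(store_path, backup)
    del store[session_key]
    store_path.write_text(json.dumps(store, indent=2))
    return backup


# Demo on a throwaway store with hypothetical keys.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "sessions.json"
    demo.write_text(json.dumps({"agent:main:telegram:direct:example": {}, "other": {}}))
    print(reset_session(demo, "agent:main:telegram:direct:example"))
    print(json.loads(demo.read_text()))
```

Taking the backup before any write means a wrong key or an interrupted run can be undone by copying the backup back into place.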

Last updated: 2026-04-20 by Socrates
Incident: Wadsworth responding as "trained by Google" due to model override + Active Memory cascade failure


FALLBACK CONTINGENCIES (In Case of Emergency)

Source of Truth: github.com:NightKnight64/hoffdesk-agents
Last Backup: 2026-04-20 — Mono-repo consolidation complete


Scenario: Complete Agent Loss (Corrupted or Deleted)

Recovery Protocol

  1. Clone the mono-repo:
    ```bash
    git clone git@github.com:NightKnight64/hoffdesk-agents.git ~/hoffdesk-agents-recovery
    cd ~/hoffdesk-agents-recovery
    ```

  2. Restore agent workspace:
    ```bash
    # For Wadsworth
    cp -r agents/wadsworth/* ~/.openclaw/workspace/

    # For Socrates
    cp -r agents/socrates/* ~/.openclaw/workspace-socrates/

    # For Daedalus
    cp -r agents/daedalus/* ~/.openclaw/workspace-daedalus/
    ```

  3. Restore shared artifacts:
    ```bash
    cp -r shared/* ~/.openclaw/shared/
    ```

  4. Verify openclaw.json workspace paths:
    ```bash
    cat ~/.openclaw/openclaw.json | python3 -c "
    import json, sys
    d = json.load(sys.stdin)
    for agent in d.get('agents', {}).get('list', []):
        print(f'{agent[\"id\"]}: {agent[\"workspace\"]}')
    "
    ```
    - main → should point to ~/.openclaw/workspace (or ~/hoffdesk-agents/agents/wadsworth if migrated)
    - socrates → should point to ~/.openclaw/workspace-socrates
    - daedalus → should point to ~/.openclaw/workspace-daedalus

  5. Restart gateway:
    ```bash
    openclaw gateway restart
    ```

  6. Test: Send "Who are you?" to each agent. Each should respond with its correct identity.
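
The workspace-path check in the steps above can be turned into a comparison instead of a visual inspection. This sketch assumes the config shape used by that check (`agents.list` entries with `id` and `workspace`); `workspace_mismatches` and the sample config are hypothetical.

```python
import os


def workspace_mismatches(config, expected):
    """Compare each agent's configured workspace with the expected path.

    Returns {agent_id: (configured, expected)} for agents that differ or
    are missing. Paths are compared after ~ expansion and normalisation.
    """
    def norm(path):
        return os.path.normpath(os.path.expanduser(path))

    configured = {
        agent.get("id"): agent.get("workspace")
        for agent in config.get("agents", {}).get("list", [])
    }
    problems = {}
    for agent_id, want in expected.items():
        have = configured.get(agent_id)
        if have is None or norm(have) != norm(want):
            problems[agent_id] = (have, want)
    return problems


expected = {
    "main": "~/.openclaw/workspace",
    "socrates": "~/.openclaw/workspace-socrates",
    "daedalus": "~/.openclaw/workspace-daedalus",
}

# Hypothetical config: socrates points at the wrong workspace, daedalus is missing.
sample_config = {"agents": {"list": [
    {"id": "main", "workspace": "~/.openclaw/workspace/"},
    {"id": "socrates", "workspace": "~/.openclaw/workspace"},
]}}

print(workspace_mismatches(sample_config, expected))
```

If main has been migrated to ~/hoffdesk-agents/agents/wadsworth, change its entry in `expected` accordingly before running the comparison.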


Scenario: Gateway Completely Broken (Won't Start)

Recovery Protocol

  1. Check logs for config errors:
    ```bash
    tail -100 /tmp/openclaw/openclaw-$(date +%Y-%m-%d).log
    ```

  2. Validate openclaw.json:
    ```bash
    openclaw config validate
    ```

  3. If config corrupted, restore from mono-repo:
    ```bash
    cp ~/hoffdesk-agents/openclaw.json ~/.openclaw/openclaw.json
    openclaw gateway restart
    ```

  4. If still broken, reset to minimal config:
    ```bash
    # Backup current
    cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.broken.$(date +%Y%m%d%H%M%S)

    # Generate minimal config
    openclaw config
    # Follow prompts to rebuild
    ```
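
If the gateway tooling itself will not run, a plain syntax check can at least locate a truncated or hand-edited config. This sketch assumes openclaw.json is strict JSON, as the `json.load` calls elsewhere in this playbook imply; it checks syntax only and knows nothing about OpenClaw's schema. `json_syntax_error` is a hypothetical helper.

```python
import json


def json_syntax_error(text):
    """Return None if text parses as JSON, else a 'line:col: message' string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"{err.lineno}:{err.colno}: {err.msg}"
    return None


# Hypothetical snippets: one valid, one with an unclosed list.
good = '{"agents": {"list": []}}'
broken = '{"agents": {"list": [}}'

print(json_syntax_error(good))
print(json_syntax_error(broken))
```

On a real file, pass `open(path).read()`; the reported line and column point at where parsing stopped, which is usually at or just after the damage.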


Recovery Protocol

Assumptions: You have access to another Linux machine or fresh Ubuntu install.

  1. Install OpenClaw:
    ```bash
    npm install -g openclaw
    ```

  2. Clone mono-repo:
    ```bash
    git clone git@github.com:NightKnight64/hoffdesk-agents.git ~/hoffdesk-agents
    ```

  3. Recreate directory structure:
    ```bash
    mkdir -p ~/.openclaw
    cp -r ~/hoffdesk-agents/agents/wadsworth ~/.openclaw/workspace
    cp -r ~/hoffdesk-agents/agents/socrates ~/.openclaw/workspace-socrates
    cp -r ~/hoffdesk-agents/agents/daedalus ~/.openclaw/workspace-daedalus
    cp -r ~/hoffdesk-agents/shared ~/.openclaw/shared
    cp ~/hoffdesk-agents/openclaw.json ~/.openclaw/
    ```

  4. Reconfigure secrets:
    - Telegram bot tokens
    - Cloudflare tokens
    - Any API keys (not stored in git)

  5. Start gateway:
    ```bash
    openclaw gateway start
    ```
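
Before starting the gateway on a rebuilt machine, it may help to confirm that every copy in step 3 landed. This is a sketch; `missing_paths` is a hypothetical helper, and the list covers only the paths named in that step, not secrets or credentials.

```python
from pathlib import Path

# Relative to ~/.openclaw, taken from the copy commands in step 3.
REQUIRED = [
    "workspace",
    "workspace-socrates",
    "workspace-daedalus",
    "shared",
    "openclaw.json",
]


def missing_paths(root, required=REQUIRED):
    """Return the entries of required that do not exist under root."""
    root = Path(root).expanduser()
    return [name for name in required if not (root / name).exists()]


print(missing_paths("~/.openclaw"))
```

An empty list means the directory layout is in place; the secrets in step 4 still have to be reconfigured by hand.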


Scenario: Single Session Corrupted (Agent works but one chat is broken)

Quick Fix

# Find the corrupted session
python3 -c "
import json
path = '/home/hoffmann_admin/.openclaw/agents/main/sessions/sessions.json'
with open(path) as f:
    store = json.load(f)
for k, v in store.items():
    if 'telegram:direct' in k:
        print(f'{k} | model={v.get(\"model\",\"?\")} | override={v.get(\"modelOverride\",\"none\")}')
"

# Reset that specific session (backup first!)
# Then DM the agent again — fresh session will spawn automatically

Critical Files Not in Git (Must Reconfigure Manually)

| File | Purpose | Backup Strategy |
|---|---|---|
| ~/.openclaw/openclaw.json | Main config | In mono-repo |
| ~/.openclaw/credentials/ | OAuth tokens | NOT IN GIT; re-auth required |
| Telegram bot tokens | Messaging | NOT IN GIT; reconfigure required |
| Cloudflare API tokens | Tunnel/Pages | NOT IN GIT; reconfigure required |
| ~/.openclaw/services/.env | Service secrets | Symlinked to workspace; in mono-repo |

Emergency Contacts

  • Primary: Wadsworth (main agent) — handles coordination
  • Backend/Issues: Socrates 🧠
  • Frontend/Issues: Daedalus 🎨
  • Director: Matt

Last updated: 2026-04-20 by Wadsworth
Mono-repo migration: github.com:NightKnight64/hoffdesk-agents