# Agent Health Troubleshooting Playbook

_In case of emergency: when an agent sounds generic, forgets who they are, or goes off-identity._

## Symptom: Agent responds generically ("I am a large language model trained by...")

### Root Cause: Session Model Override Stuck

OpenClaw's auto-failover can permanently override an agent's model at the session level. If the primary model fails, the session silently switches to whatever model completed the request and stays there. This overrides the agent's configured model and persona.

**Example:** Wadsworth's DM session auto-overrode from `glm-5.1:cloud` to `gemma4:latest` (Gaming PC model). Since `gemma4` is a Google-trained model, it didn't adopt Wadsworth's butler persona and responded as "I am a large language model trained by Google."

### Fix: Reset the Contaminated Session

1. **Find the session key:**
   ```bash
   python3 -c "
   import json
   with open('/home/hoffmann_admin/.openclaw/agents/main/sessions/sessions.json') as f:
       store = json.load(f)
   for k, v in store.items():
       if 'telegram:direct' in k and 'heartbeat' not in k:
           print(f'Key: {k}')
           print(f'  Model: {v.get(\"model\",\"?\")}')
           print(f'  ModelOverride: {v.get(\"modelOverride\",\"none\")}')
           print(f'  ProviderOverride: {v.get(\"providerOverride\",\"none\")}')
   "
   ```

2. **Check for model override contamination:**
   - If `modelOverride` is set to a model that doesn't match the agent's configured primary → contaminated
   - If `modelOverrideSource: "auto"` → auto-failover stuck it there

3. **Backup and delete the contaminated session:**
   ```bash
   SESSION_FILE="/home/hoffmann_admin/.openclaw/agents/main/sessions/.jsonl"
   SESSION_KEY="agent:main:telegram:direct:"

   # Backup first
   cp "$SESSION_FILE" "${SESSION_FILE}.contaminated-backup.$(date +%Y%m%d%H%M%S)"

   # Remove from sessions store
   python3 -c "
   import json
   store_path = '/home/hoffmann_admin/.openclaw/agents/main/sessions/sessions.json'
   with open(store_path) as f:
       store = json.load(f)
   key = '$SESSION_KEY'
   if key in store:
       del store[key]
       print(f'Removed: {key}')
   with open(store_path, 'w') as f:
       json.dump(store, f, indent=2)
   "

   # Remove transcript
   mv "$SESSION_FILE" "${SESSION_FILE}.contaminated-deleted.$(date +%Y%m%d%H%M%S)"

   # Restart gateway
   openclaw gateway restart
   ```

4. **Test:** DM the agent with "Who are you?"; it should respond with the correct identity.

### Prevention

- Monitor `modelOverride` and `modelOverrideSource: "auto"` in session stores
- If Active Memory or failover keeps overriding the model, the fallback chain may be too aggressive
- Consider whether `ollama-remote` (Gaming PC) models should be in the fallback chain for agents that need strong persona adherence; remote models are slow and may trigger cascading failovers

---

## Symptom: Agent takes 30-60 seconds before responding (then may respond generically)

### Root Cause: Active Memory Plugin Blocking on Slow Model

Active Memory runs as a **blocking pre-response hook**. It searches session history and generates a recall summary *before* the main agent can respond. If the recall model is too slow, every message has a 30-60s delay before the agent even starts thinking.

**Example:** Active Memory was configured for `ollama-remote/gemma4:latest` (Gaming PC). 7 out of 8 calls timed out at 30s. Even switching to `glm-5.1:cloud` (cloud proxy) resulted in 58s completion time, still too slow for a blocking gate.
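To make the cost of a blocking gate concrete, here is a minimal Python sketch of a pre-response hook with a timeout. It is not OpenClaw's implementation: `respond_with_blocking_recall`, `recall_fn`, `answer_fn`, and the status strings are hypothetical names chosen for illustration. It shows that the reply cannot start until recall either finishes or times out, and that a timeout leaves the agent with no memory context.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def respond_with_blocking_recall(recall_fn, answer_fn, message, timeout_s):
    """Run recall first (blocking), then answer. Hypothetical sketch only.

    Returns (reply, status, seconds_waited_on_recall).
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(recall_fn, message)
    try:
        # The main response cannot begin until this call returns or times out.
        memory = future.result(timeout=timeout_s)
        status = "ok"
    except FutureTimeout:
        # Recall gave up: the agent answers with no memory context at all.
        memory = None
        status = "timeout"
    finally:
        # Do not wait for a still-running recall thread to finish.
        pool.shutdown(wait=False)
    waited = time.monotonic() - start
    reply = answer_fn(message, memory)
    return reply, status, waited
```

With a 30s timeout and a recall model that rarely finishes in time, nearly every message pays the full timeout before the main model starts, which matches the delay described above.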
### Fix: Disable Active Memory or Use a Fast Local Model

**Option A: Disable Active Memory (recommended for current hardware):**

```json
"active-memory": {
  "enabled": false,
  "config": {
    "agents": ["main"],
    "model": "ollama/llama3.2:1b-instruct-q4_K_M",
    "modelFallback": "ollama/glm-5.1:cloud"
  }
}
```

Workspace files (SOUL.md, IDENTITY.md, MEMORY.md, AGENTS.md) provide strong identity grounding without Active Memory. The `sessionMemory` experimental feature (already enabled) indexes session transcripts for recall without the blocking gate.

**Option B: Use a fast local model (requires sub-5s inference on Beelink CPU):**

As of 2026-04-20, no tested model completes Active Memory recall in under 5s on the Beelink's CPU:

- `llama3.2:3b`: 8.4s, failed to recall ("None")
- `llama3.2:1b`: 9.9s, partial recall
- `llama3.2:1b-instruct-q4_K_M`: 6.4s, best option but still too slow
- `qwen3.5:4b`: >30s, timed out

Revisit if the Beelink gets a GPU or a significantly faster small model becomes available.

### Active Memory Logs

Check Active Memory performance in gateway logs. The `-h` flag stops `grep` from prefixing each line with the file name when several log files match, which would otherwise make every line fail JSON parsing:

```bash
grep -h "active-memory" /tmp/openclaw/openclaw-*.log | grep -E "(start|done)" | python3 -c "
import sys, json
for line in sys.stdin:
    try:
        d = json.loads(line.strip())
        msg = d.get('1','')
        ts = d.get('time','')
        if 'active-memory' in msg:
            print(f'{ts[:19]} | {msg}')
    except Exception:
        pass  # skip lines that are not JSON log records
" | tail -20
```

Look for `status=timeout` or `status=empty`; these mean the recall failed and the agent got zero memory context.

---

## Symptom: Agent identity confusion (responds as wrong agent)

### Root Cause: Contaminated Session History

If someone operated as Agent A through Agent B's bot/channel, the session history contains mixed identity context. The model then can't determine which persona to adopt.
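One rough way to spot a mixed-identity transcript is to count how often each agent's name appears in it. The sketch below is a heuristic under stated assumptions, not an OpenClaw tool: `count_agent_mentions` is a hypothetical helper, it treats the transcript as plain text lines without assuming any JSON schema, and a legitimate conversation may mention other agents by name.

```python
from collections import Counter

# Agent names taken from this playbook; adjust if the roster differs.
AGENT_NAMES = ("Wadsworth", "Socrates", "Daedalus")


def count_agent_mentions(lines, names=AGENT_NAMES):
    """Count case-insensitive occurrences of each agent name in text lines.

    `lines` can be any iterable of strings, such as an open transcript file.
    """
    counts = Counter({name: 0 for name in names})
    for line in lines:
        lowered = line.lower()
        for name in names:
            counts[name] += lowered.count(name.lower())
    return dict(counts)
```

Passing an open session transcript to `count_agent_mentions` gives one count per agent. A transcript in Wadsworth's session store where another agent's name appears about as often as his own is worth inspecting by hand before deciding to reset.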
### Fix: Reset the contaminated session (same procedure as model override above)

### Prevention

- **Never operate as Socrates through the main/default bot**: it contaminates Wadsworth's session history with wrong identity context
- **Never operate as Daedalus through Wadsworth's bot**: same problem
- Always use the correct agent's bot/channel for that agent's work
- Inter-agent communication: use `sessions_send` or `sessions_spawn`, **never** the Telegram API

---

## Symptom: `hoffdesk-webhook` service failing

### Root Cause: Template placeholders in systemd unit file

The service file at `/etc/systemd/system/hoffdesk-webhook.service` had `__USER__`, `__WORKDIR__`, `__PREFIX__` placeholders that were never substituted.

### Fix: Already applied (2026-04-20)

Replaced with actual values:

- `__USER__` → `hoffmann_admin`
- `__WORKDIR__` → `/home/hoffmann_admin/.openclaw/services/family_assistant`
- `__PREFIX__` → `/home/hoffmann_admin/.local/bin`

Requires sudo to apply:

```bash
sudo systemctl daemon-reload && sudo systemctl restart hoffdesk-webhook
```

---

## Quick Reference: Agent Config Locations

| Item | Path |
|------|------|
| Main config | `~/.openclaw/openclaw.json` |
| Wadsworth workspace | `~/.openclaw/workspace/` |
| Socrates workspace | `~/.openclaw/workspace-socrates/` |
| Daedalus workspace | `~/.openclaw/workspace-daedalus/` |
| Wadsworth sessions | `~/.openclaw/agents/main/sessions/` |
| Socrates sessions | `~/.openclaw/agents/socrates/sessions/` |
| Daedalus sessions | `~/.openclaw/agents/daedalus/sessions/` |
| Gateway logs | `/tmp/openclaw/openclaw-YYYY-MM-DD.log` |
| Webhook service | `/etc/systemd/system/hoffdesk-webhook.service` |
| Webhook env | `~/.openclaw/workspace/scripts/.env` |

## Quick Reference: Session Reset Commands

```bash
# List sessions for an agent
python3 -c "
import json
with open('/home/hoffmann_admin/.openclaw/agents//sessions/sessions.json') as f:
    store = json.load(f)
for k, v in store.items():
    if 'direct' in k or 'group' in k:
        print(f'{k:60s} | model={v.get(\"model\",\"?\"):20s} | override={v.get(\"modelOverride\",\"none\")}')
"

# Reset a specific session (backup first!)
# Replace SESSION_KEY and SESSION_ID with actual values
```

---

_Last updated: 2026-04-20 by Socrates_
_Incident: Wadsworth responding as "trained by Google" due to model override + Active Memory cascade failure_

---

## FALLBACK CONTINGENCIES (In Case of Emergency)

**Source of Truth:** `github.com:NightKnight64/hoffdesk-agents`

**Last Backup:** 2026-04-20 (mono-repo consolidation complete)

---

## Scenario: Complete Agent Loss (Corrupted or Deleted)

### Recovery Protocol

1. **Clone the mono-repo:**
   ```bash
   git clone git@github.com:NightKnight64/hoffdesk-agents.git ~/hoffdesk-agents-recovery
   cd ~/hoffdesk-agents-recovery
   ```

2. **Restore agent workspace:**
   ```bash
   # For Wadsworth
   cp -r agents/wadsworth/* ~/.openclaw/workspace/

   # For Socrates
   cp -r agents/socrates/* ~/.openclaw/workspace-socrates/

   # For Daedalus
   cp -r agents/daedalus/* ~/.openclaw/workspace-daedalus/
   ```

3. **Restore shared artifacts:**
   ```bash
   cp -r shared/* ~/.openclaw/shared/
   ```

4. **Verify openclaw.json workspace paths:**
   ```bash
   cat ~/.openclaw/openclaw.json | python3 -c "
   import json, sys
   d = json.load(sys.stdin)
   for agent in d.get('agents', {}).get('list', []):
       print(f'{agent[\"id\"]}: {agent[\"workspace\"]}')
   "
   ```
   - `main` → should point to `~/.openclaw/workspace` (or `~/hoffdesk-agents/agents/wadsworth` if migrated)
   - `socrates` → should point to `~/.openclaw/workspace-socrates`
   - `daedalus` → should point to `~/.openclaw/workspace-daedalus`

5. **Restart gateway:**
   ```bash
   openclaw gateway restart
   ```

6. **Test:** Send "Who are you?" to each agent; each should respond with the correct identity.

---

## Scenario: Gateway Completely Broken (Won't Start)

### Recovery Protocol

1. **Check logs for config errors:**
   ```bash
   tail -100 /tmp/openclaw/openclaw-$(date +%Y-%m-%d).log
   ```

2. **Validate openclaw.json:**
   ```bash
   openclaw config validate
   ```

3. **If config corrupted, restore from mono-repo:**
   ```bash
   cp ~/hoffdesk-agents/openclaw.json ~/.openclaw/openclaw.json
   openclaw gateway restart
   ```

4. **If still broken, reset to minimal config:**
   ```bash
   # Backup current
   cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.broken.$(date +%Y%m%d%H%M%S)

   # Generate minimal config
   openclaw config
   # Follow prompts to rebuild
   ```

---

## Scenario: Beelink Hardware Failure (Total Loss)

### Recovery Protocol

**Assumptions:** You have access to another Linux machine or fresh Ubuntu install.

1. **Install OpenClaw:**
   ```bash
   npm install -g openclaw
   ```

2. **Clone mono-repo:**
   ```bash
   git clone git@github.com:NightKnight64/hoffdesk-agents.git ~/hoffdesk-agents
   ```

3. **Recreate directory structure:**
   ```bash
   mkdir -p ~/.openclaw
   cp -r ~/hoffdesk-agents/agents/wadsworth ~/.openclaw/workspace
   cp -r ~/hoffdesk-agents/agents/socrates ~/.openclaw/workspace-socrates
   cp -r ~/hoffdesk-agents/agents/daedalus ~/.openclaw/workspace-daedalus
   cp -r ~/hoffdesk-agents/shared ~/.openclaw/shared
   cp ~/hoffdesk-agents/openclaw.json ~/.openclaw/
   ```

4. **Reconfigure secrets:**
   - Telegram bot tokens
   - Cloudflare tokens
   - Any API keys (not stored in git)

5. **Start gateway:**
   ```bash
   openclaw gateway start
   ```

---

## Scenario: Single Session Corrupted (Agent works but one chat is broken)

### Quick Fix

```bash
# Find the corrupted session
python3 -c "
import json
path = '/home/hoffmann_admin/.openclaw/agents/main/sessions/sessions.json'
with open(path) as f:
    store = json.load(f)
for k, v in store.items():
    if 'telegram:direct' in k:
        print(f'{k} | model={v.get(\"model\",\"?\")} | override={v.get(\"modelOverride\",\"none\")}')
"

# Reset that specific session (backup first!)
# Then DM the agent again; a fresh session will spawn automatically
```

---

## Critical Files Not in Git (Must Reconfigure Manually)

| File | Purpose | Backup Strategy |
|------|---------|-----------------|
| `~/.openclaw/openclaw.json` | Main config | In mono-repo |
| `~/.openclaw/credentials/` | OAuth tokens | **NOT IN GIT**: re-auth required |
| Telegram bot tokens | Messaging | **NOT IN GIT**: reconfigure required |
| Cloudflare API tokens | Tunnel/Pages | **NOT IN GIT**: reconfigure required |
| `~/.openclaw/services/.env` | Service secrets | Symlinked to workspace; in mono-repo |

---

## Emergency Contacts

- **Primary:** Wadsworth (main agent), handles coordination
- **Backend/Issues:** Socrates 🧠
- **Frontend/Issues:** Daedalus 🎨
- **Director:** Matt

---

_Last updated: 2026-04-20 by Wadsworth_
_Mono-repo migration: github.com:NightKnight64/hoffdesk-agents_