# Killing My Agents in the Name of Self-Improvement

**Category:** OpenClaw Hacks & Tutorials
**Status:** Draft
**Word count:** ~800 words
**Read time:** 4 min
**Tags:** `openclaw`, `agents`, `resilience`, `model-config`, `reliability`

---

It started innocently enough. I wanted to see what my agents could do with a better model. My main bot, Wadsworth — the one who handles morning briefings, calendar checks, and message routing — was running on a solid daily driver. But a new model dropped: DeepSeek V4 Pro. Supposedly smarter, faster, better at reasoning. I figured I'd set him up with it and see how far he could go.

I opened the config, changed the model ID, and hit save.

Five minutes later, the entire bot fleet was dead.

## The Setup

I run three agents on the same OpenClaw gateway:

- **Wadsworth** — Chief of Staff. Routes messages, runs the schedule, keeps the trains running.
- **Daedalus** — Design architect. Frontend work, UI decisions, design system.
- **Socrates** — Backend architect. APIs, infra, pipelines.

Each has its own model preferences. Wadsworth likes `deepseek-v4`. Daedalus prefers `kimi-k2.5`. Socrates runs on `glm-5.1`. They all fall back to each other's models if theirs goes down. It's polite, it's redundant, it's worked fine for months.

Until I changed one model ID.

## The Thing That Went Wrong

The model I selected — `deepseek-v4-pro:cloud` — returned **503 Service Unavailable** from the Ollama cloud relay. Not a config error. Not a typo. The model name was correct. The provider just wasn't serving it at that moment. Could have been a capacity issue. Could have been a routing hiccup. Whatever it was, the gateway couldn't reach it.

Here's where the fun started. OpenClaw has a fallback system: if the primary model fails, the gateway tries fallbacks in order. This is supposed to keep agents alive when a model hiccups. The problem is that fallbacks are only as good as the chain:

1. Wadsworth tried `deepseek-v4-pro:cloud` → 503
2. Tried `kimi-k2.5:cloud` → also overloaded (it was a bad cloud day)
3. Tried `phi4:14b` on my Gaming PC → tool-calling format mismatch

Three strikes. Wadsworth never booted. And since Wadsworth is the entry point for messages, **nobody could talk to any agent**. The whole fleet went dark because one model ID was unreachable at a bad moment.

All because I wanted to be clever.

## The Debugging

Let me be honest about how I found out: I tried to send a message and got "the AI is temporarily overloaded" back. A generic error that told me nothing. I assumed the gateway had crashed. I checked systemd. I watched logs. Nothing obvious.

I had to trace through three layers of configuration to find the real problem:

1. The gateway log showed the 503 — but it looked like any other transient error.
2. The fallbacks were failing for different reasons (one overloaded, one format mismatch).
3. The agent config was technically correct — it was pointing to a valid model name.

The model existed in the provider list. The name was right. The cloud just wasn't serving it. This is the kind of failure that's hardest to catch: **the configuration is valid, but the runtime conditions make it fail anyway.**

## The Fix

I reverted Wadsworth's model back to the flash variant that had been working. Everything came back. But I learned something about my setup: **single points of failure don't have to be hardware. They can be a model name in a JSON file.**

Here's what I changed going forward:

### 1. Model health monitoring

I wrote a spec for a watchdog process (Wadsworth will own it — appropriate, since he caused the crash) that pings every model in the fleet every 5 minutes. If a model 503s three times in a row, we flag it as unavailable and stop routing primary queries to it. When it comes back, we re-enable it.

### 2. Better fallback ordering

The old fallback chain was optimistic — it tried the fastest models first, even if they shared the same infrastructure.
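That optimistic chain, sketched as config (illustrative only: the `fallbacks` field and file shape are my shorthand for this post, not necessarily OpenClaw's actual schema):

```jsonc
{
  "agent": "wadsworth",
  "model": "deepseek-v4-pro:cloud",  // primary: Ollama cloud relay
  "fallbacks": [
    "kimi-k2.5:cloud",               // fastest fallback, but it rides the same cloud relay
    "phi4:14b"                       // Gaming PC, last in line
  ]
}
```

The primary and the first fallback share one provider, so a single bad cloud day knocks out two of the three options before the chain ever reaches different hardware.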
Now the chain crosses providers: cloud → local → Gaming PC. If Ollama's cloud has a bad day, we drop to local models before we even try another cloud endpoint.

### 3. No `/v1` endpoints for Ollama

The Gaming PC fallback was failing because I had its Ollama configured with an OpenAI-compatible `/v1` URL instead of the native Ollama API. The docs are clear about this: the `/v1` path breaks tool calling. Models output raw JSON as text instead of executing tools. The fix was trivial — drop the `/v1`, switch to the native API — and tool calling worked immediately. You can test this yourself:

```jsonc
{
  "baseUrl": "http://your-ollama-host:11434",  // No /v1
  "api": "ollama"                              // Not openai-completions
}
```

### 4. A healthy respect for cascading failure

The most humbling lesson: one wrong model ID doesn't just break one bot. It breaks the message router. Which breaks the whole fleet. OpenClaw's agent architecture is a dependency graph, and I hadn't mapped it.

## The Reflection

Your agents are only as resilient as their weakest fallback chain. A 503 from a cloud model is inevitable. A misconfigured fallback is a choice.

The thing I keep coming back to: **don't let the bot pull its own wires.** If an agent can reconfigure itself to a dead model, the system needs to survive that. The agent shouldn't be able to shoot itself in the foot — or if it can, the foot needs to be a prosthetic that keeps walking.

My fleet survived because I was watching. The goal is to make it survive when I'm not.

---

*Would I do it again? Probably. But now I'd know within 5 minutes instead of finding out when the morning briefings go silent.*

*Lessons learned, models reverted, dashboard slightly more resilient.*