Killing My Agents in the Name of Self-Improvement
Category: OpenClaw Hacks & Tutorials
Status: Draft
Word count: ~800 words
Read time: 4 min
Tags: openclaw, agents, resilience, model-config, reliability
It started innocently enough. I wanted to see what my agents could do with a better model.
My main bot, Wadsworth — the one who handles morning briefings, calendar checks, and routing messages — was running on a solid daily driver. But a new model dropped. DeepSeek V4 Pro. Supposedly smarter, faster, better at reasoning. I figured I'd set him up with it and see how far he could go.
I opened the config, changed the model ID, and hit save.
Five minutes later, the entire bot fleet was dead.
The Setup
I run three agents on the same OpenClaw gateway:
- Wadsworth — Chief of Staff. Routes messages, runs the schedule, keeps the trains running.
- Daedalus — Design architect. Frontend work, UI decisions, design system.
- Socrates — Backend architect. APIs, infra, pipelines.
Each has its own model preferences. Wadsworth likes deepseek-v4. Daedalus prefers kimi-k2.5. Socrates runs on glm-5.1. They all fall back to each other's models if theirs goes down. It's polite, it's redundant, it's worked fine for months.
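On disk, that arrangement looks something like this. This is a sketch, not the literal OpenClaw schema; the field names here are illustrative:

```json
{
  "agents": {
    "wadsworth": { "model": "deepseek-v4", "fallbacks": ["kimi-k2.5", "glm-5.1"] },
    "daedalus":  { "model": "kimi-k2.5",   "fallbacks": ["glm-5.1", "deepseek-v4"] },
    "socrates":  { "model": "glm-5.1",     "fallbacks": ["deepseek-v4", "kimi-k2.5"] }
  }
}
```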
Until I changed one model ID.
The Thing That Went Wrong
The model I selected — deepseek-v4-pro:cloud — returned 503 Service Unavailable from the Ollama cloud relay.
Not a config error. Not a typo. The model name was correct. The provider just wasn't serving it at that moment. Could have been a capacity issue. Could have been a routing hiccup. Whatever it was, the gateway couldn't reach it.
Here's where the fun started.
OpenClaw has a fallback system. If the primary model fails, the gateway tries fallbacks in order. This is supposed to keep agents alive when a model hiccups. The problem is that fallbacks are only as good as the chain:
- Wadsworth tried deepseek-v4-pro:cloud → 503
- Tried kimi-k2.5:cloud → also overloaded (it was a bad cloud day)
- Tried phi4:14b on my Gaming PC → tool calling format mismatch
Three strikes. Wadsworth never booted. And since Wadsworth is the entry point for messages, nobody could talk to any agent. The whole fleet went dark because one model ID was wrong at a bad moment.
All because I wanted to be clever.
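For the curious, the fallback walk the gateway does is conceptually just a loop. A minimal sketch, assuming a try-in-order policy like OpenClaw's; the names are mine, not OpenClaw's:

```python
class ModelError(Exception):
    """Any per-model failure: a 503, an overload, a tool-call format mismatch."""

class FleetDown(Exception):
    """Raised when every model in the fallback chain has failed."""

def complete(prompt: str, chain: list[str], call_model) -> str:
    """Try each model in order; give up only once the whole chain is exhausted."""
    errors = []
    for model in chain:
        try:
            return call_model(model, prompt)
        except ModelError as err:
            errors.append((model, err))
    raise FleetDown(errors)  # three strikes, and the entry-point agent never boots
```

The shape of the code is the problem: there's no health state, so a dead primary gets retried from scratch on every message, and exhausting the chain takes the agent down with it.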
The Debugging
Let me be honest about how I found out: I tried to send a message and got "the AI is temporarily overloaded" back. A generic error that told me nothing. I assumed the gateway had crashed. I checked systemd. I watched logs. Nothing obvious.
I had to trace through three layers of configuration to find the real problem:
- The gateway log showed the 503 — but it looked like any other transient error
- The fallbacks were failing for different reasons (one overloaded, one format mismatch)
- The agent config was technically correct — it was pointing to a valid model name
The model existed in the provider list. The name was right. The cloud just wasn't serving it.
This is the kind of failure that's hardest to catch: the configuration is valid, but the runtime conditions make it fail anyway.
The Fix
I reverted Wadsworth's model to the flash variant that had been working. Everything came back. But I learned something about my setup:
Single points of failure don't have to be hardware. They can be a model name in a JSON file.
Here's what I changed going forward:
1. Model health monitoring
I wrote a spec for a watchdog process (Wadsworth will own it — appropriate, since he caused the crash) that pings every model in the fleet every 5 minutes. If a model 503s three times in a row, we flag it as unavailable and stop routing primary queries to it. When it comes back, we re-enable it.
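The spec boils down to a loop like this. A minimal sketch, assuming the gateway exposes some way to ping a model and to flip its routing flag (ping and set_available are stand-ins, not real OpenClaw APIs):

```python
import time

PING_INTERVAL = 300  # seconds: ping every model every 5 minutes
STRIKE_LIMIT = 3     # consecutive failures before a model gets benched

def watchdog(models, ping, set_available):
    """ping(model) -> True on success; set_available(model, bool) flips routing."""
    strikes = {m: 0 for m in models}
    while True:
        for model in models:
            if ping(model):
                if strikes[model] >= STRIKE_LIMIT:
                    set_available(model, True)   # it recovered: re-enable routing
                strikes[model] = 0
            else:
                strikes[model] += 1
                if strikes[model] == STRIKE_LIMIT:
                    set_available(model, False)  # three strikes: stop routing to it
        time.sleep(PING_INTERVAL)
```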
2. Better fallback ordering
The old fallback chain was optimistic — it tried the fastest models first, even if they shared the same infrastructure. Now the chain crosses providers: cloud → local → Gaming PC. If Ollama's cloud has a bad day, we drop to local models before we even try another cloud endpoint.
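Wadsworth's new chain, in the same sketch format as before (the provider-suffix naming is mine, purely illustrative, and "local-model" is a placeholder):

```json
{
  "model": "deepseek-v4:cloud",
  "fallbacks": [
    "local-model",          // same box as the gateway: survives a cloud outage
    "phi4:14b@gaming-pc",   // a different machine entirely
    "kimi-k2.5:cloud"       // only then another cloud endpoint
  ]
}
```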
3. No /v1 endpoints for Ollama
The Gaming PC fallback was failing because I had its Ollama configured with an OpenAI-compatible /v1 URL instead of the native Ollama API. The docs are clear about this: the /v1 path breaks tool calling. Models output raw JSON as text instead of executing tools. The fix was trivial — drop the /v1, switch to native API — and tool calling worked immediately.
You can test this yourself:
```json
{
  "baseUrl": "http://your-ollama-host:11434",  // No /v1
  "api": "ollama"                              // Not openai-completions
}
```
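To confirm the native API is reachable, you can hit Ollama's /api/tags endpoint, which lists installed models (the host below is a placeholder):

```python
import json
import urllib.request

# Native Ollama API: GET /api/tags lists installed models.
# If this works but your config points at /v1/..., tool calling is likely broken.
with urllib.request.urlopen("http://your-ollama-host:11434/api/tags") as resp:
    models = json.load(resp)["models"]
print([m["name"] for m in models])
```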
4. A healthy respect for cascading failure
The most humbling lesson: one wrong model ID doesn't just break one bot. It breaks the message router. Which breaks the whole fleet. OpenClaw's agent architecture is a dependency graph, and I hadn't mapped it.
The Reflection
Your agents are only as resilient as their weakest fallback chain. A 503 from a cloud model is inevitable. A misconfigured fallback is a choice.
The thing I keep coming back to: don't let the bot pull its own wires. If an agent can reconfigure itself to a dead model, the system needs to survive that. The agent shouldn't be able to shoot itself in the foot — or if it can, the foot needs to be a prosthetic that keeps walking.
My fleet survived because I was watching. The goal is to make it survive when I'm not.
Would I do it again? Probably. But now I'd know within 5 minutes instead of finding out when the morning briefings go silent.
Lessons learned, models reverted, dashboard slightly more resilient.