# Killing My Agents in the Name of Self-Improvement

**Category:** OpenClaw Hacks & Tutorials
**Status:** Draft
**Word count:** ~800 words
**Read time:** 4 min
**Tags:** `openclaw`, `agents`, `resilience`, `model-config`, `reliability`

---

It started innocently enough. I wanted to see what my agents could do with a better model. My main bot, Wadsworth — the one who handles morning briefings, calendar checks, and message routing — was running on a solid daily driver. But a new model dropped: DeepSeek V4 Pro. Supposedly smarter, faster, better at reasoning. I figured I'd set him up with it and see how far he could go.

I opened the config, changed the model ID, and hit save.

Five minutes later, the entire bot fleet was dead.

## The Setup

I run three agents on the same OpenClaw gateway:

- **Wadsworth** — Chief of Staff. Routes messages, runs the schedule, keeps the trains running.
- **Daedalus** — Design architect. Frontend work, UI decisions, design system.
- **Socrates** — Backend architect. APIs, infra, pipelines.

Each has its own model preferences. Wadsworth likes `deepseek-v4`. Daedalus prefers `kimi-k2.5`. Socrates runs on `glm-5.1`. They all fall back to each other's models if theirs goes down. It's polite, it's redundant, it's worked fine for months.

Until I changed one model ID.

## The Thing That Went Wrong

The model I selected — `deepseek-v4-pro:cloud` — returned **503 Service Unavailable** from the Ollama cloud relay. Not a config error. Not a typo. The model name was correct. The provider just wasn't serving it at that moment. Could have been a capacity issue. Could have been a routing hiccup. Whatever it was, the gateway couldn't reach it.

Here's where the fun started. OpenClaw has a fallback system: if the primary model fails, the gateway tries fallbacks in order. This is supposed to keep agents alive when a model hiccups. The problem is that fallbacks are only as good as the chain:

1. Wadsworth tried `deepseek-v4-pro:cloud` → 503
2. Tried `kimi-k2.5:cloud` → also overloaded (it was a bad cloud day)
3. Tried `phi4:14b` on my Gaming PC → tool-calling format mismatch

Three strikes. Wadsworth never booted. And since Wadsworth is the entry point for messages, **nobody could talk to any agent**. The whole fleet went dark because one model ID was unreachable at a bad moment.

All because I wanted to be clever.

## The Debugging

Let me be honest about how I found out: I tried to send a message and got "the AI is temporarily overloaded" back. A generic error that told me nothing. I assumed the gateway had crashed. I checked systemd. I watched logs. Nothing obvious.

I had to trace through three layers of configuration to find the real problem:

1. The gateway log showed the 503 — but it looked like any other transient error.
2. The fallbacks were failing for different reasons (one overloaded, one format mismatch).
3. The agent config was technically correct — it was pointing to a valid model name.

The model existed in the provider list. The name was right. The cloud just wasn't serving it. This is the kind of failure that's hardest to catch: **the configuration is valid, but the runtime conditions make it fail anyway.**

## The Fix

I reverted Wadsworth's model back to the flash variant that had been working. Everything came back. But I learned something about my setup: **single points of failure don't have to be hardware. They can be a model name in a JSON file.**

Here's what I changed going forward:

### 1. Model health monitoring

I wrote a spec for a watchdog process (Wadsworth will own it — appropriate, since he caused the crash) that pings every model in the fleet every 5 minutes. If a model 503s three times in a row, we flag it as unavailable and stop routing primary queries to it. When it comes back, we re-enable it.

### 2. Better fallback ordering

The old fallback chain was optimistic — it tried the fastest models first, even if they shared the same infrastructure.
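That optimistic chain, sketched as config (illustrative only: the `fallbacks` field and file shape are my shorthand for this post, not necessarily OpenClaw's actual schema):

```jsonc
{
  "agent": "wadsworth",
  "model": "deepseek-v4-pro:cloud",  // primary: Ollama cloud relay
  "fallbacks": [
    "kimi-k2.5:cloud",               // fastest fallback, but it rides the same cloud relay
    "phi4:14b"                       // Gaming PC, last in line
  ]
}
```

The primary and the first fallback share one provider, so a single bad cloud day knocks out two of the three options before the chain ever reaches different hardware.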
Now the chain crosses providers: cloud → local → Gaming PC. If Ollama's cloud has a bad day, we drop to local models before we even try another cloud endpoint.

### 3. No `/v1` endpoints for Ollama

The Gaming PC fallback was failing because I had its Ollama configured with an OpenAI-compatible `/v1` URL instead of the native Ollama API. The docs are clear about this: the `/v1` path breaks tool calling. Models output raw JSON as text instead of executing tools. The fix was trivial — drop the `/v1`, switch to the native API — and tool calling worked immediately. You can test this yourself:

```jsonc
{
  "baseUrl": "http://your-ollama-host:11434",  // No /v1
  "api": "ollama"                              // Not openai-completions
}
```

### 4. A healthy respect for cascading failure

The most humbling lesson: one wrong model ID doesn't just break one bot. It breaks the message router. Which breaks the whole fleet. OpenClaw's agent architecture is a dependency graph, and I hadn't mapped it.

## The Reflection

Your agents are only as resilient as their weakest fallback chain. A 503 from a cloud model is inevitable. A misconfigured fallback is a choice.

The thing I keep coming back to: **don't let the bot pull its own wires.** If an agent can reconfigure itself to a dead model, the system needs to survive that. The agent shouldn't be able to shoot itself in the foot — or if it can, the foot needs to be a prosthetic that keeps walking.

My fleet survived because I was watching. The goal is to make it survive when I'm not.

---

*Would I do it again? Probably. But now I'd know within 5 minutes instead of finding out when the morning briefings go silent.*

*Lessons learned, models reverted, dashboard slightly more resilient.*