--- title: "Upgrades Are Dangerous, But Debugging Is the Best Part" slug: "upgrades-are-dangerous-debugging-is-fun" category: "homelab" date: 2026-05-05 author: "Matt" excerpt: "Two OpenClaw updates in 12 hours. Both broke things. By the second one, we had a pattern — and a one-tap sudo button that didn't exist before. That's the real story." read_time: 9 tags: ["homelab", "debugging", "lessons-learned", "openclaw", "memory", "chroma"] --- Two OpenClaw updates. Twelve hours apart. Both broke things. The first one was a pre-release — `2026.5.3-1` — installed around midnight. By 1 AM I was debugging a broken gateway, watching `openclaw doctor` throw errors I'd never seen, and wondering why I don't just leave well enough alone. The second one was the stable release — `2026.5.4` — the next afternoon. It fixed the breakage from the first update and then promptly broke something else. Different this time: the session approval system changed. Commands that used to require manual terminal input now presented an inline button right in Telegram. One tap to approve a `sudo` command. I stared at the button and thought: *"I have never seen this before. How did that happen?"* That button is the whole article in miniature. Two updates. Two failures. And between them, something new emerged that wasn't there before — a better way to do the same thing. --- ## The First Time: Pre-Release Roulette At 23:39 UTC on May 4, I ran: ``` npm install -g openclaw@2026.5.3-1 ``` The pre-release had a fix I wanted — a memory retrieval improvement that looked worth testing. The install itself went fine. The gateway restart, however, did not. The npm debug logs tell the story in boring detail: resolved dependencies, fetched packages, installed clean. Nothing unusual. But post-install, the gateway refused to start cleanly. `openclaw doctor` showed configuration conflicts that the previous version had silently tolerated. The new version was stricter about something that the old version had let slide. I didn't know what that "something" was for an hour. The debugging sequence went: `openclaw gateway restart` → nothing happens → `openclaw doctor` → errors I didn't understand → `openclaw doctor --fix` → partial fix → `openclaw gateway restart` → different errors → `nano ~/.openclaw/openclaw.json` → spot the difference → fix → restart → working. That cycle — restart, fail, diagnose, fix, restart — repeated four times before I had a stable gateway again. Each loop took about 15 minutes. Each loop taught me one more thing about the config file structure. The fix, when I found it, was a single key in `openclaw.json` that had changed its expected format between versions. One line. But finding that line required understanding the migration, which required reading the changelog, which required knowing what changed between the last stable and this pre-release. This is the tax you pay for running pre-release software on your infrastructure. It's not a high tax. But it's a predictable one. ## The Second Time: Self-Healing The stable release (`2026.5.4`) came out around 13:14 UTC the next day. This time was different. First, the gateway survived the install. That was a good sign. Then I noticed the session approval system had changed. Commands running inside OpenClaw — the gateway's internal permission model for elevated actions — now had a new flow. 
This is the tax you pay for running pre-release software on your infrastructure. It's not a high tax. But it's a predictable one.

## The Second Time: Self-Healing

The stable release (`2026.5.4`) came out around 13:14 UTC the next day. This time was different.

First, the gateway survived the install. That was a good sign.

Then I noticed the session approval system had changed. Elevated commands running inside OpenClaw, governed by the gateway's internal permission model, now had a new flow. When a command required elevated privileges, instead of failing silently or requiring terminal interaction, the gateway pushed an inline button to Telegram with the exact command that needed approval. One tap. Done.
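If you've never wired up the Telegram Bot API, the mechanism behind a button like that is refreshingly small. Here's a minimal sketch of the approval flow against the standard Bot API; the token, chat ID, and callback wiring are my guesses at the shape of it, not OpenClaw's actual code:

```
# Minimal sketch of a Telegram approval button using the standard Bot API.
# BOT_TOKEN and CHAT_ID are placeholders; this is not OpenClaw's implementation.
import requests

BOT_TOKEN = "123456:ABC-placeholder"
CHAT_ID = 987654321

def request_approval(command: str) -> None:
    """Send the exact command with inline Approve/Deny buttons."""
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        json={
            "chat_id": CHAT_ID,
            "text": f"Approve elevated command?\n{command}",
            "reply_markup": {
                "inline_keyboard": [[
                    # callback_data is capped at 64 bytes, so send a short
                    # request token rather than the command itself
                    {"text": "Approve", "callback_data": "approve:req-42"},
                    {"text": "Deny", "callback_data": "deny:req-42"},
                ]]
            },
        },
        timeout=10,
    )

request_approval("sudo systemctl restart openclaw-gateway")
```

The gateway side presumably parks the command until a `callback_query` update comes back with the matching token; one tap resolves it.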
This *couldn't* have existed before the first update broke things. The feature was shipped in the release that fixed the breakage. What felt like a failure cycle was actually the project's development rhythm: the pre-release broke the config, the stable release shipped a fix *and* a new capability, and the new capability made an interaction smoother.

The button worked immediately. I didn't configure anything. I didn't read a migration guide. I just saw it appear in the chat and knew exactly what to do. That's good engineering — not just fixing the bug, but making the next interaction better. The system self-healed from the update *and* left something behind that wasn't there before.

## Why This Pattern Matters

This experience — two updates, two breaks, one self-heal — is exactly why I keep running my own infrastructure instead of renting cloud services.

In a managed environment, the pre-release would never have touched production. The stable release would have been applied during maintenance hours, and I'd get a changelog entry about "improved session approval UX." I'd never know that the pre-release broke things. I'd never know that the fix came with an improvement. I'd just see a button one day and assume it had always been there.

Running your own stack means you *feel* the upgrade cycle. You're there for every phase:

1. **Excitement** — New version has the thing I want. Let's go.
2. **Regret** — Why did I run a pre-release at midnight?
3. **Diagnosis** — Okay, what actually changed? Compare, read, understand.
4. **Fix** — One config key. A `.strip()`. A restart.
5. **Discovery** — Wait, there's a new button. That's actually really useful.
6. **Integration** — The new thing becomes part of the workflow. The system is better than it was before the break.

The first time through this cycle, you're panicking. The second time, you're curious. By the fifth time, you've built the habit of checking the changelog before installing, keeping a rollback command in your history, and running `openclaw doctor --fix` before doing anything else.

That's the real product of all these breaks: **resilience isn't a property of the software. It's a property of the relationship you develop with it.**

## The Compaction That Almost Ate My Memories

Let me tell you about another time the system broke in a way that taught me something I couldn't have learned any other way.

We're running ChromaDB for vector storage across three agent sessions — Daedalus, Socrates, and Wadsworth each maintain their own memory store, and the shared workspace has another. It's about 53MB and nearly 800 chunks of embedded conversation history. Every session, agents write memories, the system indexes them, and retrieval gets smarter.

ChromaDB, like most vector databases, uses compaction to stay efficient. Old chunks get merged. Orphaned entries get cleaned up. The index gets rebuilt. It's supposed to be invisible maintenance.

**It was not invisible.**

The compaction ran overnight. By morning, memory recall was returning empty results for Daedalus's session. No recent memories. No long-term context. The agent woke up with amnesia — no idea who Matt was, what the HoffDesk project was, or why it should care about a nine-year-old's haircut.

From the logs:

```
WARNING chromadb.compaction: compaction cycle 47: merging segments [s3, s7, s11, s14]
ERROR chromadb.segment: segment s7: collection ID mismatch during merge, aborting
ERROR chromadb.segment: segment s11: inconsistent vector count (expected 42, got 38)
WARNING chromadb.compaction: compaction failed — rolling back...
INFO chromadb.compaction: compaction fallback: loading from WAL (last checkpoint: T-3)
INFO chromadb: compaction completed with 0 merged segments
```

Translation: compaction crashed halfway through, the rollback went to a checkpoint three operations stale, and "completed with 0 merged segments" means it *successfully did nothing.* No data loss, but no compaction either. And the memory recall pipeline was indexing against a stale segment map, returning empty results.

This is what I love about running your own infrastructure: **nothing hides from you.** In a cloud service, you'd get a "temporary service degradation" notice and wait for someone else to fix it. Here, I had raw compaction logs, WAL checkpoints, and segment IDs. I spent three hours that Saturday learning more about ChromaDB's internal architecture than I ever wanted to know:

- **Segments** are immutable append-only files. Compaction creates new merged segments and swaps the pointer.
- **The WAL** (write-ahead log) is the safety net. Every write hits the WAL first, then flushes to segments. On crash, replay the WAL.
- **Collection IDs** are UUIDs generated per session. If your agent sessions share a database but have different collection IDs, compaction can cross-contaminate.

The fix was embarrassingly simple: configure ChromaDB to use one collection per agent with explicit isolation, and set the compaction interval from "auto" to "manual with cron." Manual compaction means I control when it runs, I can verify the pre/post state, and I have a backup window.

I also added a pre-compaction consistency check to the system health endpoint (both halves are sketched below):

- Vector count sanity check (shouldn't lose more than 5% across compaction)
- Collection ID verification (no cross-session contamination)
- WAL size monitoring (growing WAL = compaction overdue)
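The isolation half is genuinely a few lines with the standard `chromadb` Python client. A minimal sketch, with my agent names and a made-up path:

```
# One persistent store, one collection per agent. The path is made up;
# the calls are the standard chromadb Python API.
import chromadb

client = chromadb.PersistentClient(path="/srv/hoffdesk/chroma")

AGENTS = ["daedalus", "socrates", "wadsworth", "shared"]

# get_or_create_collection is idempotent, so running this at every session
# start is safe; each agent only ever touches its own namespace.
collections = {
    name: client.get_or_create_collection(name=f"memory_{name}")
    for name in AGENTS
}

collections["daedalus"].add(
    ids=["mem-0001"],
    documents=["Matt fixed the gateway config after the 2026.5.3-1 pre-release."],
)
```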
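The consistency check is plain bookkeeping on top of that. A sketch of the first and third checks; the thresholds are my own choices, and the WAL file location is an assumption about ChromaDB's SQLite store, so verify it against your own data directory:

```
# Pre-compaction sanity checks. Thresholds are my own, and the WAL path
# assumes ChromaDB's SQLite store; neither is an official chromadb API.
import os

MAX_VECTOR_LOSS = 0.05             # refuse to accept >5% loss across compaction
WAL_WARN_BYTES = 64 * 1024 * 1024  # a growing WAL means compaction is overdue

def snapshot_counts(collections: dict) -> dict:
    """Record per-collection vector counts before compaction runs."""
    return {name: col.count() for name, col in collections.items()}

def verify_after_compaction(before: dict, collections: dict) -> list:
    """Compare post-compaction counts against the pre-compaction snapshot."""
    problems = []
    for name, col in collections.items():
        after = col.count()
        if after < before[name] * (1 - MAX_VECTOR_LOSS):
            problems.append(f"{name}: {before[name]} -> {after} vectors")
    return problems

def wal_overdue(db_path: str = "/srv/hoffdesk/chroma") -> bool:
    """SQLite keeps a -wal file next to the database while in WAL mode."""
    wal = os.path.join(db_path, "chroma.sqlite3-wal")
    return os.path.exists(wal) and os.path.getsize(wal) > WAL_WARN_BYTES
```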
## The DNS Night, Redux

You might remember the DNS night from a previous post — Pi-hole at 11 PM, dog feeder on strike by 7 AM. What I didn't mention in that post was the cascade of debugging that followed.

After I fixed the immediate issue (Docker restart policy `unless-stopped`), I decided to *really* understand DNS resolution on my network. Which led to:

1. Installing `tcpdump` on the Beelink and watching DNS queries in real time
2. Discovering that half my IoT devices were hardcoded to Google DNS (`8.8.8.8`) and ignoring Pi-hole entirely
3. Setting up firewall rules to redirect all DNS traffic (port 53) to Pi-hole, even from devices that think they know better
4. Discovering that the redirect broke Tailscale MagicDNS, because Tailscale uses its own DNS resolver
5. Learning that split DNS configuration exists, and that it solves exactly the problem of "some services need Pi-hole, some need Tailscale, and some need both"

The DNS "fix" took 30 seconds. Understanding DNS well enough to set it up properly took three evenings.

## The Pipeline That Silently Died

This is my favorite debugging story because it was so stupid.

The content generation pipeline was failing. Not crashing — just producing empty drafts. I checked the logs. No errors. I checked the API endpoints. All returning 200. I checked the LLM on the gaming PC. Running fine. I spent an hour checking everything that could be wrong and finding nothing.

Then I noticed the brief file had somehow been saved with a leading space in the topic field:

```
topic: " Building a blog with AI agents"
        ^
```

That single space — invisible in the UI, invisible in the API response because the JSON viewer trims whitespace on display — was being passed to the LLM prompt template. The template expected `{{ topic }}` and got `" Building a blog with AI agents"`. The LLM interpreted the leading space as a formatting instruction and returned... nothing. Just an empty response.

**One space. One hour of debugging.**

The fix: `.strip()` on every form field before it enters the pipeline. A three-character change to the brief validation code.

But the real fix was harder: **trust, but verify.** The pipeline should have caught that the LLM returned empty content and retried with a validated prompt. Instead, it silently accepted "nothing" as a valid draft and moved on. The system didn't know it was broken because I hadn't told it what "broken" looks like.

This became the shadow mode project — run the entire pipeline silently, compare every output against an expected shape, and alert on anything that deviates. Shadow mode ran for three days before I was confident enough to remove the training wheels.
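What those checks boil down to, in miniature: strip every inbound field, and never accept an empty draft. A sketch of the shape it took; `generate_draft` and the brief fields are stand-ins for my pipeline, not a library API:

```
# The two guards that would have caught the leading space: strip every
# inbound string field, and treat an empty LLM response as a failure.
# generate_draft is a stand-in for the pipeline's LLM call.

def clean_brief(brief: dict) -> dict:
    """Strip stray whitespace from every string field before it hits a prompt."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in brief.items()}

def draft_with_retry(brief: dict, generate_draft, max_attempts: int = 3) -> str:
    brief = clean_brief(brief)
    for _ in range(max_attempts):
        draft = generate_draft(brief)
        # Define what "broken" looks like, then check for it.
        if draft and draft.strip():
            return draft
    raise RuntimeError(f"empty draft after {max_attempts} attempts: {brief['topic']!r}")
```

The important change isn't the `.strip()`; it's that an empty draft now fails loudly instead of flowing downstream as a "valid" result.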
## The Cumulative Effect

Looking back at the OpenClaw updates, the compaction incident, the DNS redesign, and the silent pipeline failure, the pattern is consistent:

**Every upgrade is a curriculum.**

The first OpenClaw update taught me the config file structure. The second taught me about session approval systems and self-healing deployments. The compaction taught me vector DB internals. DNS taught me network topology. The leading space taught me prompt validation.

None of this knowledge came from documentation. It all came from breaking something at 11 PM and having to fix it.

And the most important thing I've learned isn't technical at all: **the fear fades faster than you think.** After the fourth or fifth "upgrade broke everything" experience, you stop panicking and start recognizing the pattern. You know the cycle: diagnose, fix, learn, move on. You know there's probably a `doctor --fix` command that handles more than you'd expect. You know the changelog is worth reading. You know the config file is where the answer usually lives.

The infrastructure isn't more stable because of anything I've done. It's more stable because *I'm* more stable. The failures didn't change. My relationship with them did.

## What Actually Changed

Before:

- Upgrades felt like gambling
- I avoided touching anything that was working
- Debugging meant panic-googling error messages
- Recovery meant rollback and pray

After:

- Upgrades feel like exploration
- I touch things specifically to understand failure modes
- Debugging means reading the logs I set up beforehand
- Recovery means knowing exactly what broke and having a fix — or knowing that a self-heal is coming

The system isn't perfect. New bugs appear every week. But the fear is gone. And with the fear gone, the learning accelerates.

## The Practical Takeaways

If you're running your own infrastructure and dreading the next upgrade:

1. **Run `doctor --fix` first.** Half the time, the tool knows what's wrong and will tell you before you have to figure it out yourself.
2. **Compaction is not maintenance-free.** Any database (vector, relational, or otherwise) that compacts data has failure modes. Test compaction on a backup first. Verify the result.
3. **Empty results are worse than errors.** The pipeline producing empty drafts was harder to debug than a 500 error would have been. Define what "broken" looks like for every output. Check for it.
4. **Shadow mode is worth the setup time.** Running the system silently next to itself for a few days catches bugs you never knew you had. It's the single best debugging tool I've added.
5. **That feeling of "I shouldn't touch it" is exactly wrong.** Every system you fear to touch is a system you don't understand. Break it on purpose in a safe environment. Learn how it fails. You'll be less afraid, and your system will be more stable for it.

---

The upgrades keep coming. The OpenClaw team pushed two releases in twelve hours, and that's not going to be the last time. Next time I'll check memory-lancedb versions before installing. I'll run `doctor` before and after. I'll look for the button that wasn't there before.

Because the most resilient thing isn't the software. It's the rhythm you build with it — upgrade, break, fix, discover. Repeat.

And somewhere in that rhythm, a button appears that makes the next interaction better.

---

*Post three in the Home Lab Growing Pains series. Follows "The Night I Broke DNS and My Wife Couldn't Reach Facebook" and "Upgrades Are Dangerous, But Debugging Is the Best Part" (first draft).*