In production, I do not update critical software on faith. I update with a rollback path.
This is the technical report of a real update of OpenClaw, the runtime I use to operate Nova, my personal AI assistant. In less than 24 hours I ran two attempts on the same system:
- the first one ended in plugin failure and rollback;
- the second one ended in a successful update after an upstream fix;
- in between, functional verification exposed another bug I was not looking for: one Slack session was pinned to Gemini through persisted state.
This is not a generic best-practices post. It is the record of what I ran, what I saw in the logs, how I made the decision, and what remains as the checklist for the next maintenance window.
Operational summary
| Field | Value |
|---|---|
| System | OpenClaw in production |
| Primary use | Nova runtime |
| Critical channels | Slack and WhatsApp |
| Initial stable version | 2026.4.15 |
| First target | 2026.4.22 |
| First attempt result | Plugin failure and rollback |
| Second target | 2026.4.23 |
| Second attempt result | Successful update |
| Recovery time | Around 15 minutes |
| Additional finding | Slack session pinning to Gemini |
The important point: the protocol did not make everything go well. It did something more useful: it made the decision fast, returned the system to a stable state, and let me repeat the update once the right upstream fix was available.
Base protocol for update and rollback
My rule is simple: I do not install anything in production unless I know how to go back.
The protocol has six steps, always in this order:
1. Document the current version. That version is the return point. If you cannot return to an exact version, you do not have a rollback plan; you have a rollback intention.
2. Back up configuration and state. Before touching the system. Not after the failure. Not halfway through the incident. Before.
3. Review official documentation. Changelog, open issues, breaking changes, and related PRs. Not as a reading exercise, but to know what kind of risk you are accepting.
4. Install the new version.
5. Verify functionally. It is not enough for the process to start. It is not enough for `doctor` to return green. You have to test the real flows: Slack, WhatsApp, jobs, cron, APIs, queues, or whatever applies to the system.
6. If there is a serious anomaly, roll back immediately. Do not turn production into a lab. First return to the last known good state. Then diagnose.
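The mechanical steps can be sketched as a dry-run script. Package name, service name, and paths mirror the commands used in this report, but treat the whole thing as an illustration, not tooling:

```shell
#!/usr/bin/env sh
# Dry-run sketch of the protocol's mechanical steps. Package name,
# service name, and paths are assumptions; the script only prints the
# plan so it can be reviewed before the maintenance window.
# Step 3 (changelog and issue review) stays manual on purpose.
set -eu

PKG=openclaw
SERVICE=openclaw
ROLLBACK_VERSION=2026.4.15   # step 1: the documented return point

plan() {
  echo "backup:   cp -a \$HOME/.$PKG \$HOME/.$PKG-backup-\$(date +%Y%m%d)"  # step 2
  echo "install:  npm install -g $PKG@latest"                               # step 4
  echo "health:   $PKG doctor && systemctl restart $SERVICE"                # step 5a
  echo "verify:   exercise Slack, WhatsApp, and every other real flow"      # step 5b
  echo "rollback: npm install -g $PKG@$ROLLBACK_VERSION"                    # step 6
}

plan
```

Printing the plan first is deliberate: if the rollback line cannot be written, the update does not start.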
A service that logs ready can still be broken. That is why a health check does not replace functional verification.
Attempt 1: OpenClaw update from 2026.4.15 to 2026.4.22
The starting point was OpenClaw 2026.4.15, a version that had been stable for weeks. The target was 2026.4.22.
Before the update, I reviewed the official release log and recent issues. I saw signals around plugins, but no clear confirmation that my install would break. Since rollback was defined and the backup was done, I proceeded.
Base commands:
```shell
npm view openclaw@latest version
npm install -g openclaw@latest
openclaw doctor
systemctl restart openclaw
journalctl -u openclaw -n 100 --no-pager
```
After the install, `openclaw doctor` returned green. That was useful, but not enough.
Functional verification told the real story. Slack and WhatsApp started failing:
```
[whatsapp] channel startup failed: Cannot find package 'openclaw'
[slack] [default] channel exited: Cannot find package 'openclaw'
[slack] [default] auto-restart attempt 1/10 in 5s
```
The main service was not the issue. The issue was the plugin runtime. Plugins ran from an isolated directory and could not resolve the host openclaw package from there.
I went back to the issues and found three reports with the same symptom: #69837, #69842, and #67038. At that point, there was no official fix.
Decision criteria: why I rolled back
At that point I had two options:
- Force a manual workaround.
- Roll back.
I tested just enough to confirm that the workaround could make some pieces respond, but the resulting state was wrong: dependencies installed by hand inside the global package directory, paths not guaranteed, and a high chance of breaking again on the next update.
That was not an acceptable production state.
Decision matrix:
| Signal | Operational reading | Action |
|---|---|---|
| `doctor` green | Main binary responds | Not enough |
| Slack fails | Critical channel affected | Blocking |
| WhatsApp fails | Second critical channel affected | Blocking |
| Upstream issues open | Known bug without published fix | Do not patch blindly |
| Fragile manual workaround | Immediate technical debt | Reject |
| Previous version stable | Valid return point | Rollback |
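The matrix collapses to one rule: any blocking signal forces a rollback, no matter what the health check says. A hypothetical encoding, where the signal names and `ok`/`fail` values are mine, not an OpenClaw API:

```shell
# Hypothetical encoding of the decision matrix: a failed critical
# channel is blocking even when the health check is green.
decide() {
  doctor=$1; slack=$2; whatsapp=$3        # usage: decide ok|fail ok|fail ok|fail
  if [ "$slack" = fail ] || [ "$whatsapp" = fail ]; then
    echo rollback                         # a critical channel down is blocking
  elif [ "$doctor" = fail ]; then
    echo rollback                         # main binary broken
  else
    echo continue                         # green doctor AND real flows verified
  fi
}

decide ok fail fail   # attempt 1: doctor green, both channels broken -> rollback
```

Note that `doctor` never appears alone on the "continue" path: it can only confirm a good state, never declare one.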
Rollback executed:
```shell
npm install -g openclaw@2026.4.15
systemctl restart openclaw
openclaw doctor
journalctl -u openclaw -n 100 --no-pager
```
Then came the test that actually matters: messages through Slack and WhatsApp.
Result: production restored in about 15 minutes. No data loss. No corrupted state. No hidden patch waiting to break during the next maintenance window.
Rollback was not failure. It was operational control.
Additional finding: verification exposed another bug
After returning to 2026.4.15, I continued testing each channel. That is where another issue appeared, unrelated to the update: one Slack session was taking three minutes to answer. WhatsApp, using the same general configuration, responded immediately.
The log showed this lead:
```
[agent/embedded] embedded run failover decision:
reason=timeout
from=google/gemini-3.1-flash-lite-preview
```
The agent was starting the turn on Gemini, not on Claude Haiku, which was configured as the primary model.
The cause was in sessions.json:
```json
"agent:main:main": {
  "provider": "google",
  "model": "gemini-3.1-flash-lite-preview"
}
```
The Slack session had Gemini pinned as persisted state. At some earlier point, automatic failover saved that override. Later, the system did not return to the primary model even though Haiku was healthy again.
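A cheap way to catch this class of drift is to compare persisted session state against the configured primary. A minimal sketch, assuming the `sessions.json` layout above and an Anthropic primary provider; the real configuration keys may differ:

```shell
# Minimal pinning detector. The file layout mirrors the sessions.json
# snippet above; the primary provider name is an assumption. A sample
# file is written here so the sketch is self-contained.
cat > sessions.json <<'EOF'
{
  "agent:main:main": {
    "provider": "google",
    "model": "gemini-3.1-flash-lite-preview"
  }
}
EOF

PRIMARY_PROVIDER=anthropic   # assumed provider of the configured primary

# Count sessions whose persisted provider differs from the primary.
pinned=$(grep '"provider"' sessions.json | grep -vc "\"$PRIMARY_PROVIDER\"" || true)
echo "pinned sessions: $pinned"   # anything above 0 deserves a look
```

Run as part of functional verification, this turns a three-minute latency mystery into a one-line finding.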
That behavior was related to issue #22443 and to how OpenClaw handles Model Failover.
The fix was reversible:
1. Move the active transcript out of the path used by that session.
2. Remove the persisted entry from the index.
3. Restart/verify the channel.
4. Test a response from Slack.
Result: the next message answered in seven seconds.
This second bug was not directly related to the update. That is exactly why it matters. If verification had stopped at “the service is up”, the issue would have stayed hidden.
Attempt 2: OpenClaw update from 2026.4.15 to 2026.4.23
The next day, OpenClaw 2026.4.23 was released. This time, the changelog contained the change I needed: the plugin installer linked the host OpenClaw package so plugins declaring openclaw as a peer dependency could resolve their imports without bundling a duplicate package.
PR #70462 matched the bug that blocked the previous attempt. I also confirmed that issues #69837, #69842, and #67038 were covered by that change.
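To see why linking the host package fixes the resolution error, here is a toy reconstruction. The directory names are invented for illustration; the actual installer layout may differ:

```shell
# Toy reconstruction of the linking approach: symlink the host package
# into the plugin's isolated node_modules so require('openclaw')
# resolves without bundling a duplicate copy. Paths are invented.
mkdir -p host/node_modules/openclaw plugins/whatsapp/node_modules
echo '{"name":"openclaw","version":"2026.4.23"}' \
  > host/node_modules/openclaw/package.json

# Before the link, resolution from the plugin directory fails with
# "Cannot find package 'openclaw'". After it, the host copy is found:
ln -sf "$PWD/host/node_modules/openclaw" plugins/whatsapp/node_modules/openclaw
cat plugins/whatsapp/node_modules/openclaw/package.json
```

One shared copy also means the plugin and the host can never disagree about which OpenClaw version is running.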
I did not change the method. I repeated the protocol.
Available version:

```shell
npm view openclaw@latest version
```

Result:

```
2026.4.23
```

Documented rollback point: `2026.4.15` (`041266a`).

Full backup:

```
/home/openclaw/.openclaw-backup-20260425-0527
```

Install:

```shell
npm install -g openclaw@latest
```

Install repair and verification:

```shell
openclaw doctor --fix
```

Restart:

```shell
systemctl restart openclaw
```

Log validation:

```
gateway ready
Slack socket mode connected
Listening for personal WhatsApp inbound messages
```

Negative validation, zero matches for the attempt 1 failure signature:

```
Cannot find package 'openclaw'
```
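Negative validation is easy to automate: grep the post-restart log for the exact signature that broke attempt 1 and require zero matches. A sketch, with a stand-in file in place of the real `journalctl` output:

```shell
# Negative validation sketch: the known failure signature from attempt 1
# must be absent from the post-restart log. A sample log stands in for
# the real journalctl output.
cat > post-restart.log <<'EOF'
gateway ready
Slack socket mode connected
Listening for personal WhatsApp inbound messages
EOF

errors=$(grep -c "Cannot find package 'openclaw'" post-restart.log || true)
echo "known-error lines: $errors"   # must be 0 before declaring success
```

Every incident adds one more signature to this list, so the check gets stronger with each maintenance window.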
Functional test:
| Channel | Result |
|---|---|
| Slack | Immediate response |
| WhatsApp | Immediate response |
| Reported model | Haiku |
| Reported version | 2026.4.23 |
Another 15 minutes of work. This time they were not spent aborting the update, but closing it properly.
Reusable checklist for production updates
This checklist remains as the base for the next maintenance windows:
- [ ] Current version documented
- [ ] Rollback command defined
- [ ] Configuration and state backup completed
- [ ] Changelog reviewed
- [ ] Recent issues reviewed
- [ ] Breaking changes reviewed
- [ ] Related PRs reviewed if applicable
- [ ] Maintenance window defined
- [ ] Update installed
- [ ] Health check executed
- [ ] Restart validated
- [ ] Logs reviewed after restart
- [ ] Functional verification by channel or real flow
- [ ] Negative validation for known errors
- [ ] Comparison against upstream issues
- [ ] Decision made: continue, fix, or rollback
- [ ] Result documented
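The first boxes can even gate the rest. A hypothetical pre-flight check, with paths and version purely illustrative, refuses to start the update without a rollback target and an existing backup:

```shell
# Hypothetical pre-flight gate for the checklist: no update starts
# without a documented rollback target and an existing backup.
# Paths and version are illustrative, not a real deployment layout.
preflight() {
  rollback_version=$1
  backup_dir=$2
  [ -n "$rollback_version" ] || { echo "no rollback target documented"; return 1; }
  [ -d "$backup_dir" ]       || { echo "no backup at $backup_dir";      return 1; }
  echo "preflight ok: can return to $rollback_version"
}

mkdir -p /tmp/openclaw-backup-demo   # stand-in for the real backup directory
preflight 2026.4.15 /tmp/openclaw-backup-demo
```

A gate like this encodes the rule from the top of the post: no rollback path, no update.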
The most important line is this one:
- [ ] Functional verification by channel or real flow
That is where the plugin failure appeared. That is where session pinning appeared. That is where 2026.4.23 proved the fix worked.
What this proved
The same protocol produced two different results:
- with 2026.4.22, it detected a blocking bug and took me to rollback;
- with 2026.4.23, it confirmed the upstream fix and allowed the update to finish;
- in the middle, it exposed an old session pinning bug affecting Slack.
That is what a production update and rollback protocol is supposed to do.
The point is not that every update succeeds. The point is that every update has:
- a return point;
- a backup;
- evidence;
- functional testing;
- decision criteria;
- a clean exit path.
The protocol does not remove third-party bugs. It does not turn a bad release into a good one. It reduces uncertainty and shortens the time between detecting a problem and making the right decision.
Without rollback, a failed update becomes an incident.
With rollback, a failed update becomes a controlled maintenance event.
That is the difference between operating production and hoping production behaves.
By: Cesar Rosa Polanco - Written from a real experience, with artificial intelligence used as an editorial support tool