🔄 Production update and rollback: a real OpenClaw case

Technical report of two attempts in 24 hours: plugin failure, clean rollback, upstream fix, and successful update

In production, I do not update critical software on faith. I update with a rollback path.

This is the technical report of a real update of OpenClaw, the runtime I use to operate Nova, my personal AI assistant. In less than 24 hours I ran two attempts on the same system: a first update that broke the plugin channels and ended in a rollback, and a second update the next day that completed cleanly once the upstream fix was published.

This is not a generic best-practices post. It is the record of what I ran, what I saw in the logs, how I made the decision, and what remains as the checklist for the next maintenance window.

Operational summary

| Field | Value |
| --- | --- |
| System | OpenClaw in production |
| Primary use | Nova runtime |
| Critical channels | Slack and WhatsApp |
| Initial stable version | 2026.4.15 |
| First target | 2026.4.22 |
| First attempt result | Plugin failure and rollback |
| Second target | 2026.4.23 |
| Second attempt result | Successful update |
| Recovery time | Around 15 minutes |
| Additional finding | Slack session pinning to Gemini |

The important point: the protocol did not make everything go well. It did something more useful: it made the decision fast, returned the system to a stable state, and let me repeat the update once the right upstream fix was available.

Base protocol for update and rollback

My rule is simple: I do not install anything in production unless I know how to go back.

The protocol has six steps, always in this order:

  1. Document the current version.
    That version is the return point. If you cannot return to an exact version, you do not have a rollback plan; you have a rollback intention.

  2. Back up configuration and state.
    Before touching the system. Not after the failure. Not halfway through the incident. Before. (A short sketch of steps 1 and 2 follows this list.)

  3. Review official documentation.
    Changelog, open issues, breaking changes, and related PRs. Not as a reading exercise, but to know what kind of risk you are accepting.

  4. Install the new version.

  5. Verify functionally.
    It is not enough for the process to start. It is not enough for doctor to return green. You have to test the real flows: Slack, WhatsApp, jobs, cron, APIs, queues, or whatever applies to the system.

  6. If there is a serious anomaly, roll back immediately.
    Do not turn production into a lab. First return to the last known good state. Then diagnose.
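
To make steps 1 and 2 concrete, a minimal sketch for this kind of install. The paths, file names, and the .openclaw directory are illustrative assumptions, not the exact layout of this system:

# Step 1: record the exact installed version as the rollback target (illustrative path)
npm ls -g openclaw --depth=0 | tee /home/openclaw/rollback-target.txt

# Step 2: back up configuration and persisted state before touching anything (illustrative paths)
cp -a /home/openclaw/.openclaw /home/openclaw/.openclaw-backup-$(date +%Y%m%d-%H%M)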

A service that logs ready can still be broken. That is why a health check does not replace functional verification.

Attempt 1: OpenClaw update from 2026.4.15 to 2026.4.22

The starting point was OpenClaw 2026.4.15, a version that had been stable for weeks. The target was 2026.4.22.

Before the update, I reviewed the official release log and recent issues. I saw signals around plugins, but no clear confirmation that my install would break. Since rollback was defined and the backup was done, I proceeded.

Base commands:

npm view openclaw@latest version            # confirm what "latest" resolves to
npm install -g openclaw@latest              # install the new version globally
openclaw doctor                             # built-in health check
systemctl restart openclaw                  # restart the service on the new version
journalctl -u openclaw -n 100 --no-pager    # review the last 100 log lines

After the install, openclaw doctor returned green. That was useful, but not enough.

Functional verification told the real story. Slack and WhatsApp started failing:

[whatsapp] channel startup failed: Cannot find package 'openclaw'
[slack] [default] channel exited: Cannot find package 'openclaw'
[slack] [default] auto-restart attempt 1/10 in 5s

The main service was not the issue. The issue was the plugin runtime. Plugins ran from an isolated directory and could not resolve the host openclaw package from there.
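
A quick way to reproduce that resolution failure from the shell, as a sketch. The plugin directory below is a placeholder, not the real layout:

# Ask Node to resolve the host package from the plugin's directory;
# if the package is not reachable from there, this throws a module-not-found error.
node -e "console.log(require.resolve('openclaw', { paths: ['/path/to/plugin/dir'] }))"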

I went back to the issues and found three reports with the same symptom: #69837, #69842, and #67038. At that point, there was no official fix.

Decision criteria: why I rolled back

At that point I had two options:

  1. Force a manual workaround.
  2. Roll back.

I tested just enough to confirm that the workaround could make some pieces respond, but the resulting state was wrong: dependencies installed by hand inside the global package directory, paths not guaranteed, and a high chance of breaking again on the next update.

That was not an acceptable production state.

Decision matrix:

| Signal | Operational reading | Action |
| --- | --- | --- |
| doctor green | Main binary responds | Not enough |
| Slack fails | Critical channel affected | Blocking |
| WhatsApp fails | Second critical channel affected | Blocking |
| Upstream issues open | Known bug without published fix | Do not patch blindly |
| Fragile manual workaround | Immediate technical debt | Reject |
| Previous version stable | Valid return point | Rollback |

Rollback executed:

npm install -g openclaw@2026.4.15
systemctl restart openclaw
openclaw doctor
journalctl -u openclaw -n 100 --no-pager

Then came the test that actually matters: messages through Slack and WhatsApp.

Result: production restored in about 15 minutes. No data loss. No corrupted state. No hidden patch waiting to break during the next maintenance window.

Rollback was not failure. It was operational control.

Additional finding: verification exposed another bug

After returning to 2026.4.15, I continued testing each channel. That is where another issue appeared, unrelated to the update: one Slack session was taking three minutes to answer. WhatsApp, using the same general configuration, responded immediately.

The log showed this lead:

[agent/embedded] embedded run failover decision:
    reason=timeout
    from=google/gemini-3.1-flash-lite-preview

The agent was starting the turn on Gemini, not on Claude Haiku, which was configured as the primary model.

The cause was in sessions.json:

"agent:main:main": {
  "provider": "google",
  "model": "gemini-3.1-flash-lite-preview"
}

The Slack session had Gemini pinned as persisted state. At some earlier point, automatic failover saved that override. Later, the system did not return to the primary model even though Haiku was healthy again.
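
A quick way to audit the rest of the index for the same kind of pin, assuming sessions.json keeps the structure shown above:

# List every persisted session entry that still carries a provider/model override
jq 'to_entries[]
    | select(.value.model != null)
    | {session: .key, provider: .value.provider, model: .value.model}' sessions.json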

That behavior was related to issue #22443 and to how OpenClaw handles Model Failover.

The fix was reversible:

1. Move the active transcript out of the path used by that session.
2. Remove the persisted entry from the index.
3. Restart/verify the channel.
4. Test a response from Slack.
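
A command-level sketch of those four steps. Every path and file name here is illustrative; the real layout may differ:

# 1) Park the active transcript outside the path used by that session
mv /home/openclaw/.openclaw/transcripts/agent-main-main.jsonl \
   /home/openclaw/.openclaw/transcripts/agent-main-main.jsonl.parked

# 2) Remove the pinned entry from the index, keeping a backup of the file
cp sessions.json sessions.json.bak
jq 'del(."agent:main:main")' sessions.json.bak > sessions.json

# 3) Restart and confirm the channel reconnects cleanly
systemctl restart openclaw
journalctl -u openclaw -n 50 --no-pager

# 4) Send a message from Slack and check which model answers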

Result: the next message answered in seven seconds.

This second bug was not directly related to the update. That is exactly why it matters. If verification had stopped at “the service is up”, the issue would have stayed hidden.

Attempt 2: OpenClaw update from 2026.4.15 to 2026.4.23

The next day, OpenClaw 2026.4.23 was released. This time, the changelog contained the change I needed: the plugin installer linked the host OpenClaw package so plugins declaring openclaw as a peer dependency could resolve their imports without bundling a duplicate package.

PR #70462 matched the bug that blocked the previous attempt. I also confirmed that issues #69837, #69842, and #67038 were covered by that change.

I did not change the method. I repeated the protocol.

Available version:

npm view openclaw@latest version

Result:

2026.4.23

Documented rollback point:

2026.4.15 (041266a)

Full backup:

/home/openclaw/.openclaw-backup-20260425-0527

Install:

npm install -g openclaw@latest

Install repair/verification:

openclaw doctor --fix

Restart:

systemctl restart openclaw

Log validation:

gateway ready
Slack socket mode connected
Listening for personal WhatsApp inbound messages

Negative validation:

0 errors: Cannot find package 'openclaw'
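
One way to run that negative check, as a sketch; the time window is arbitrary:

# Count occurrences of the known failure signature in the recent log; expected output: 0
journalctl -u openclaw --since "15 min ago" --no-pager | grep -c "Cannot find package 'openclaw'"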

Functional test:

| Check | Result |
| --- | --- |
| Slack | Immediate response |
| WhatsApp | Immediate response |
| Reported model | Haiku |
| Reported version | 2026.4.23 |

Another 15 minutes of work. This time they were not spent aborting the update, but closing it properly.

Reusable checklist for production updates

This checklist remains as the base for the next maintenance windows:

[ ] Current version documented
[ ] Rollback command defined
[ ] Configuration and state backup completed
[ ] Changelog reviewed
[ ] Recent issues reviewed
[ ] Breaking changes reviewed
[ ] Related PRs reviewed if applicable
[ ] Maintenance window defined
[ ] Update installed
[ ] Health check executed
[ ] Restart validated
[ ] Logs reviewed after restart
[ ] Functional verification by channel or real flow
[ ] Negative validation for known errors
[ ] Comparison against upstream issues
[ ] Decision made: continue, fix, or rollback
[ ] Result documented

The most important line is this one:

[ ] Functional verification by channel or real flow

That is where the plugin failure appeared. That is where session pinning appeared. That is where 2026.4.23 proved the fix worked.
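
As a sketch of what that line can look like when partially scripted, using the log strings from this report. The unit name and time window are assumptions to adjust per system, and the script does not replace sending real messages:

#!/usr/bin/env bash
# Post-update smoke check: positive and negative log validation.
set -euo pipefail

LOG=$(journalctl -u openclaw --since "15 min ago" --no-pager)

# Positive validation: the critical channels came up
echo "$LOG" | grep -q "Slack socket mode connected" || { echo "Slack channel not up"; exit 1; }
echo "$LOG" | grep -q "Listening for personal WhatsApp inbound messages" || { echo "WhatsApp channel not up"; exit 1; }

# Negative validation: the known failure signature is absent
if echo "$LOG" | grep -q "Cannot find package 'openclaw'"; then
  echo "Plugin resolution error present: roll back"; exit 1
fi

echo "Log checks passed; now send a real message through each channel"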

What this proved

The same protocol produced two different results: a controlled rollback on the first attempt and a successful update on the second.

That is what a production update and rollback protocol is supposed to do.

The point is not that every update succeeds. The point is that every update has a documented return point, a backup taken before the change, functional verification of the real flows, and a clear decision path when something breaks.

The protocol does not remove third-party bugs. It does not turn a bad release into a good one. It reduces uncertainty and shortens the time between detecting a problem and making the right decision.

Without rollback, a failed update becomes an incident.
With rollback, a failed update becomes a controlled maintenance event.

That is the difference between operating production and hoping production behaves.


By: Cesar Rosa Polanco - Written from a real experience, with artificial intelligence used as an editorial support tool
