🔄 Production update and rollback: a real OpenClaw case

Technical report of two attempts in 24 hours: plugin failure, clean rollback, upstream fix, and successful update

In production, I do not update critical software on faith. I update with a rollback path.

This is the technical report of a real update of OpenClaw, the runtime I use to operate Nova, my personal AI assistant. In less than 24 hours I ran two attempts on the same system: a first update that broke the plugin channels and ended in a rollback, and a second update the next day that completed cleanly once the upstream fix was published.

This is not a generic best-practices post. It is the record of what I ran, what I saw in the logs, how I made the decision, and what remains as the checklist for the next maintenance window.

Operational summary

| Field | Value |
| --- | --- |
| System | OpenClaw in production |
| Primary use | Nova runtime |
| Critical channels | Slack and WhatsApp |
| Initial stable version | 2026.4.15 |
| First target | 2026.4.22 |
| First attempt result | Plugin failure and rollback |
| Second target | 2026.4.23 |
| Second attempt result | Successful update |
| Recovery time | Around 15 minutes |
| Additional finding | Slack session pinning to Gemini |

The important point: the protocol did not make everything go well. It did something more useful: it made the decision fast, returned the system to a stable state, and let me repeat the update once the right upstream fix was available.

Base protocol for update and rollback

My rule is simple: I do not install anything in production unless I know how to go back.

The protocol has six steps, always in this order:

  1. Document the current version.
    That version is the return point. If you cannot return to an exact version, you do not have a rollback plan; you have a rollback intention.

  2. Back up configuration and state.
    Before touching the system. Not after the failure. Not halfway through the incident. Before. (A short sketch of steps 1 and 2 follows this list.)

  3. Review official documentation.
    Changelog, open issues, breaking changes, and related PRs. Not as a reading exercise, but to know what kind of risk you are accepting.

  4. Install the new version.

  5. Verify functionally.
    It is not enough for the process to start. It is not enough for doctor to return green. You have to test the real flows: Slack, WhatsApp, jobs, cron, APIs, queues, or whatever applies to the system.

  6. If there is a serious anomaly, roll back immediately.
    Do not turn production into a lab. First return to the last known good state. Then diagnose.
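
To make steps 1 and 2 concrete, a minimal sketch for this kind of install. The paths, file names, and the .openclaw directory are illustrative assumptions, not the exact layout of this system:

# Step 1: record the exact installed version as the rollback target (illustrative path)
npm ls -g openclaw --depth=0 | tee /home/openclaw/rollback-target.txt

# Step 2: back up configuration and persisted state before touching anything (illustrative paths)
cp -a /home/openclaw/.openclaw /home/openclaw/.openclaw-backup-$(date +%Y%m%d-%H%M)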

A service that logs ready can still be broken. That is why a health check does not replace functional verification.

Attempt 1: OpenClaw update from 2026.4.15 to 2026.4.22

The starting point was OpenClaw 2026.4.15, a version that had been stable for weeks. The target was 2026.4.22.

Before the update, I reviewed the official release log and recent issues. I saw signals around plugins, but no clear confirmation that my install would break. Since rollback was defined and the backup was done, I proceeded.

Base commands:

npm view openclaw@latest version            # confirm what "latest" resolves to
npm install -g openclaw@latest              # install the new version globally
openclaw doctor                             # built-in health check
systemctl restart openclaw                  # restart the service on the new version
journalctl -u openclaw -n 100 --no-pager    # review the last 100 log lines

After the install, openclaw doctor returned green. That was useful, but not enough.

Functional verification told the real story. Slack and WhatsApp started failing:

[whatsapp] channel startup failed: Cannot find package 'openclaw'
[slack] [default] channel exited: Cannot find package 'openclaw'
[slack] [default] auto-restart attempt 1/10 in 5s

The main service was not the issue. The issue was the plugin runtime. Plugins ran from an isolated directory and could not resolve the host openclaw package from there.
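
A quick way to reproduce that resolution failure from the shell, as a sketch. The plugin directory below is a placeholder, not the real layout:

# Ask Node to resolve the host package from the plugin's directory;
# if the package is not reachable from there, this throws a module-not-found error.
node -e "console.log(require.resolve('openclaw', { paths: ['/path/to/plugin/dir'] }))"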

I went back to the issues and found three reports with the same symptom: #69837, #69842, and #67038. At that point, there was no official fix.

Decision criteria: why I rolled back

At that point I had two options:

  1. Force a manual workaround.
  2. Roll back.

I tested just enough to confirm that the workaround could make some pieces respond, but the resulting state was wrong: dependencies installed by hand inside the global package directory, paths not guaranteed, and a high chance of breaking again on the next update.

That was not an acceptable production state.

Decision matrix:

| Signal | Operational reading | Action |
| --- | --- | --- |
| doctor green | Main binary responds | Not enough |
| Slack fails | Critical channel affected | Blocking |
| WhatsApp fails | Second critical channel affected | Blocking |
| Upstream issues open | Known bug without published fix | Do not patch blindly |
| Fragile manual workaround | Immediate technical debt | Reject |
| Previous version stable | Valid return point | Rollback |

Rollback executed:

npm install -g openclaw@2026.4.15
systemctl restart openclaw
openclaw doctor
journalctl -u openclaw -n 100 --no-pager

Then came the test that actually matters: messages through Slack and WhatsApp.

Result: production restored in about 15 minutes. No data loss. No corrupted state. No hidden patch waiting to break during the next maintenance window.

Rollback was not failure. It was operational control.

Additional finding: verification exposed another bug

After returning to 2026.4.15, I continued testing each channel. That is where another issue appeared, unrelated to the update: one Slack session was taking three minutes to answer. WhatsApp, using the same general configuration, responded immediately.

The log showed this lead:

[agent/embedded] embedded run failover decision:
    reason=timeout
    from=google/gemini-3.1-flash-lite-preview

The agent was starting the turn on Gemini, not on Claude Haiku, which was configured as the primary model.

The cause was in sessions.json:

"agent:main:main": {
  "provider": "google",
  "model": "gemini-3.1-flash-lite-preview"
}

The Slack session had Gemini pinned as persisted state. At some earlier point, automatic failover saved that override. Later, the system did not return to the primary model even though Haiku was healthy again.
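
A quick way to audit the rest of the index for the same kind of pin, assuming sessions.json keeps the structure shown above:

# List every persisted session entry that still carries a provider/model override
jq 'to_entries[]
    | select(.value.model != null)
    | {session: .key, provider: .value.provider, model: .value.model}' sessions.json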

That behavior was related to issue #22443 and to how OpenClaw handles Model Failover.

The fix was reversible:

1. Move the active transcript out of the path used by that session.
2. Remove the persisted entry from the index.
3. Restart/verify the channel.
4. Test a response from Slack.
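
A command-level sketch of those four steps. Every path and file name here is illustrative; the real layout may differ:

# 1) Park the active transcript outside the path used by that session
mv /home/openclaw/.openclaw/transcripts/agent-main-main.jsonl \
   /home/openclaw/.openclaw/transcripts/agent-main-main.jsonl.parked

# 2) Remove the pinned entry from the index, keeping a backup of the file
cp sessions.json sessions.json.bak
jq 'del(."agent:main:main")' sessions.json.bak > sessions.json

# 3) Restart and confirm the channel reconnects cleanly
systemctl restart openclaw
journalctl -u openclaw -n 50 --no-pager

# 4) Send a message from Slack and check which model answers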

Result: the next message answered in seven seconds.

This second bug was not directly related to the update. That is exactly why it matters. If verification had stopped at “the service is up”, the issue would have stayed hidden.

Attempt 2: OpenClaw update from 2026.4.15 to 2026.4.23

The next day, OpenClaw 2026.4.23 was released. This time, the changelog contained the change I needed: the plugin installer linked the host OpenClaw package so plugins declaring openclaw as a peer dependency could resolve their imports without bundling a duplicate package.

PR #70462 matched the bug that blocked the previous attempt. I also confirmed that issues #69837, #69842, and #67038 were covered by that change.

I did not change the method. I repeated the protocol.

Available version:

npm view openclaw@latest version

Result:

2026.4.23

Documented rollback point:

2026.4.15 (041266a)

Full backup:

/home/openclaw/.openclaw-backup-20260425-0527

Install:

npm install -g openclaw@latest

Install repair/verification:

openclaw doctor --fix

Restart:

systemctl restart openclaw

Log validation:

gateway ready
Slack socket mode connected
Listening for personal WhatsApp inbound messages

Negative validation:

0 errors: Cannot find package 'openclaw'
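
One way to run that negative check, as a sketch; the time window is arbitrary:

# Count occurrences of the known failure signature in the recent log; expected output: 0
journalctl -u openclaw --since "15 min ago" --no-pager | grep -c "Cannot find package 'openclaw'"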

Functional test:

| Check | Result |
| --- | --- |
| Slack | Immediate response |
| WhatsApp | Immediate response |
| Reported model | Haiku |
| Reported version | 2026.4.23 |

Another 15 minutes of work. This time they were not spent aborting the update, but closing it properly.

Reusable checklist for production updates

This checklist remains as the base for the next maintenance windows:

[ ] Current version documented
[ ] Rollback command defined
[ ] Configuration and state backup completed
[ ] Changelog reviewed
[ ] Recent issues reviewed
[ ] Breaking changes reviewed
[ ] Related PRs reviewed if applicable
[ ] Maintenance window defined
[ ] Update installed
[ ] Health check executed
[ ] Restart validated
[ ] Logs reviewed after restart
[ ] Functional verification by channel or real flow
[ ] Negative validation for known errors
[ ] Comparison against upstream issues
[ ] Decision made: continue, fix, or rollback
[ ] Result documented

The most important line is this one:

[ ] Functional verification by channel or real flow

That is where the plugin failure appeared. That is where session pinning appeared. That is where 2026.4.23 proved the fix worked.
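
As a sketch of what that line can look like when partially scripted, using the log strings from this report. The unit name and time window are assumptions to adjust per system, and the script does not replace sending real messages:

#!/usr/bin/env bash
# Post-update smoke check: positive and negative log validation.
set -euo pipefail

LOG=$(journalctl -u openclaw --since "15 min ago" --no-pager)

# Positive validation: the critical channels came up
echo "$LOG" | grep -q "Slack socket mode connected" || { echo "Slack channel not up"; exit 1; }
echo "$LOG" | grep -q "Listening for personal WhatsApp inbound messages" || { echo "WhatsApp channel not up"; exit 1; }

# Negative validation: the known failure signature is absent
if echo "$LOG" | grep -q "Cannot find package 'openclaw'"; then
  echo "Plugin resolution error present: roll back"; exit 1
fi

echo "Log checks passed; now send a real message through each channel"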

What this proved

The same protocol produced two different results: a controlled rollback on the first attempt and a successful update on the second.

That is what a production update and rollback protocol is supposed to do.

The point is not that every update succeeds. The point is that every update has a documented return point, a backup taken before the change, functional verification of the real flows, and a clear decision path when something breaks.

The protocol does not remove third-party bugs. It does not turn a bad release into a good one. It reduces uncertainty and shortens the time between detecting a problem and making the right decision.

Without rollback, a failed update becomes an incident.
With rollback, a failed update becomes a controlled maintenance event.

That is the difference between operating production and hoping production behaves.


By: Cesar Rosa Polanco - Written from a real experience, with artificial intelligence used as an editorial support tool
