An OS That Bricks on Update Is Already Dead

“Updating, please do not turn off your device.”

Your cat scurries under the desk and steps on the surge protector. Oops! You might need to do a system restore. The fact that companies treat this as an acceptable risk is yet more evidence that it is time to make a new operating system.

Under the hood, the system is rewriting itself in place, on top of the only copy it has. The system does not know how to change itself safely, and simply hopes that nobody interrupts it. Mitigations have reduced the risk of catastrophic failure, but they have not eliminated it.

As with most operating system deficiencies, we have known how to do better for decades.

How we got here

The cartoon version of this story, where everyone is an idiot, is neither charitable nor completely accurate. That said, everyone is an idiot (except me).

In-place update is fairly simple to build: stop the service, overwrite the files, restart. Handling the edge cases is the painful part. The new code must accept the old on-disk state, and in-flight work must be drained first. This is brittle, and making it bug-free takes extreme discipline. With the wrong tools and the wrong language, it's nearly impossible to get right.

And for precisely these reasons, the field already moved past this.

Android and ChromeOS ship A/B partition schemes, sometimes called seamless updates. The idea: keep two complete copies of the system, slot A and slot B. You are running on A. The updater writes the whole new system into B while you keep using A, untouched. When B is fully written and verified, a tiny piece of state (a flag in the bootloader) flips to say “boot B next time.” Reboot, and you are on B. If B fails to boot a few times, the bootloader automatically rolls back by flipping the flag to A. The slot you were running was never modified, so there is always a known-good copy to fall back to.
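To make the flag protocol concrete, here is a toy model of it. This is a sketch, not Android's or ChromeOS's real boot control interface: the `BootControl` class, its field names, and the retry budget of 3 are all assumptions made for illustration (the real schemes only promise "a few" attempts), and a loop stands in for what would really be separate reboots.

```python
from dataclasses import dataclass

MAX_BOOT_ATTEMPTS = 3  # assumed retry budget, chosen for illustration


@dataclass
class BootControl:
    """Hypothetical bootloader state for a two-slot (A/B) scheme."""
    active: str = "A"      # slot the bootloader will try next
    fallback: str = "A"    # last slot known to boot successfully
    tries_left: int = 0    # remaining attempts for an unconfirmed slot

    def stage_update(self, verified: bool) -> None:
        """Flip the flag to the other slot, but only after verification."""
        if not verified:
            return  # a bad write never becomes bootable
        self.fallback = self.active
        self.active = "B" if self.active == "A" else "A"
        self.tries_left = MAX_BOOT_ATTEMPTS

    def boot(self, slot_is_healthy) -> str:
        """Try the active slot; fall back when the retry budget runs out.

        Each loop iteration stands in for one reboot attempt.
        """
        while self.active != self.fallback:
            if self.tries_left == 0:
                self.active = self.fallback  # automatic rollback
                break
            self.tries_left -= 1
            if slot_is_healthy(self.active):
                self.fallback = self.active  # new slot confirmed good
                break
        return self.active


bc = BootControl()
bc.stage_update(verified=True)          # flag now says "boot B next time"
print(bc.boot(lambda slot: False))      # B never comes up, so we land back on A
```

Notice that all the safety lives in three small fields. The slot contents are never consulted, which is exactly why the running slot can stay untouched.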

This works. It is atomic at the moment that matters (the flag flip), it is recoverable, and it removed the bricking failure mode for a billion phones. If you are going to copy anyone, copy this.

However, A/B pays a real tax to get safety. You carry two full system partitions, so you spend roughly double the system storage, even though the two slots hold nearly identical bytes sitting in separate places. And the atomicity comes from a separate bootloader flag protocol, bolted on beside the filesystem. The correct fix is at the filesystem level.

What if atomic, rollback-safe update were not a bolt-on feature, but baked into how the disk already works?

Immutable and content-addressed, so an update adds, never overwrites

Cathedral splits the disk into realms, independent rooted namespaces, and the one that holds the operating system is the system realm. The system realm is immutable and content-addressed. Two words doing enormous work, so let me unpack them.

Content-addressed means a file’s name in the store is the cryptographic hash of its bytes. The bytes determine the address. Identical bytes have the same address and are stored once. Directories are content-addressed too: a directory’s content is its list of name -> hash entries, so the directory has a hash, and that hash transitively names its entire subtree. The hash of the system realm’s root names the exact, complete operating system, top to bottom, in one 256-bit number.
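If that sounds abstract, it fits in a dozen lines. The sketch below is not Cathedral's on-disk format. It assumes SHA-256 as the hash (consistent with "one 256-bit number", but my choice here), an in-memory dict as the store, and a made-up text encoding for directories. The names `store`, `put_blob`, and `put_dir` are hypothetical.

```python
import hashlib

store: dict[str, bytes] = {}  # hypothetical object store: hex hash -> bytes


def put_blob(data: bytes) -> str:
    """Store bytes under the SHA-256 of their content; return that address."""
    address = hashlib.sha256(data).hexdigest()
    store[address] = data  # identical bytes land on the same key, stored once
    return address


def put_dir(entries: dict[str, str]) -> str:
    """Store a directory as its sorted 'name -> hash' listing.

    Assumed encoding: a real format would also need to tell blobs from
    directories and cope with awkward file names.
    """
    listing = "\n".join(f"{name} -> {h}" for name, h in sorted(entries.items()))
    return put_blob(listing.encode())


a = put_blob(b"hello")
b = put_blob(b"hello")  # same bytes, same address, no second copy
root = put_dir({"bin": put_dir({"sh": a}), "motd": b})
print(a == b, len(store))  # the store holds one blob and two directories
```

Change a single byte in `sh` and the hash of `bin` changes, and so does `root`. That is the sense in which one hash names the whole tree.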

Immutable means you never edit a stored object. You cannot. To “change” a file you write a new object with new bytes (new hash) and write new directory nodes along the path from that file up to the root, each pointing at the new child. Everything you did not touch keeps its old hash and is simply pointed at again. This is copy-on-write, and because the store is a Merkle DAG (a graph where each node’s hash is computed from its children’s hashes), the new root and the old root coexist, sharing every unchanged node between them.
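Here is that path-copying step as a sketch, under the same kind of assumptions: SHA-256, an in-memory dict for the store, and a JSON encoding I made up for directory nodes. `put` and `write_file` are hypothetical names, not Cathedral's API.

```python
import hashlib
import json

store = {}  # hypothetical store: hash -> immutable node (bytes, or dict of name -> hash)


def put(node) -> str:
    """Hash a node and store it; an existing object is never overwritten."""
    if isinstance(node, bytes):
        raw = b"blob:" + node
    else:
        raw = b"tree:" + json.dumps(node, sort_keys=True).encode()
    address = hashlib.sha256(raw).hexdigest()
    store.setdefault(address, node)
    return address


def write_file(root: str, path: list, data: bytes) -> str:
    """Return a NEW root in which `path` holds `data`; the old root stays valid."""
    tree = dict(store[root])  # copy the directory node, never edit the stored one
    name, rest = path[0], path[1:]
    if rest:
        child = tree.get(name) or put({})  # create missing directories on the way
        tree[name] = write_file(child, rest, data)
    else:
        tree[name] = put(data)
    return put(tree)


usr = put({"lib": put({"libc": put(b"libc v1")})})
old_root = put({"kernel": put(b"kernel v1"), "usr": usr})
new_root = write_file(old_root, ["kernel"], b"kernel v2")

# Both roots exist, and the untouched subtree is shared, not copied.
print(store[new_root]["usr"] == store[old_root]["usr"])
```

Only the nodes on the path from the changed file to the root get new hashes. Everything else is pointed at again.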

Now watch what an update becomes.

Installing a new OS version writes a new system tree: new objects for what changed, reused objects for the vast majority that did not, and a new root hash naming the whole new version. The old root still exists. It was never touched, because immutability means it cannot be. At no point on disk is there a half-old, half-new chimera. There is the old committed system, and there is the new committed system, and a single reference deciding which one is live.

That reference lives in the superblock, the tiny fixed header the boot chain reads first. It holds a root_ref: the hash of the system root to boot. Committing the update is pointing root_ref at the new root. Rolling back is pointing root_ref at the previous root.

Rollback is a pointer swap

Here is the punchline. Rollback is repointing the boot record at the previous known-good root. That is the whole operation.

The old system realm was never overwritten, so rolling back does not restore anything. Nothing gets copied back. Nothing gets reconstructed. You change which root hash the superblock names, commit that change, and on the next boot the machine comes up on the exact bytes it ran yesterday, because those bytes never left. This is closer to swapping a pointer than to restoring a backup, and the difference is not rhetorical. A restore is O(size of system). A pointer swap is O(1). Rollback is atomic and cheap because the storage model already kept both versions and already addresses them by a single hash.
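In code, the whole thing is almost embarrassingly small. This sketch assumes the superblock remembers earlier roots in a `history` list. The text only names `root_ref`, so `history`, `commit`, and `rollback` are my inventions for illustration, and the sketch does not model how the superblock write itself is made durable.

```python
from dataclasses import dataclass, field


@dataclass
class Superblock:
    """Toy superblock: one live reference plus assumed bookkeeping."""
    root_ref: str                                 # hash of the system root to boot
    history: list = field(default_factory=list)   # assumed: earlier known-good roots

    def commit(self, new_root: str) -> None:
        """Point root_ref at the new root. Cost does not depend on system size."""
        self.history.append(self.root_ref)
        self.root_ref = new_root

    def rollback(self) -> None:
        """Point root_ref back at the previous root. Nothing is copied."""
        if not self.history:
            raise RuntimeError("no previous root to roll back to")
        self.root_ref = self.history.pop()


sb = Superblock(root_ref="hash-of-v1")
sb.commit("hash-of-v2")   # update
sb.rollback()             # regret
print(sb.root_ref)        # back on v1, whose bytes never moved
```

There is no loop over files anywhere in `rollback`. That absence is the O(1).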

Compare the tax. A/B got atomic rollback by paying for two whole partitions up front and wiring a separate flag protocol to choose between them. Cathedral gets the same guarantee for free, because content-addressing already shares the unchanged bytes between versions (you are not paying double for two near-identical copies, you are paying once plus the delta) and the “which version is live” choice is already a field in the superblock the boot chain has to read anyway. A/B manufactures the property. Cathedral inherits it from the disk.

What happens if the power blinks

Now answer the question the progress bar exists to dodge. The power dies mid-update. What state is the machine in?

Two facts make this boring instead of terrifying.

First, the superblock only ever points at a committed root. A commit is a single atomic step (the transaction either lands or it does not), and until it lands, root_ref still names the old system. A new tree that was half-written when the lights went out is just a pile of unreferenced objects. Nothing points at it.

Second, on the next boot the store replays its log and stops at the last fully committed transaction, discarding the partial one. The half-written update is unreachable garbage, collected later, never booted. The machine comes up on the old system, exactly as if the update had never started.
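A sketch of that replay, with a log format I am assuming purely for illustration: `("set_root", txid, root_hash)` records a staged root, and `("commit", txid)` makes it live. Cathedral's actual log layout is not specified here, and `replay` is a hypothetical name.

```python
def replay(log, committed_root):
    """Return the root_ref in effect after replaying `log`.

    A staged root only takes effect once its commit record is seen,
    so a transaction torn by power loss changes nothing.
    """
    root_ref = committed_root
    staged = {}  # txid -> root hash written but not yet committed
    for record in log:
        kind, txid = record[0], record[1]
        if kind == "set_root":
            staged[txid] = record[2]
        elif kind == "commit" and txid in staged:
            root_ref = staged.pop(txid)
    return root_ref  # anything still in `staged` is unreachable garbage


torn = [("set_root", 7, "new-root-hash")]   # power died before the commit record
whole = torn + [("commit", 7)]

print(replay(torn, "old-root-hash"))    # old-root-hash
print(replay(whole, "old-root-hash"))   # new-root-hash
```

The two logs differ by one record. Everything before that record is preparation, and everything the boot chain cares about happens in it.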

So the genuine worst case is losing the most recent uncommitted work. Not a brick. Not a corrupt tree. The boot chain literally cannot select a half-written system, because the only thing it will follow is a committed root_ref, and a torn update never committed one. The failure the warning exists to dread is structurally impossible.

This is the same machinery the live system uses to move a folder to another drive or hot-swap a running component: drive to a safe point, switch a reference inside a transaction, and a power loss leaves either the old state or the new state, never a corrupt half. Update is an ordinary transaction that happens to be about the OS. There is no special dangerous mode.

The fallback that the update mechanism cannot kill

There is one more failure to handle: the update committed cleanly, the new system is genuinely broken, and rollback’s previous root is broken too. For that, Cathedral keeps a recovery image: a known-good fallback used when the primary path will not come up at all.

The hard requirement on a recovery image is the one everyone gets wrong. It has to be updatable enough to stay useful over the years, yet isolated enough that the very update mechanism it backs up cannot take it down along with the main system. A recovery partition that the normal updater can write to is a recovery partition the normal updater can corrupt. The thing that is supposed to save you must not share a failure with the thing it is saving you from.

Cathedral’s fuller answer reuses storage primitives instead of inventing a special partition. You can tag the system realm with a mirrored placement class: keep N copies of it across every enrolled drive. Because the system realm is immutable and content-addressed, mirroring it is trivial (copy once, the newest version wins, and a drive that already holds a given hash copies nothing). Every enrolled drive becomes a complete, self-contained boot drive. The recovery image is then just the degenerate one-copy case of “every drive can boot.” You are not maintaining a fragile special-purpose rescue partition. You are maintaining redundancy of a thing that, being immutable, is cheap to replicate and impossible to half-write.
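Why is mirroring trivial? Because "do I already have this?" is a dictionary lookup. A minimal sketch, assuming each drive's store is a dict from SHA-256 hex digest to bytes (my stand-in, not Cathedral's placement machinery), with `mirror` as a hypothetical helper:

```python
import hashlib


def mirror(source: dict, target: dict) -> int:
    """Copy objects the target lacks; return how many were copied."""
    copied = 0
    for address, data in source.items():
        if address in target:
            continue  # same hash means same bytes, so there is nothing to do
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError(f"object {address} does not match its hash")
        target[address] = data
        copied += 1
    return copied


def obj(data: bytes):
    """Build an (address, bytes) pair for the toy stores below."""
    return hashlib.sha256(data).hexdigest(), data


drive_a = dict([obj(b"kernel v2"), obj(b"libc v1")])
drive_b = dict([obj(b"libc v1")])      # already holds the unchanged object

print(mirror(drive_a, drive_b))        # 1: only the new kernel is copied
print(mirror(drive_a, drive_b))        # 0: a second pass copies nothing
```

Because the address is the hash, the receiving drive can also check every object it accepts. A copy interrupted halfway leaves complete objects or missing objects, never a half-written one that passes for whole.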

And the boot itself is measured

Knowing you can roll back is half the story. The other half is knowing whether you should: detecting that a boot went wrong, ideally before it hands control to something compromised.

Cathedral boots through a measured trust chain anchored in hardware. Each stage (firmware, bootloader, kernel, the privileged components) does two things to the next stage before running it: it verifies a signature, and it measures the stage by extending a hash of it into a hardware register that software cannot roll back or forge. The signature says the stage is authorized. The measurement records what actually booted, so the running state is both authorized and attestable. A stage that cannot verify the next one refuses to continue and drops into recovery rather than running something it cannot vouch for.
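The "extend" operation is worth seeing, because it is why the register cannot be rolled back. The sketch below uses the construction TPM platform configuration registers use, new = H(old || H(stage)), with SHA-256. Whether Cathedral's hardware register is a TPM is my assumption, the stage images are placeholder bytes, and signature verification is left out entirely.

```python
import hashlib


def extend(register: bytes, stage_image: bytes) -> bytes:
    """Fold one stage into the register: new = H(old || H(stage))."""
    digest = hashlib.sha256(stage_image).digest()
    return hashlib.sha256(register + digest).digest()


def measure_chain(stages) -> bytes:
    """Measure each stage in boot order, starting from an all-zero register."""
    register = bytes(32)
    for image in stages:
        register = extend(register, image)
    return register


expected = measure_chain([b"firmware", b"bootloader", b"kernel"])
tampered = measure_chain([b"firmware", b"evil bootloader", b"kernel"])
print(expected == tampered)  # False
```

Software can only ever extend the register, never set it. To reach a chosen value after a bad stage was measured, an attacker would have to find a hash preimage. So the final value commits to every stage and to the order they ran in.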

This closes the loop with rollback in two directions. A measurement mismatch is a concrete, hardware-witnessed signal that boot went wrong, which is exactly when you want to fall back to the previous root. And it guards the dual hazard: an attacker must not be able to force a downgrade to an older, validly-signed but known-vulnerable version. Legitimate rollback and malicious downgrade look similar (both go backward), and the honest tension Cathedral takes on is allowing the first while forbidding the second, which it reasons about using the monotonic version facts the package system records. Going back is a right. Being shoved back is an attack. The system has to tell them apart.

(Two caveats I will not paper over. The firmware below the chain is hardware you do not control, so the very bottom of trust is a boundary Cathedral specifies against real silicon rather than owns. And measured boot proves what image booted, not the full set of authorities the live system later holds. The chain is a strong floor, not a total proof.)

The mental model

The progress bar exists because the old system has exactly one copy of itself and edits that copy in place. Every guarantee you want (atomicity, rollback, recoverability) has to be manufactured on top of a storage model that does not provide it: a second partition, a bootloader flag, a backup you restore by hand.

Cathedral’s storage model provides it for free. The system is immutable and named by a hash, so an update is a new tree beside the old one, a commit is a pointer reaching the new root, and a rollback is that pointer reaching the old one. There is never a half-written system to boot, because the boot chain only follows a committed reference.

An update that can brick you is a system that does not know how to change itself. A system that changes itself by swapping a pointer between two complete, immutable versions cannot brick on update, because there was never a moment when no complete version existed.