Four OS Problems, One Root Cause

Copying a large folder reads and rewrites every byte, though the bytes already exist on the disk. Installing a package runs vendor scripts as root, and the OS cannot say what they will touch. Updating the system overwrites it in place, so a power blip mid-write can leave it unbootable. Once it boots, nothing can prove to you which code actually ran. Four subsystems, four families of bugs, and they look completely unrelated. They share one dumb root cause: the OS names data by where it sits, and a location tells you nothing about what is actually there.

Name data by the hash of its contents instead. A hash is a short fingerprint of the bytes (Cathedral uses BLAKE3, a fast cryptographic hash), and it pins them so precisely it can serve as the name. report.pdf stops meaning “the thing at these disk sectors” and starts meaning “the thing whose contents hash to blake3:9f86d0....” That single change is what collapses those four problems into one, and the rest of this essay walks through them in turn.

The substrate

Content addressing buys you three properties at once, and every later trick is one of these pointed at a new problem.

Identity is intrinsic. Two pieces of data are the same object if and only if they have the same bytes, because the same bytes produce the same hash, which is the same address. Sameness is already true or already false before anyone looks.
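Here is a minimal sketch of intrinsic identity in Python. It is not Cathedral's code: SHA-256 from the standard library stands in for BLAKE3, and the ObjectStore class and its method names are hypothetical, made up for illustration.

```python
import hashlib


def address(data: bytes) -> str:
    # Stand-in hash: SHA-256 from the standard library, not BLAKE3.
    return "sha256:" + hashlib.sha256(data).hexdigest()


class ObjectStore:
    """Hypothetical in-memory store keyed by content address."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        addr = address(data)
        # Writing identical bytes again changes nothing: same bytes, same key.
        self.objects.setdefault(addr, data)
        return addr

    def get(self, addr: str) -> bytes:
        return self.objects[addr]


store = ObjectStore()
first = store.put(b"quarterly report")
second = store.put(b"quarterly report")   # an independent second write
other = store.put(b"quarterly report!")   # one byte longer
```

The two identical writes come back with the same address and occupy one slot in the store, while the slightly different content gets its own address. Nothing had to compare the two writes for that to happen.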

Content is immutable. A node, once written, cannot change, because changing its bytes would change its hash, making it a different node by definition. You never edit in place. You write new content with a new address and repoint the name.
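The write-then-repoint pattern can be sketched in a few lines of Python. This is an illustration under assumptions, not Cathedral's implementation: SHA-256 stands in for BLAKE3, and the objects and names tables are hypothetical.

```python
import hashlib


def address(data: bytes) -> str:
    # Stand-in hash: SHA-256 from the standard library, not BLAKE3.
    return "sha256:" + hashlib.sha256(data).hexdigest()


objects: dict[str, bytes] = {}   # address -> bytes, never overwritten
names: dict[str, str] = {}       # mutable name -> current address


def write(name: str, data: bytes) -> str:
    addr = address(data)
    objects.setdefault(addr, data)  # new content lands at a new address
    names[name] = addr              # the only thing that mutates is the name
    return addr


v1 = write("notes.txt", b"draft")
v2 = write("notes.txt", b"draft, revised")
```

After the second write the name points at the new address, and the old content is still there under its old one. The "edit" was an append plus a pointer change.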

A hash certifies what it names. Reading content back and rehashing it tells you whether you got the bytes you asked for. And because the address of a directory is computed from its children’s addresses, a single root hash transitively certifies every byte beneath it. One number vouches for an entire tree.

That structure (names pointing at hashes, directories whose hashes are built from their children’s hashes) is a Merkle DAG, a directed acyclic graph in which each node’s identity is derived from its children’s. Git is one. So is Cathedral’s filesystem. The same three properties solve problem after problem.
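To make the "one number vouches for an entire tree" claim concrete, here is a small Python sketch of a Merkle-style directory hash. The encoding (a "file" or "dir" tag, entries sorted by name) is an assumption chosen for the example, not Git's or Cathedral's actual format, and SHA-256 stands in for BLAKE3.

```python
import hashlib


def h(data: bytes) -> str:
    # Stand-in hash: SHA-256 from the standard library, not BLAKE3.
    return hashlib.sha256(data).hexdigest()


def file_address(content: bytes) -> str:
    return h(b"file\0" + content)


def dir_address(children: dict[str, str]) -> str:
    # Sort by name so the same set of entries always yields the same address.
    lines = [f"{name}\0{addr}\n" for name, addr in sorted(children.items())]
    return h(b"dir\0" + "".join(lines).encode())


def tree_address(tree) -> str:
    """tree is bytes (a file) or a dict of name -> tree (a directory)."""
    if isinstance(tree, bytes):
        return file_address(tree)
    return dir_address({name: tree_address(c) for name, c in tree.items()})


system = {"etc": {"hostname": b"cathedral\n"}, "bin": {"sh": b"\x7fELF"}}
root_before = tree_address(system)
bin_before = tree_address(system["bin"])

system["etc"]["hostname"] = b"tampered\n"   # change one leaf, deep in the tree
root_after = tree_address(system)
bin_after = tree_address(system["bin"])
```

Changing one leaf changes the root, so holding the old root hash is enough to detect the change. The untouched bin subtree keeps its address, which is what later lets an update share everything it did not modify.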

The filesystem

If a name points at content-by-hash, then copying a file is writing a second pointer to the same hash. No bytes move, so copying is constant time no matter the size. Identical files across the disk are automatically the same node, so deduplication is a consequence of intrinsic identity rather than a feature you enable. Editing a shared file is safe because content is immutable: you write new content and repoint your name, and anyone else pointing at the old content never notices. A snapshot is freezing a root hash. Integrity is free because the address is also the checksum, so bit rot is caught on read. (I wrote all of this up at length in a companion piece on why copying a terabyte should be instant.) The filesystem is just the first customer of the substrate.
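A toy version of those filesystem behaviors, in Python. TinyFS is a hypothetical name and an in-memory sketch under stated assumptions (whole-file objects, SHA-256 in place of BLAKE3), meant only to show copy, dedup, copy-on-write, and verify-on-read falling out of one table of names and one table of objects.

```python
import hashlib


class TinyFS:
    """Hypothetical sketch: paths map to addresses, addresses map to bytes."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}
        self.names: dict[str, str] = {}
        self.bytes_written = 0  # counts content bytes actually stored

    def write(self, path: str, data: bytes) -> None:
        addr = hashlib.sha256(data).hexdigest()
        if addr not in self.blobs:          # dedup: identical content stored once
            self.blobs[addr] = data
            self.bytes_written += len(data)
        self.names[path] = addr

    def copy(self, src: str, dst: str) -> None:
        # A copy is a second pointer to the same address; no content moves.
        self.names[dst] = self.names[src]

    def read(self, path: str) -> bytes:
        addr = self.names[path]
        data = self.blobs[addr]
        if hashlib.sha256(data).hexdigest() != addr:  # address doubles as checksum
            raise IOError(f"corrupt object {addr[:12]}")
        return data


fs = TinyFS()
big = b"x" * 1_000_000
fs.write("/home/a/big.bin", big)
fs.copy("/home/a/big.bin", "/home/b/big.bin")
written_after_copy = fs.bytes_written
fs.write("/home/a/big.bin", b"edited")      # repoints only a's name
```

The copy adds no stored bytes, and the later edit under /home/a leaves the reader under /home/b looking at the original content. If a stored object is ever altered behind the store's back, read raises instead of returning the wrong bytes.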

The package format

A software package today is a tarball plus a pile of scripts. On Linux the install hooks (preinst, postinst, and friends) are shell scripts that run as root with full authority, so install is arbitrary code execution. The system cannot tell you what a package will do, because the answer is “whatever that script decides to do.”

Drop the package onto the substrate and most of that vanishes. A package becomes a content-addressed tree: a root hash naming a directory of files, exactly like any other directory in the filesystem. Its dependencies are named by hash too, so a package’s identity transitively includes the identity of everything it is built from. That gives you pinned inputs and hermetic builds almost for free, much as Nix gets them: every input is fixed by hash, so “package X version Y” is a hash that is either bit-for-bit what was published or it is not, rather than a label someone typed.

It also kills a category of supply-chain attack. If the package is identified by the hash of its contents, you cannot quietly swap the bytes after the fact, because swapped bytes have a different hash and therefore a different name. The thing you installed is provably the thing that was signed.
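A rough sketch of a package identity that covers both its files and its dependencies, in Python. The manifest layout and the package_id and verify functions are hypothetical, invented for this example, and SHA-256 stands in for BLAKE3. Real package formats differ in the details.

```python
import hashlib
import json


def digest(obj) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal manifests hash equal.
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


def package_id(files: dict[str, bytes], deps: list[str]) -> str:
    manifest = {
        "files": {p: hashlib.sha256(d).hexdigest() for p, d in files.items()},
        "deps": sorted(deps),   # dependencies are named by their own ids
    }
    return digest(manifest)


def verify(published_id: str, files: dict[str, bytes], deps: list[str]) -> bool:
    # Recompute from the bytes on hand; any swap yields a different id.
    return package_id(files, deps) == published_id


libz = package_id({"lib/libz.so": b"zlib build 1"}, [])
app = package_id({"bin/app": b"app binary"}, [libz])

# A rebuilt dependency has a new id, and so does everything built on it.
libz_patched = package_id({"lib/libz.so": b"zlib build 2"}, [])
app_on_patched = package_id({"bin/app": b"app binary"}, [libz_patched])
```

Verification here is just recomputation: the installer hashes what it received and compares against the published id. Swapping a file, or swapping a dependency underneath an unchanged file, both show up as a different id.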

Software updates and rollback

Rollback is the failure everyone has actually lived through. An update goes wrong, the machine half-applies it, and you are left bricked with no path back. The classic update model makes this almost inevitable: stop the service, overwrite files on disk, restart, and hope the new code accepts the old state. Overwriting in place is destructive, so “go back” means “restore a backup,” which is a separate tool nobody tested recently.

Now make the system itself a content-addressed tree with a root hash. The entire system state is named by one hash at the top. An update builds a new tree (cheap, because everything unchanged is shared with the old one) and then switches the root pointer from the old hash to the new one, without overwriting the old files at all. That switch is a single atomic operation: the pointer names the new tree or the old one, never a half-written mixture.

That makes rollback the most boring operation in the system. The old root hash still names the old tree, intact and immutable, because the update never destroyed it. Rolling back is repointing the root at the previous known-good hash: atomic, instant, and cheap, the exact inverse of the operation that applied the update. The dreaded half-applied update stops being a failure mode, because there is no “half.” There is only which hash the root names.
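The pointer switch can be sketched with an ordinary file and an atomic rename. This is a minimal Python illustration under assumptions: the RootPointer class, the ROOT file name, and the placeholder root values are all hypothetical, and a production version would also need to sync the containing directory and keep its history somewhere durable.

```python
import os
import tempfile


class RootPointer:
    """Hypothetical on-disk pointer holding the current root hash."""

    def __init__(self, path: str) -> None:
        self.path = path

    def current(self) -> str:
        with open(self.path, encoding="ascii") as f:
            return f.read().strip()

    def switch(self, new_root: str) -> None:
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="ascii") as f:
            f.write(new_root + "\n")
            f.flush()
            os.fsync(f.fileno())     # make the new pointer durable first
        os.replace(tmp, self.path)   # atomic rename: old value or new, never a mix


workdir = tempfile.mkdtemp()
root = RootPointer(os.path.join(workdir, "ROOT"))
history: list[str] = []              # previous roots, most recent last


def apply_update(new_root: str) -> None:
    if os.path.exists(root.path):
        history.append(root.current())
    root.switch(new_root)


def rollback() -> None:
    root.switch(history.pop())       # same operation as an update, old hash


apply_update("aaa111")   # placeholder strings standing in for root hashes
apply_update("bbb222")
after_update = root.current()
rollback()
after_rollback = root.current()
```

Update and rollback go through the same switch call, which is the point: going back is not a separate recovery tool, it is the same pointer move with an older hash.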

This is essentially what OSTree does for Linux today (it calls itself “git for operating systems,” and the analogy is precise), and it is one of the genuinely good ideas in modern Linux distribution. The difference is the same one that runs through this whole essay, and I will come back to it.

Boot integrity and measured boot

When your computer boots, what is actually running? Firmware loads a bootloader, which loads a kernel, which loads the system, and at each hop, classically, the next stage is trusted blindly. Secure Boot improved this by checking a signature, but the check typically stops at the kernel and never covers the live set of privileged components that come after.

The substrate answers this directly, because a hash certifies what it names. Each boot stage can hash the next stage before handing off control, and extend that measurement into a tamper-resistant hardware register (a TPM, or equivalent silicon root of trust). The result is a chain of hashes describing exactly what booted, from firmware up through the kernel and the privileged components, where each link is the cryptographic fingerprint of the actual code that ran.

That chain is what measured boot and remote attestation are built on: you can prove to yourself, or to a remote party, precisely what software is running, because the measurement is the hash of the running bytes and the hash cannot lie about its content. And it ties straight back to the filesystem: the system is a content-addressed tree with a root hash, so the thing the boot chain measures, the thing the package system installed, and the thing an update repoints are all the same kind of object, named the same way. Disk-encryption keys can then be sealed to the measurement, so they unseal only on a boot whose hashes match a known-good state. Boot the wrong thing and the disk stays locked.
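The extend operation is small enough to show. In the Python sketch below, each stage is hashed and folded into a running register as new = H(old || measurement), which is the general shape of a TPM PCR extend. Everything else is a simplification I am assuming for illustration: a single register, SHA-256 throughout, and a software unseal function standing in for what real hardware does with sealed keys.

```python
import hashlib
import hmac
from typing import Optional


def extend(register: bytes, stage_code: bytes) -> bytes:
    measurement = hashlib.sha256(stage_code).digest()
    # The register can only be folded forward, never set directly.
    return hashlib.sha256(register + measurement).digest()


def measure_boot(stages: list[bytes]) -> bytes:
    register = bytes(32)             # starts as all zeros at reset
    for code in stages:
        register = extend(register, code)
    return register


def unseal(sealed_to: bytes, current: bytes, key: bytes) -> Optional[bytes]:
    # Simplified stand-in for hardware sealing: release the key only on a match.
    return key if hmac.compare_digest(sealed_to, current) else None


known_good = measure_boot([b"firmware", b"bootloader", b"kernel"])
same_boot = measure_boot([b"firmware", b"bootloader", b"kernel"])
tampered = measure_boot([b"firmware", b"bootloader", b"kernel+rootkit"])
reordered = measure_boot([b"bootloader", b"firmware", b"kernel"])

disk_key = b"example disk key"       # placeholder value
```

The same stages in the same order reproduce the same final value, while a modified kernel or a reordered chain produces a different one, so a key sealed to the known-good value stays locked.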

There is an honest cost to the trusted computing base (TCB) here that I will not paper over: this anchors trust in a hardware root and, ultimately, in the silicon vendor’s implementation of it, which has its own history of breaks. Content addressing gives you the chain; it does not absolve the bottom link.

The proof of what you are running

The last three sections are the same fact viewed from different angles. The package system records the hash of what was installed. The update system records the hash the root currently points at. The boot chain records the hash of what actually executed. Because it is one addressing scheme everywhere, these line up: take the hash the boot chain measured, compare it to the hash the package system published, and confirm that what is running is exactly what was installed, with nothing inserted in between. The cryptographic proof of what code is on the machine is the addressing scheme telling you the truth it was always telling, not a separate auditing product you buy.

The honest delta

I have named the prior art as I went, and I want to gather it here, because content addressing is not novel.

Git is a content-addressed Merkle DAG and has been for two decades. Nix addresses packages by the hash of their inputs and gets reproducibility from it. OSTree versions a whole operating system as a content-addressed tree and does atomic updates and rollback exactly as I described. IPFS built a distributed filesystem where the address of a file is its hash. Every one of these uses content addressing well, and several are genuinely beautiful pieces of engineering. If the idea were the whole story, the story would already be over.

The delta is where the idea lives.

In every system above, content addressing is a tool layered on top of an operating system that does not share the assumption. Git is content-addressed; the filesystem underneath it is not, so Git maintains its own object store as an application. Nix builds a content-addressed store inside a normal POSIX tree of files-at-locations. OSTree content-addresses the system files, but the running kernel, the package tooling, and the user’s documents are governed by the old byte-level, location-named model. Each is an island of good ideas floating on a substrate that disagrees with it, so each has to carry its own machinery: its own object store, its own integrity checks, its own notion of identity, re-implemented because the OS would not provide it.

Cathedral’s bet is to make content addressing the OS’s native substrate rather than a tool on top of one. There is one content-addressed object graph underneath everything. The filesystem is it. The package format is a tree in it. An update is a repointed root in it. The boot chain measures hashes of it. They are not four systems that each independently reinvented content addressing and now have to be kept in sync. They are one substrate, and these are four problems that turned out to be the same shape once you stopped solving them separately.

That collapse is the whole pitch. Today a serious system carries a filesystem, plus a snapshot tool, plus a dedup layer, plus a package manager, plus an update mechanism with its own rollback story, plus a measured-boot stack, each with its own format, its own failure modes, and its own seams where bugs and attacks live. Make content the way you name things, everywhere, and those seams close, because there was only ever one mechanism wearing six costumes.

It is not free. Making identity intrinsic means the system must maintain a map from every hash to where its bytes live, and that index has to be engineered carefully to stay bounded, the same dedup-table cost that has bitten content-addressed storage before. You pay for that index, deliberately, once.
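To get a feel for why that index needs care, here is a back-of-the-envelope calculation in Python. Every number in it is an assumption picked for illustration, not a Cathedral figure: a 32-byte digest, an 8-byte location, a 4-byte reference count per entry, and the worst case where no two blocks are identical.

```python
HASH_BYTES = 32       # a 256-bit digest
LOCATION_BYTES = 8    # assumed: a 64-bit on-disk offset
REFCOUNT_BYTES = 4    # assumed: a 32-bit reference count
ENTRY_BYTES = HASH_BYTES + LOCATION_BYTES + REFCOUNT_BYTES


def index_bytes(data_bytes: int, block_bytes: int) -> int:
    """Index size if every block is unique (no dedup savings at all)."""
    entries = -(-data_bytes // block_bytes)   # ceiling division
    return entries * ENTRY_BYTES


TIB = 2**40
GIB = 2**30
MIB = 2**20

small_blocks = index_bytes(TIB, 4 * 1024)     # 4 KiB blocks
large_blocks = index_bytes(TIB, 128 * 1024)   # 128 KiB blocks

print(f"4 KiB blocks:   {small_blocks / GIB:.0f} GiB of index per TiB")
print(f"128 KiB blocks: {large_blocks / MIB:.0f} MiB of index per TiB")
```

Under these assumptions a tebibyte of unique data needs about 11 GiB of index at 4 KiB blocks and about 352 MiB at 128 KiB blocks. The granularity you hash at is the main lever, and it trades index size against how much duplication you can find.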

The mental model fits in a sentence: name data by the hash of its contents, do it everywhere instead of in one tool, and the filesystem, the package, the update, the rollback, and the proof of what you are running stop being separate problems.