The Compiler Should Catch Your Protocol Bugs

The worst bug I ever chased surfaced exactly once. Production, peak load, two services disagreeing about whose turn it was to write a record. By the time the pager went off the evidence was already gone, overwritten by the next request. It took a week to reconstruct, and the punchline was that a model checker would have found it in seconds, if anyone had ever connected the spec to the code.

That “if” is the whole essay.

The legacy approach actually works

Formal specification has a reputation for being academic vaporware, and that reputation is wrong.

TLA+ catches real bugs. You describe your system as states and actions, write the invariants that must always hold, and a model checker explores the reachable state space looking for a sequence of steps that breaks one. It finds the interleavings no human would think to try: the message that arrives after the timeout, the retry that lands while the original is still in flight, the leader election that elects two leaders. AWS used it on S3 and DynamoDB and found design defects that would have survived any amount of testing, because testing samples the state space and a model checker tries to cover it. Distributed systems that absolutely cannot be wrong lean on it, and it earns its keep.

The technique is sound, the math is sound, the tooling is sound. Here is what fails.

The spec and the code are two different artifacts, and they drift. You write the TLA+, check it, prove your protocol correct, then go implement it in Go or Rust or C++. From that moment the specification and the running system are connected by nothing but human discipline and good intentions. Someone adds a fast path. Someone reorders two writes for performance. Someone handles an error case the spec never modeled. Every one of those edits is a place where the verified model and the shipped code quietly diverge, and nothing checks that they still describe the same machine. You verified a document and shipped a program, and the proof is about the document.

This is the same disease as a stale comment, scaled up to your entire concurrency design. The most rigorous part of your project is the part guaranteed to rot, because it lives outside the thing that actually runs.

State machines deserve to be first-class

I have argued this before in the context of games, and it is the same argument here. We build state machines constantly: menu flows, connection handshakes, retry-with-backoff, turn-based protocols across a client and a server. And we almost always build them as second-class citizens: an enum for the current state, a pile of logic in a tick or event handler that reads the enum and decides what to do next, transitions smeared across the codebase as scattered assignments to a global.

When the state machine is second-class, code drives the state. Set a breakpoint on a transition and the call stack is an incoherent mess leading back into some event dispatcher. The machine exists only in your head, reconstructed from grepping for every place that writes the state variable, and because it only exists in your head, you cannot check it. There is no machine, only the shadow one cast by the code.

The inversion is to make the state drive the code. The machine is a real, declared structure: named states, explicit transitions, the graph itself a thing the language knows about rather than an emergent property of where you happened to assign an enum.

Omega takes that seriously. Machines, states, and transitions are language constructs, not a pattern you reimplement by hand.

machine Connection::run(&mut self) {
    transition {
        _ -> connecting()
    }

    state connecting(&mut self) {
        self.socket.begin_open();

        transition self.socket.poll() {
            Status::Ready    -> handshake()
            Status::Refused  -> failed()
            Status::Pending  -> connecting()
        }
    }

    state handshake(&mut self) {
        transition self.negotiate() {
            Result::Ok   -> live()
            Result::Bad  -> failed()
        }
    }

    state live(&mut self) { }
    state failed(&mut self) { }
}

A transition is a jump to another state in the same machine. It does not push a frame or store a return address, and the arms are the guarded next-states. This is the actual control flow. Because it is declared, the compiler can see the whole graph: every state, every edge, every reachable path. Set a breakpoint on a state and you know exactly what it means, because the state is a real place and not a value some handler happened to write.

The model and the implementation are the same artifact

When the state machine is first-class, the model you would verify and the code you ship are one artifact. The states are the model’s states. The transitions are the model’s guarded actions. There is no separate document to keep in sync, because the thing the checker reads is the thing the CPU runs.

The Omega proof model maps onto TLA+-style action checking almost line for line. Machine fields are the variables. State parameters are the action inputs. Transitions are the guarded next-state relations. Value constraints and contracts are the invariants. This overlap is deliberate: take the thing TLA+ does well, checking a state machine against its invariants, and aim it at the running machine instead of a paper copy of it.

And transitions carry proof obligations like everything else in the language. Every transition into a typed state has to provide arguments compatible with that state. Every dispatch arm establishes the assumptions its target needs. An owned value that dies on a transition edge must be cleaned up before the jump, and that cleanup becomes a fact the target state can rely on. Contracts ride along:

machine Player::enter_combat(&mut self)
requires
    self.player in Player::Alive
ensures
    self.mode in GameMode::Combat
{
}

The caller must satisfy the requirement, the body must establish the guarantee, and the checker carries those facts forward across the edges of the graph. The compiler does the bookkeeping a TLA+ invariant would do, except about the real states, in the real code, every build.

TLA+ already has an excellent model checker, so the win is not another checker. The win is collapsing the gap between the model you verify and the code you ship, so the verification cannot be about a stale document, because there is no document. There is only the machine.

The legacy promise was: write a spec, verify it, then faithfully implement it and stay faithful forever through every refactor under deadline. Omega’s promise is smaller and therefore keepable: there is nothing to stay faithful to, because the spec is the implementation.

If your protocol has a bug, the compiler is looking at the protocol. So let it find it.