Proof-oriented Event Sourcing (Experimental)

Lokhesh Ujhoodha avatar
Lokhesh Ujhoodha

Disclaimer: The product is at an experimental stage. If you find it useful, your feedback would be invaluable at: https://github.com/kurrent-io/poes/

Formal Verification Guarantees Made Accessible

Software systems fail. They fail in production, they fail silently, and they fail in ways that unit tests never anticipated. The gap between “tested” and “correct” is where bugs live and in domains like finance, healthcare, and infrastructure, those bugs carry real consequences.

Formal verification promises to bridge this gap: instead of checking a handful of examples, prove that a property holds for all possible inputs. But formal verification has historically been impractical for production software. The tools are specialized, the specifications are separate from the code, and the proof effort scales poorly.

POES (Proof-Oriented Event Sourcing) proposes a different approach. By combining event sourcing’s inherent structure with formal verification techniques, it narrows the proof domain to something tractable and makes verification a natural byproduct of writing domain logic, not a separate activity.

A Coding Agent First Library

POES is designed so that AI coding agents can verify the code they generate. An agent reads a domain specification, produces an event-sourced aggregate, and runs POES verification in the same step. The result is deterministic and machine-readable: either every invariant holds across all reachable states, or the agent gets a minimal counterexample pinpointing exactly what broke. There is no ambiguity to hallucinate around.

We have internally benchmarked different iterations of the product using Liquid Haskell, TLA+ transpilation, FSharp and Lean4. However, we have noticed that LLMs are just better at writing code than writing proof and in Python (a close second was Typescript). This is why we chose to provide a first build in Python.

This verification loop changes how agents interact with existing code. When a future change breaks a previously proven property, the agent knows immediately because the proof that passed yesterday now fails, and the counterexample shows why. Proofs accumulate as the system grows, forming a regression anchor that catches semantic errors, not just syntactic ones. An agent can refactor with more confidence because POES will flag any behavioral change that violates a specified property.

The developer’s role shifts from writing event sourcing machinery to describing business rules and reviewing the proof of work document that POES generates. Guards express when transitions are allowed. Invariants express what must always hold. Post-conditions express what each transition must accomplish. The agent handles the rest: wiring up state, generating transitions, running proofs. The final step in the verification chain is always a human reading the proof of work. The machine proves properties hold; the human confirms the right properties were specified. This is the irreducible responsibility that no tool can automate away, and POES makes it legible by producing a structured document that maps every invariant, guard, and temporal property to its verification result.

This expands the market for formally verified event sourcing from “developers who know ES and formal methods” to “developers who can describe their domain.”

The Shared Challenge: Specification Completeness

Formal verification provides mathematical certainty that a system satisfies its specification. Tools like TLA+, Coq, Z3, and Liquid Haskell can establish guarantees no amount of testing can match: safety properties (“something bad never happens”), type-level invariants (“invalid states are unrepresentable”), and transition correctness (“every state change preserves all invariants”). POES builds on this tradition, applying the same principles to event-sourced aggregates.

However, every verification approach (TLA+, Coq, POES included) shares the same foundational challenge: you can only prove what you specify. An invariant you didn’t write won’t be caught by any tool. A transition you didn’t model won’t be checked. This is not a weakness of formal methods; it is the nature of verification itself.

The specification-implementation gap. In traditional formal methods, the specification is a separate artifact: a TLA+ model, a Coq proof, a Z notation document. Keeping this in sync with the production code is a persistent source of exactly the bugs verification was supposed to prevent. POES addresses this by embedding the specification in the production code: guards, invariants, and post-conditions are Python lambdas operating on the same data structures used at runtime.

Environmental assumptions. Every formal model makes assumptions about its environment: message delivery, clock synchronization, hardware reliability. These assumptions hold across all verification approaches. POES delegates environmental concerns (ordering, durability, concurrency) to the event store (KurrentDB), keeping the proof domain focused on what we can verify: the pure logic of state transitions.

Why Event Sourcing Narrows the Proof Domain

Most software is hard to verify because state is everywhere, unbounded and most importantly mutable. Immutability is important because we need pure functions to form the basis of verification i.e. given a particular state, every input to a function should have the same output and produce the same resulting state.

An event-sourced system has a single, precise definition of state:

State₀ = initial
Stateₙ = fold(apply, State₀, [e₁, e₂, ..., eₙ])

Every state is the result of folding a pure function over an ordered sequence of immutable events. This is not an approximation. It is the actual runtime behavior. The apply function is deterministic, side-effect-free, and referentially transparent. The same events always produce the same state.

No side effects means the function’s behavior is fully determined by its inputs. No mocking, no environment simulation. The apply function is the production code, eliminating the specification-implementation gap.

Algebraic data types for events mean the domain of inputs is finite and enumerable. State is a record of values that can be represented symbolically. This keeps the state space tractable for exploration.

Sequential event streams per aggregate eliminate concurrency within the proof domain. We delegate the responsibility for concurrent access, ordering guarantees, and durability to the event store (KurrentDB), removing environmental concerns from the verification scope entirely.

The narrowing effect is important:

The proof domain shrinks from “arbitrary programs” to “pure functions over algebraic data types,” precisely where automated theorem provers excel. Guards constrain which transitions can fire. Invariants constrain which states are valid. Post-conditions constrain what each transition must accomplish. Together, these form a complete specification that can be checked mechanically.

The insight behind POES is that event sourcing doesn’t just provide an architectural pattern. It provides a proof domain. And by keeping the specification inside the production code (as Python lambdas operating on the same data structures used at runtime), POES avoids the specification-implementation gap that undermines traditional formal methods.

Four Layers of Verification

POES verifies aggregates through four complementary techniques. The first three verify the model; the fourth verifies production data against the model. Together they provide guarantees that approach what formal proof delivers, without requiring a theorem prover.

Layer 1: Property-Based Testing

Traditional unit tests check specific examples: “deposit 100 into an account with balance 200, expect balance 300.” This verifies one point in the input space. Property-based testing, powered by Hypothesis, verifies classes of behavior.

For each (transition, invariant) pair, POES generates hundreds of random states from declared strategies, filters to those where the invariant holds and the guard passes, applies the transition, and checks that the invariant is preserved:

@dataclass(frozen=True)
class BankAccount:
    balance: int
    is_open: bool

Check.define("BankAccount", BankAccount)
    .with_field("balance", st.integers(0, 1000))
    .with_field("is_open", st.booleans())
    .with_invariant("BalanceNonNegative", lambda s: s.balance >= 0)
    .with_parametric_transition("Deposit",
        params={"amount": st.integers(1, 500)},
        guard=lambda s, amount: s.is_open and s.balance + amount <= 1000,
        apply=lambda s, amount: replace(s, balance=s.balance + amount),
        ensures=lambda before, after, amount: after.balance == before.balance + amount)

Hypothesis isn’t just generating random numbers. It’s systematically exploring edge cases (zero, one, maximum values, boundary conditions), and when it finds a failure, it shrinks the counterexample to the minimal reproducing case. A failing test doesn’t say “it broke somewhere”; it says “deposit of 501 into an account with balance 500 violates BalanceBounded.”

For parametric transitions, Hypothesis generates both random states and random parameter values. This proves that the property holds not just for one fixed deposit amount, but for arbitrary amounts within the declared range. The difference between testing deposit(100) and testing deposit(amount) for all valid amounts is the difference between checking one point and checking a region.

This layer is probabilistic. 500 examples don’t constitute a mathematical proof. But Hypothesis’s coverage-guided generation makes it extraordinarily effective at finding bugs. In practice, if a property holds for 500 Hypothesis-generated examples with shrinking, the remaining failure cases (if any) are exotic enough that they rarely matter in production.

Layer 2: Exhaustive State Exploration

Property testing verifies that transitions preserve invariants from arbitrary starting states. State exploration answers a different question: starting from the actual initial state, can any sequence of valid transitions reach a state that violates an invariant?

POES performs breadth-first search from the initial state, applying every enabled transition at every reachable state:

Frontier: [BankAccount(balance=0, is_open=True)]

Step 1: BankAccount(balance=0, is_open=True)
  → Deposit(42)  → BankAccount(balance=42, is_open=True)  ✓
  → Deposit(250) → BankAccount(balance=250, is_open=True) ✓
  → Close        → BankAccount(balance=0, is_open=False)  ✓
  → Withdraw: guard fails (0 < 50), skip

Step 2: BankAccount(balance=42, is_open=True)
  → Deposit(...)  → various new balances ✓
  → Withdraw: guard fails (42 < 50), skip
  ...

Total: 22 reachable states, all satisfy all invariants ✓

This is exhaustive within the reachable state space. Every state that can actually occur in production (given the modeled transitions) is checked. If the exploration completes without hitting limits, the result is not probabilistic. It is a proof that no reachable state violates any invariant and no reachable transition violates any post-condition.

The key insight is that event sourcing (along with DDD actually) makes the reachable state space small enough to explore. A bank account with balance ∈ [0, 1000] and is_open ∈ has 2,002 possible states, but only 22 are reachable from balance=0, is_open=true through the defined transitions. Guards act as natural pruning, constraining the state graph to the states that matter.

For aggregates with larger state spaces, POES applies configurable limits (default: 1,000,000 states for BFS, 10,000 for temporal analysis). When limits are reached, the result is “safe within explored region” rather than “proven safe everywhere,” a graceful degradation that still provides stronger guarantees than testing alone.

Layer 3: Temporal Property Checking

Safety properties say “nothing bad happens.” But many real requirements are about liveness: “something good eventually happens.” A payment system must eventually settle. A reservation must eventually be confirmed or cancelled. A workflow must not get stuck.

POES builds the full state graph and uses Tarjan’s algorithm to find strongly connected components (cycles). It then checks three kinds of temporal properties:

Eventually(P): From any infinite execution, a state satisfying P is eventually reached. POES checks this by verifying that no cycle in the state graph avoids P entirely.

.with_eventually("CanClose", lambda s: not s.is_open)

If every cycle in the state graph contains at least one state where is_open is false, the property holds. If there’s a cycle where the account stays open forever with no path to closing, verification fails.

LeadsTo(P, Q): Whenever P becomes true, Q eventually becomes true. POES checks that from any state satisfying P, it’s impossible to reach a cycle that avoids Q.

.with_leads_to("BalanceCanDrain",
    trigger=lambda s: s.is_open and s.balance > 0,
    response=lambda s: s.balance == 0)

AlwaysEventually(P): P holds infinitely often on every infinite execution. Every cycle must pass through a state satisfying P.

.with_always_eventually("CanDeposit", lambda s: s.is_open)

These temporal checks are not approximations. On the explored state graph, they are exact: Tarjan’s SCC decomposition identifies every cycle, and every cycle is checked. If the state graph is fully explored (no limit reached), temporal verification is a proof.

Layer 4: Persistence Verification

Layers 1–3 prove that the apply function is correct in theory. But production data might violate invariants due to bugs in command handling, serialization issues, manual database edits, or schema migrations. The fourth layer closes this gap by replaying actual events from the database and verifying that every intermediate state satisfies every invariant.

result = builder.verify_persistence(
    client=client,
    stream_prefix="BankAccount",
    initial=BankAccount(balance=0, is_open=True),
    apply=apply,
    event_from_json=event_from_json,
    stream_ids=["acc-1", "acc-2"],  # or None for all streams
)

For each stream, POES replays the event history through apply, checking all invariants after every event. It fails fast on the first violation, reporting the exact stream, event position, event type, and which invariant was violated:

  ┌─ Persistence Verification ────────────────────────────────────────────────┐
  │  ✓ BankAccount-acc-1  (3 events)
  │  ✓ BankAccount-acc-2  (7 events)
  │  ✗ BankAccount-acc-3  event #4 (Withdrawn) violates BalanceNonNegative
  └─────────────────────────────────────────────────────────────────────────────┘

When stream_ids is omitted, POES reads all streams matching the prefix in a single call, groups them by stream name, and verifies each independently. This makes it practical to verify an entire aggregate type’s production data, not just sampled streams.

This layer transforms POES from a design-time tool into a runtime verification system. The model proves that apply should preserve invariants; persistence verification proves that your production data actually does.

What This Proves (and What It Doesn’t)

Let’s be precise about the guarantees.

What POES proves:

  1. Invariant preservation under property testing. For hundreds of randomly generated (state, parameter) combinations, every transition preserves every invariant. Hypothesis’s shrinking and edge-case generation make this a strong probabilistic guarantee, not a proof, but far beyond what hand-written tests achieve.

  2. Invariant preservation over all reachable states. If BFS exploration completes within limits, every state reachable from the initial state through any sequence of transitions satisfies all invariants. This is exhaustive within the model.

  3. Post-condition correctness. Every transition’s ensures clause is checked both probabilistically (Hypothesis) and exhaustively (BFS). If ensures says “the balance increases by exactly the deposit amount,” that’s verified everywhere.

  4. Liveness properties. Temporal properties are checked over the complete reachable state graph. If exploration completes, these are proofs: no cycle violates Eventually, LeadsTo, or AlwaysEventually.

  5. Production data consistency. Persistence verification replays actual events from the database and checks that every intermediate state satisfies all invariants. This catches bugs that the model can’t: serialization errors, manual edits, migration issues, or command-handler bugs that produce events the model wouldn’t allow.

What POES does not prove:

  1. The model matches the real system. POES verifies the apply function and its guards. If the actual system has additional state, side effects, or transitions not modeled, those aren’t covered. This is the classic “specification gap,” but event sourcing minimizes it because the apply function is the production code. Persistence verification further narrows this gap by checking real data against the model’s invariants.

  2. Unbounded state spaces. If the state space exceeds exploration limits, BFS reports partial results. Property testing still provides probabilistic coverage, but the exhaustive guarantee is weakened.

POES provides verification guarantees that approach formal proof within a well-defined scope. For the apply function, the pure, deterministic heart of an event-sourced aggregate, the guarantees are strong. Property testing provides high-confidence probabilistic coverage. State exploration provides exhaustive coverage within reachable bounds. We recommend a DDD approach to divide your state space into smaller parts if you regularly encounter large state space exploration delays during development. Temporal checking provides exact liveness proofs on the explored graph. Persistence verification extends these guarantees to production data.

The Specification Is the Code

With POES there isn’t any single verification technique, i.e., the specification lives inside the code. There’s no separate TLA+ model to keep in sync. No Z3 notation that drifts from the implementation. The guards, invariants, and post-conditions are Python lambdas that operate on the same data structures used in production.

result = (
    Check.define("BankAccount", BankAccount)
    .with_initial(BankAccount(balance=0, is_open=True))
    .with_field("balance", st.integers(0, 1000))
    .with_field("is_open", st.booleans())
    .with_invariant("BalanceNonNegative", lambda s: s.balance >= 0)
    .with_invariant("BalanceBounded", lambda s: s.balance <= 1000)
    .with_parametric_transition("Deposit",
        params={"amount": st.integers(1, 500)},
        guard=lambda s, amount: s.is_open and s.balance + amount <= 1000,
        apply=lambda s, amount: replace(s, balance=s.balance + amount),
        ensures=lambda before, after, amount: after.balance == before.balance + amount)
    .with_transition("Withdraw",
        guard=lambda s: s.is_open and s.balance >= 50,
        apply=lambda s: replace(s, balance=s.balance - 50),
        ensures=lambda before, after: after.balance == before.balance - 50)
    .with_eventually("CanClose", lambda s: not s.is_open)
    .with_leads_to("BalanceCanDrain",
        trigger=lambda s: s.is_open and s.balance > 0,
        response=lambda s: s.balance == 0)
    .run()
)

This is simultaneously:

  • A domain model (the BankAccount dataclass and its transitions)
  • A specification (invariants, guards, post-conditions, temporal properties)
  • A test suite (Hypothesis property tests, BFS exploration, SCC analysis)

When you change a guard, verification reruns against the new guard. When you add a transition, it’s automatically tested against all existing invariants. When you add an invariant, it’s checked against all existing transitions and all reachable states. The specification and the implementation evolve together because they are the same artifact.

Verification Output

Running POES verification produces a structured report:

╔══════════════════════════════════════════════════════════════════════════════╗
║  VERIFYING: BankAccount
╚══════════════════════════════════════════════════════════════════════════════╝

  Fields: 2 | Max examples: 500 | Invariants: 2 | Transitions: 2

  ┌─ Property Testing ──────────────────────────────────────────────────────────┐
  │  ✓ Deposit/BalanceNonNegative  (500 examples)
  │  ✓ Deposit/BalanceBounded  (500 examples)
  │  ✓ Withdraw/BalanceNonNegative  (500 examples)
  │  ✓ Withdraw/BalanceBounded  (500 examples)
  └─────────────────────────────────────────────────────────────────────────────┘

  ┌─ State Exploration ─────────────────────────────────────────────────────────┐
  │  ✓ All 22 reachable states safe
  └─────────────────────────────────────────────────────────────────────────────┘

  ┌─ Temporal Properties ──────────────────────────────────────────────────────┐
  │  ✓ Eventually(CanClose)
  │  ✓ LeadsTo(BalanceCanDrain)
  └─────────────────────────────────────────────────────────────────────────────┘

  ✓ VERIFIED: All 6 proofs passed, 22 states explored (150ms)

Each line is a verifiable claim. “Deposit/BalanceNonNegative (500 examples)” means Hypothesis generated 500 valid (state, amount) pairs, applied the deposit transition, and confirmed the balance remained non-negative in every case. “All 22 reachable states safe” means BFS exhaustively visited every state reachable from the initial state and confirmed all invariants hold. “Eventually(CanClose)” means Tarjan’s algorithm found no cycle in the state graph where the account can never close.

When connected to a database, persistence verification adds a fourth section:

  ┌─ Persistence Verification ────────────────────────────────────────────────┐
  │  ✓ BankAccount-acc-1  (3 events)
  │  ✓ BankAccount-acc-2  (7 events)
  └─────────────────────────────────────────────────────────────────────────────┘
  ✓ Persistence verified: 2 streams, 10 events (42ms)

When verification fails, POES provides counterexamples: the specific state and transition that violated a property, shrunk to the minimal reproducing case.

Where This Sits in the Verification Landscape

It’s worth positioning POES relative to existing approaches:

Unit tests check individual examples. They’re fast and familiar, but they only verify the cases you thought to write. A passing test suite says “these specific scenarios work,” not “the system is correct.”

Property-based testing (Hypothesis, QuickCheck) generates random inputs and checks properties. This is a significant step up, generating hundreds of examples including edge cases. But it’s still sampling from the input space, not exhausting it.

Model checking (TLA+, Alloy, SPIN) exhaustively explores state spaces from a formal specification. This provides strong guarantees, but the specification is separate from the code. Keeping the model in sync with the implementation is a persistent source of bugs.

Theorem proving (Coq, Isabelle, Lean, F*) provides mathematical proof. This is the gold standard for correctness, but the effort required is measured in person-months, and the proofs are in a separate language that most developers can’t read.

POES occupies a pragmatic middle ground:

ApproachCoverageSpec LocationEffort
Unit testsExamplesIn test filesLow
Property testingProbabilisticIn test filesLow-Medium
POESProbabilistic + Exhaustive + TemporalIn domain codeMedium
Model checkingExhaustive (separate model)Separate specHigh
Theorem provingMathematical proofSeparate proofVery High

POES achieves near-model-checking guarantees within the event sourcing domain by exploiting the structure that event sourcing already provides. The apply function is the model. The guards and invariants are the specification. No translation step required.

Practical Implications

For safety-critical domains: POES won’t replace DO-178C certification or medical device verification. But for financial systems, booking engines, inventory management, and similar domains where correctness matters but full formal verification is impractical, POES provides evidence of correctness that’s far stronger than test suites.

For development workflow: Verification is fast enough to run in CI, fast enough to run during development. The feedback loop is immediate: change a guard, see whether the new guard preserves your invariants.

For specification: The hardest part of formal methods has always been writing the specification. POES makes specification natural: guards are business rules (“you can only withdraw if the balance is sufficient”), invariants are domain constraints (“balance must never go negative”), post-conditions are expectations (“deposit increases balance by exactly the deposited amount”). These are things domain experts already know. POES just gives them a place to live where they’ll be checked automatically.

For AI-assisted development: As AI agents write more production code, the question shifts from “can the developer verify this?” to “can the system verify this?” POES provides a machine-checkable correctness contract. An AI can generate an event-sourced aggregate and POES can verify it, with no human review required for the properties that are specified. The specification becomes the interface between human intent and machine implementation.

Conclusion

The gap between “tested” and “correct” will never fully close for real-world software. But it can be narrowed dramatically. POES demonstrates that for event-sourced systems, a growing architectural pattern in domains where correctness matters, formal verification techniques can be applied with production-practical effort.

The key insight is structural: event sourcing provides pure functions, bounded state spaces, and explicit transitions. These are exactly the properties that make verification tractable. POES doesn’t bolt verification onto arbitrary code. It recognizes that event sourcing already encodes the structure that verification needs and provides the tools to exploit it.

The result is not a formal proof in the mathematical sense. It’s something more practical: a layered verification system where property testing catches edge-case bugs, state exploration provides exhaustive coverage of reachable states, temporal checking proves liveness, and persistence verification confirms that production data actually satisfies the model’s invariants. All of this is specified in the same code that runs in production, executable in seconds, and yields machine-verifiable results.

For systems where bugs carry real consequences, that’s a meaningful guarantee.