Composing Deterministic-Simulation Testing Across Heterogeneous Substrates
Abstract
Deterministic-simulation testing, the discipline of running a system under fault injection in a reproducible simulator (exemplified by FoundationDB and TigerBeetle’s VOPR), is increasingly recognized as table stakes for substrates that ship compliance, audit, or replay guarantees. The standard treatment assumes a single substrate: one language, one runtime, one allocator, one scheduler. Real systems frequently span multiple substrates with incompatible determinism profiles: a native data plane where determinism is achievable through static allocation and controlled IO, and a managed-runtime control plane where the runtime’s own scheduling and garbage collection preclude byte-exact reproducibility. This note describes a compositional technique for extending deterministic-simulation guarantees across such a boundary by reducing the determinism guarantee at each layer to a cryptographic fingerprint computed at the layer interface. The technique was developed for WunderOS, an agentic operating system whose substrate spans Zig (deterministic-simulation-tested under a VOPR-style harness) and BEAM (covered by language-native property and chaos testing). The composition produces a stronger guarantee than either substrate alone: same seed, same execution, same trace events, same Merkle root, same chain head, across the heterogeneous boundary, without requiring uniform substrate discipline.
1. Motivation
Deterministic-simulation testing pairs a normal test harness with a runtime that controls every source of non-determinism (clock, scheduler, IO, fault injection) so that test runs are reproducible from a seed. FoundationDB pioneered the technique at scale; TigerBeetle’s VOPR continues the lineage with deterministic execution as a first-class architectural commitment. The discipline catches concurrency and storage-fault bugs that are essentially unfindable by other means, and it converts production incidents into reproducible test cases via seed replay. For systems that ship audit or compliance guarantees, deterministic-simulation testing is approaching table-stakes status; not having it is strictly more expensive than having it.
The standard treatment assumes the system under test is single-substrate. Every byte in the test execution is produced by code under the simulator’s control. This assumption holds cleanly for systems written end-to-end in a language with manual allocation, predictable scheduling, and explicit IO (Zig, C, certain Rust subsets) but fails for systems that include a managed-runtime layer with garbage collection and runtime-controlled scheduling. BEAM, the JVM, and the CLR all preclude byte-exact reproducibility in their default execution modes; the runtime’s own decisions about allocation, scheduling, and GC are not part of the test harness’s control surface.
Real systems frequently span the boundary. Native code is the right substrate for hot computational paths where determinism is achievable through architectural commitment. Managed runtimes are the right substrate for orchestration, supervision, and concurrent IO where the runtime’s facilities (BEAM’s OTP, the JVM’s concurrency primitives) are exactly what the workload wants. A heterogeneous-substrate system gets the right tool for each layer at the cost of an impedance mismatch in test discipline: the native layer admits VOPR-style determinism, the managed layer does not.
This note describes how to compose deterministic-simulation guarantees across that mismatch. The technique does not require deterministic execution on the managed-runtime side, since that is structurally infeasible without rebuilding the runtime. Instead, the technique reduces the determinism guarantee to a cryptographic fingerprint computed at the layer interface, and validates the fingerprint under VOPR control on the native side. The managed-runtime side remains under language-native test discipline (property tests, chaos tests, generative testing). The composition produces a guarantee that is stronger than either substrate alone provides, and matches the audit-grade expectation of regulated-customer deployments.
2. Background
2.1 Deterministic-simulation testing
A deterministic-simulation test harness controls all sources of non-determinism in a system under test such that execution is a pure function of (code, seed, input). Time is virtual, advanced by the simulator; IO is intercepted and simulated; concurrency is scheduled deterministically; faults are injected at controlled positions. Two runs against the same seed produce byte-identical execution traces. A bug discovered in production is reproduced by recording the seed and replaying. The technique is associated most prominently with FoundationDB; TigerBeetle’s VOPR (the “Viewstamped Operation Replicator”, a simulator embedded in the TigerBeetle codebase) is the canonical contemporary reference and the proximate inspiration for the work described here.
Determinism in this sense requires careful substrate discipline: static allocation, single-threaded execution per simulator instance, controlled access to system services. Most languages can be written this way with effort; some languages and runtimes structurally preclude it. The cost of the discipline is real but bounded; the benefit is a class of bug-finding capability that is otherwise inaccessible.
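The "pure function of (code, seed, input)" property can be illustrated with a toy simulator. The sketch below is a minimal Python illustration, not VOPR and not WunderOS code; the task names, fault rate, and tick ranges are assumptions made up for the example. Every source of non-determinism (time, scheduling, fault injection) is drawn from a single seeded generator, so the trace depends only on the seed.

```python
import random
from typing import List, Tuple

def simulate(seed: int, steps: int = 20) -> List[Tuple[int, str, str]]:
    """Run a toy two-task system under a seeded simulator and return its trace."""
    rng = random.Random(seed)               # the only source of randomness
    now = 0                                 # virtual clock, in ticks
    progress = {"flusher": 0, "writer": 0}  # hypothetical tasks
    trace = []
    for _ in range(steps):
        now += rng.randint(1, 5)            # the simulator advances virtual time
        task = rng.choice(sorted(progress)) # scheduling decision comes from the seed
        if rng.random() < 0.2:              # injected IO fault (assumed 20% rate)
            trace.append((now, task, "io_error"))
        else:
            progress[task] += 1
            trace.append((now, task, "ok:%d" % progress[task]))
    return trace
```

Calling simulate twice with the same seed returns equal traces, so a failure seen once can be replayed from its seed alone.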
2.2 Sound trace recording
A separate but related discipline is sound trace recording in distributed and concurrent systems. The RIARC algorithm (Aceto, Attard, Francalanza, Ingólfsdóttir, 2024) gives a decentralized instrumentation method that guarantees trace soundness, that is, that the reported event sequence reflects an actual execution despite asynchronous loss and reordering, through next-hop routing and rearrangement at monitors. Soundness in this sense is a formal property: the trace stream is not a sampled approximation of execution but a verifiable witness to it. For audit and compliance applications, soundness is the relevant correctness property; deterministic replay of the system’s own execution is a stronger property and not always available.
2.3 Cryptographic anchoring
Trace records can be anchored cryptographically by chaining each record’s hash into its successor (a Merkle DAG over the trace WAL) and signing chain heads with a long-term key. Tamper-evidence at the storage layer is a standard application; the technique generalizes. A chain head functions as a fingerprint of the entire history that produced it: any change to any prior record changes the chain head. For a deterministic execution, the chain head is also a fingerprint of the execution itself.
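The fingerprint property of a chain head can be seen in a few lines. This is a minimal Python sketch under stated assumptions: SHA-256 from the standard library stands in for whatever hash a real deployment uses, the all-zero genesis value is an arbitrary choice, and signing of the head is omitted.

```python
import hashlib
from typing import List

GENESIS = b"\x00" * 32  # assumed genesis value for the sketch

def chain_head(records: List[bytes]) -> bytes:
    """Fold each record's hash into its successor and return the final head."""
    head = GENESIS
    for record in records:
        record_hash = hashlib.sha256(record).digest()
        head = hashlib.sha256(head + record_hash).digest()
    return head
```

Changing, removing, or appending any record yields a different head, which is why comparing two heads is enough to compare two histories.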
3. The composition
The technique combines deterministic-simulation testing on the native side, sound trace recording with cryptographic anchoring across the whole system, and reduction of the cross-layer determinism guarantee to fingerprint comparison at the layer boundary.
3.1 Architecture
The system under test has two substrates. Substrate A is native, statically allocated, deterministic under simulator control; this is the substrate VOPR exercises. Substrate B is a managed runtime; this substrate runs under its own language-native test discipline (property tests, chaos tests, generative testing) and is not VOPR-controllable. The substrates communicate through a defined interface — in WunderOS’s case, NIF calls between BEAM (Substrate B) and Zig (Substrate A).
A trace recording layer spans both substrates. Trace events are produced at instrumented points throughout the system, routed through a sound-ordering algorithm (RIARC), and emitted as a single ordered stream into a WAL. The WAL is structured as a Merkle DAG: each batch of events is hashed, the batch hash is chained into the previous batch’s hash, and the resulting chain head is signed with a long-term key.
3.2 Determinism preconditions
For the chain head to function as a deterministic fingerprint of execution, three sources of non-determinism must be gated through abstractions that admit substitution.
Wall-clock timestamps. Every trace event records the time of its emission. Production code reads the system clock; under VOPR, the clock must be the simulator’s virtual clock. This is achieved through a Clock interface with a default production implementation reading the system clock and an alternative implementation backed by the simulator’s tick counter.
Wall-clock timers. Watchdogs, batch-close deadlines, and any other timer-driven behavior must read the same Clock interface. No direct calls to system timing services from instrumented code.
Key material. Signing keys are non-deterministic per machine in production (HSM, TPM, OS keystore). Under VOPR, keys must be deterministically derivable from the test seed. This is achieved through a KeyProvider interface with a production implementation backed by the OS keystore and an alternative implementation that materializes RFC 8032 deterministic keypairs from a seed.
These three sources pass through two seams, Clock and KeyProvider, which are the only points at which non-determinism enters the trace recording machinery. With both controlled, the chain head becomes a pure function of (code, seed, input).
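The substitution pattern behind the two seams can be sketched in Python as follows. This is an illustration under assumptions, not the WunderOS interfaces: the names tick_clock, key_provider_from_seed, and stamp_and_sign are hypothetical, and HMAC-SHA256 stands in for Ed25519 because the standard library has no Ed25519 implementation. HMAC is a symmetric MAC, not a signature scheme; it is used here only because it is deterministic in (seed, message), which is the property the seam needs.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class Clock:
    now_micros: Callable[[], int]

def wall_clock() -> Clock:
    """Production: read the system clock."""
    return Clock(now_micros=lambda: time.time_ns() // 1_000)

def tick_clock(micros_per_tick: int = 1_000) -> Tuple[Clock, Callable[[], None]]:
    """Simulation: time is a tick counter that only the returned advance() moves."""
    state = {"ticks": 0}
    def advance() -> None:
        state["ticks"] += 1
    return Clock(now_micros=lambda: state["ticks"] * micros_per_tick), advance

@dataclass(frozen=True)
class KeyProvider:
    sign: Callable[[bytes], bytes]
    pubkey: Callable[[], bytes]

def key_provider_from_seed(seed: bytes) -> KeyProvider:
    """Deterministic stand-in: HMAC-SHA256 keyed by the seed (NOT Ed25519)."""
    if len(seed) != 32:
        raise ValueError("seed must be 32 bytes")
    key_id = hashlib.sha256(b"key-id" + seed).digest()  # placeholder for a public key
    return KeyProvider(
        sign=lambda msg: hmac.new(seed, msg, hashlib.sha256).digest(),
        pubkey=lambda: key_id,
    )

def stamp_and_sign(clock: Clock, keys: KeyProvider, payload: bytes) -> Tuple[int, bytes]:
    """Instrumented code sees only the seams, never the system clock or keystore."""
    ts = clock.now_micros()
    return ts, keys.sign(ts.to_bytes(8, "big") + payload)
```

Code written against Clock and KeyProvider behaves identically in production and in simulation; only the constructor chosen at wiring time differs.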
3.3 The fingerprint reduction
Under VOPR control on the native substrate, the system runs against a seed s. Instrumented events flow through the trace recording layer. The chain head of the resulting Merkle DAG is recorded at the end of the run. The claim is that two runs against the same seed produce byte-identical chain heads.
Let exec(s) denote the execution under VOPR control given seed s, events(e) the trace events produced by execution e in sound order, merkle(E) the Merkle root over event sequence E, sign(k, h) the Ed25519 signature of h under key k, and kp(s) the deterministic keypair derived from s via RFC 8032. Treating the run as a single batch for brevity (with several seals, the same argument applies seal by seal along the chain), the signed chain head after a run from seed s is:
head(s) = sign(kp(s), merkle(events(exec(s))))
Each function in the composition is a deterministic mapping. exec is deterministic by VOPR construction (clock seam, IO interception, controlled scheduling). events is deterministic given exec because RIARC’s sound-ordering is a deterministic function of the event stream. merkle is deterministic given event order. kp is deterministic per RFC 8032. sign is deterministic per RFC 8032. The composition of deterministic functions is deterministic, so head(s) is a pure function of s.
For two runs r₁ and r₂ against the same seed s, this gives:
head(s)|r₁ = head(s)|r₂
The fingerprint reduction transforms a system-level determinism question (“does the whole system replay byte-exactly?”) into a single-value comparison (“does head(s) match across runs?”). Mismatch indicates divergence somewhere in the substrate. The chain structure localizes the divergence: walking the chain from genesis identifies the first batch whose seal differs across runs, which in turn identifies the first event sequence in which the substrate produced different output for the same input.
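The reduction and the localization step can be sketched together. The Python below is a toy model under loudly stated assumptions: the event generator is invented, a flat SHA-256 over event hashes stands in for the Merkle root, HMAC-SHA256 stands in for Ed25519 signing, and the function names run and first_divergence are hypothetical. It shows only the shape of the argument: equal seeds give equal seals, and a perturbation in batch i leaves seals before i untouched while changing seal i and everything after it.

```python
import hashlib
import hmac
import random
from typing import List, Optional

def run(seed: int, batches: int = 4, perturb_at: Optional[int] = None) -> List[bytes]:
    """Return the seal hash of every batch for a seeded toy run."""
    rng = random.Random(seed)
    key = bytes(rng.getrandbits(8) for _ in range(32))  # kp(s) stand-in, derived from seed
    prev = b"\x00" * 32                                  # assumed genesis value
    seals = []
    for b in range(batches):
        events = [("%d:%d" % (b, rng.randint(0, 999))).encode() for _ in range(3)]
        if b == perturb_at:
            events[0] += b"!"                            # simulated substrate divergence
        root = hashlib.sha256(b"".join(hashlib.sha256(e).digest() for e in events)).digest()
        sig = hmac.new(key, prev + root, hashlib.sha256).digest()  # NOT Ed25519
        prev = hashlib.sha256(prev + root + sig).digest()          # seal hash chains prev
        seals.append(prev)
    return seals

def first_divergence(a: List[bytes], b: List[bytes]) -> Optional[int]:
    """Walk both chains from genesis; return the first seal index that differs."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))
```

Comparing run(s)[-1] across two runs is the single-value check; first_divergence is what an investigator would call after that check fails.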
3.4 What the managed-runtime side gets
The managed-runtime side does not run under VOPR control. Its trace events enter the same recording layer and contribute to the same chain. From the chain’s perspective, events from the managed-runtime side are simply additional events; the chain does not care which substrate produced them.
This means the chain head is a fingerprint of the entire system’s execution, not just the native substrate’s. If the managed-runtime side is non-deterministic, the chain head will differ across runs regardless of seed. This is acceptable and indeed informative: it demarcates exactly which parts of the system contribute to the cross-run-stable fingerprint and which do not.
In WunderOS’s case, the managed-runtime side is intentionally allowed to be non-deterministic (BEAM scheduling, message ordering, OTP supervision behavior under fault) because language-native test discipline covers it. The chain head’s stability under VOPR replay is therefore a claim about runs whose events all come from the native substrate: because every seal hashes its predecessor, a single non-deterministic managed-runtime event changes that seal and every seal after it. Cross-substrate events (NIF call boundaries, message receipts) appear in the chain with timestamps from the Clock seam; their content is deterministic if the substrate that produced them is. The chain remains a useful determinism oracle for the native substrate even when the managed-runtime side contributes non-deterministic events to the same stream, since the first diverging seal shows where two runs part ways.
3.5 Substrate-shape mimicry
End-to-end determinism for the algorithm proper, the high-level coordination logic that runs on the managed-runtime side, is not directly testable under VOPR. The available technique is shape-mimicry: write Zig-side test scenarios that model the shape of the managed-runtime algorithm (its state machine transitions, its routing rules, its mode behavior) and run those scenarios under VOPR. The mimicry validates the substrate primitives the algorithm relies on. A bug found at the algorithm level can then be attributed to the algorithm rather than the substrate, to the extent that the mimic exercises the primitives involved, because those primitives have been validated under fault by VOPR coverage of the mimic.
This is a weaker guarantee than VOPR over the actual algorithm code. What it provides is a precise characterization of what each layer of testing covers: substrate primitives by VOPR, algorithm-level invariants by managed-runtime property tests, cross-substrate composition by chain-head fingerprint. Each guarantee is independent; together they cover the system.
4. Worked construction
The construction below describes the actual deployment in WunderOS.
4.1 Layer interfaces
Two Gleam interfaces gate the controlled non-determinism sources. The Clock interface exposes now_micros: fn() -> Int and is constructed either from the system clock (production) or from a caller-supplied function (test, VOPR). The KeyProvider interface exposes sign: fn(BitArray) -> BitArray and pubkey: fn() -> BitArray and is constructed either from a 32-byte seed (deterministic, RFC 8032) or from arbitrary signing functions (HSM, mock).
These interfaces are wired into the trace batcher (timestamps, signatures), the watchdog (drift detection), and the verifier (chain audit). Both interfaces have minimal call-site overhead. The Clock seam is sub-100ns absolute; the KeyProvider seam is dominated by Ed25519 signing cost (~30µs) and contributes negligible relative overhead.
Concretely, the seams are small Gleam interfaces with substitutable constructors:
// Signatures only; function bodies are omitted.

// Clock seam
pub type Clock {
  Clock(now_micros: fn() -> Int)
}
pub fn wall_clock() -> Clock              // production: erlang:system_time(microsecond)
pub fn from_fn(f: fn() -> Int) -> Clock   // VOPR: caller supplies time source

// KeyProvider seam
pub type KeyProvider {
  KeyProvider(sign: fn(BitArray) -> BitArray, pubkey: fn() -> BitArray)
}
pub fn from_seed(seed: BitArray) -> Result(KeyProvider, String)   // RFC 8032 deterministic
pub fn from_fn(sign: fn(BitArray) -> BitArray, pubkey: fn() -> BitArray) -> KeyProvider   // HSM / mock
Under VOPR, callers wire Clock from the simulator’s tick counter and KeyProvider from a seed derived from the simulator’s RNG. Same simulator seed gives same clock readings, same keypair, same signatures, byte-identical chain head.
4.2 The trace WAL and Merkle chain
Trace events are batched into seal records. Each seal record contains the batch’s events, a BLAKE3 Merkle root over those events, the previous seal’s hash, and an Ed25519 signature over the seal’s contents. The chain head is the hash of the most recent seal. Inclusion proofs allow any event to be verified against its seal’s Merkle root; chain walking allows any seal to be verified against the chain head.
The chain advances by one seal per batch close. Batch close is timer-driven (read through the Clock seam) and size-driven; under VOPR the timer fires at simulator-virtual times, deterministic from the seed.
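The interaction between batch close and the Clock seam can be sketched as a toy batcher. This Python sketch is an assumption-laden illustration, not the WunderOS batcher: the thresholds, the Batcher class, and scripted_run are hypothetical, a flat SHA-256 stands in for the BLAKE3 Merkle root, and the seal signature is omitted. The point it makes is that the timer path reads time only through the injected now_micros function, so under a virtual clock the sequence of seals is fixed by the script.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

GENESIS = b"\x00" * 32  # assumed genesis value

class Batcher:
    """Toy trace batcher: closes a batch on size or on age, whichever comes first."""

    def __init__(self, now_micros: Callable[[], int],
                 max_events: int = 3, max_age_micros: int = 5_000) -> None:
        self.now_micros = now_micros          # the Clock seam: the only time source
        self.max_events = max_events
        self.max_age_micros = max_age_micros
        self.pending: List[bytes] = []
        self.opened_at: Optional[int] = None
        self.prev_seal = GENESIS
        self.seals: List[Tuple[str, bytes]] = []

    def emit(self, event: bytes) -> None:
        self.poll()                           # close an expired batch first
        if not self.pending:
            self.opened_at = self.now_micros()
        self.pending.append(event)
        if len(self.pending) >= self.max_events:
            self._close("size")

    def poll(self) -> None:
        """Timer-driven close: called by the event loop or the simulator."""
        if self.pending and self.now_micros() - self.opened_at >= self.max_age_micros:
            self._close("timer")

    def _close(self, reason: str) -> None:
        root = hashlib.sha256(
            b"".join(hashlib.sha256(e).digest() for e in self.pending)).digest()
        self.prev_seal = hashlib.sha256(self.prev_seal + root).digest()
        self.seals.append((reason, self.prev_seal))
        self.pending = []

def scripted_run() -> List[Tuple[str, bytes]]:
    """Drive the batcher from a virtual clock, as a simulator would."""
    ticks = [0]
    batcher = Batcher(lambda: ticks[0])
    for event in (b"a", b"b", b"c", b"d"):    # third event triggers a size close
        batcher.emit(event)
    ticks[0] = 6_000                          # virtual time passes the age limit
    batcher.poll()                            # timer close seals the lone event
    return batcher.seals
```

In this script the first seal closes on size and the second on the timer, and repeating the script reproduces both seal hashes exactly.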
4.3 The invariant oracle
A test harness on the native side runs the system under VOPR control against a seed, captures the chain head at end of run, and compares against a recorded fingerprint. Five tests establish the property surface, stated as predicates on head(s) for runs of length N:
(determinism)   ∀ s. head(s)|r₁ = head(s)|r₂
(sensitivity)   ∀ s₁, s₂. s₁ ≠ s₂ ⇒ head(s₁) ≠ head(s₂)
(localization)  ∀ s, i. perturb(events(exec(s)), i) ⇒
                    seal(j) differs ∀ j ≥ i, seal(j) matches ∀ j < i
(inclusion)     ∀ s, e ∈ events(exec(s)).
                    verify(inclusion_proof(e), seal_root(batch_of(e))) = ✓
(chain-walk)    ∀ s. walk_prev(head(s)) reaches genesis after exactly N steps
                    ∧ ∀ k. seal(k).merkle_root = recorded_root(k)
A sixth scenario confirms that simulator clock drift modes (linear, periodic) do not break the determinism guarantee; the Clock seam contract holds under non-trivial virtual-time perturbation.
The first two predicates establish the fingerprint as a useful identity (deterministic and discriminating). The third localizes failures to specific batches under perturbation, which is the property an investigator actually needs when a regression appears. The fourth and fifth establish that the chain itself is a verifiable witness to its content, so the fingerprint is not just an opaque hash but a navigable structure.
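The inclusion predicate can be made concrete with a small Merkle tree. The following Python is a sketch under assumptions, not the WunderOS verifier: SHA-256 stands in for BLAKE3, the leaf and node prefixes and the duplicate-last-node padding are choices made for the example, and the function names are hypothetical.

```python
import hashlib
from typing import List, Tuple

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(events: List[bytes]) -> List[List[bytes]]:
    """Build all tree levels, leaves first; the last level holds the root."""
    if not events:
        raise ValueError("a batch needs at least one event")
    levels = [[_h(b"\x00" + e) for e in events]]        # 0x00 prefix marks leaves
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:                                 # pad odd levels
            cur = cur + [cur[-1]]
            levels[-1] = cur
        levels.append([_h(b"\x01" + cur[i] + cur[i + 1])  # 0x01 prefix marks nodes
                       for i in range(0, len(cur), 2)])
    return levels

def inclusion_proof(levels: List[List[bytes]], index: int) -> List[Tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof

def verify(event: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one event and its proof."""
    acc = _h(b"\x00" + event)
    for sibling, sibling_is_left in proof:
        acc = _h(b"\x01" + (sibling + acc if sibling_is_left else acc + sibling))
    return acc == root
```

A verifier holding only a seal's root can check any single event with a proof whose length is logarithmic in the batch size, without seeing the other events in the batch.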
4.4 Algorithm-shape mimicry tests
Five additional VOPR scenarios exercise the substrate-shape of the trace algorithm proper, mapping to the algorithm’s identified crash modes. Each scenario asserts a specific invariant: pending operations unwind correctly under participant death, drain handoff preserves exactly-once delivery semantics, watchdog restart degrades gracefully without phantom completions, concurrent operations on shared state do not cross-talk, producer-side store ordering preserves visibility discipline. A composition-determinism check confirms that the entire mimic state machine is deterministic under the simulator’s clock; drift in scenario tapes implies real divergence rather than test-harness flakiness.
4.5 Seam overhead validation
A separate benchmark validates that the seam interfaces do not impose meaningful runtime cost. The Clock seam is gated absolutely (≤100ns/call) rather than relatively, because relative overhead measures of sub-microsecond operations are misleading. The KeyProvider seam is gated relatively (≤5%) because signing cost dominates and seam dispatch is sub-microsecond. Both gates are continuously enforced by a scoreboard system.
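The reason for gating one seam absolutely and the other relatively shows up in simple arithmetic. The Python below is a worked sketch; the function names are hypothetical, and the measurements used in the usage note are illustrative numbers, not WunderOS benchmark results. Only the 100ns and 5% thresholds and the roughly 30µs signing cost come from the text above.

```python
def relative_overhead(baseline_ns: float, with_seam_ns: float) -> float:
    """Fractional slowdown introduced by the seam (0.05 means 5%)."""
    return (with_seam_ns - baseline_ns) / baseline_ns

def clock_gate_ok(with_seam_ns: float, limit_ns: float = 100.0) -> bool:
    """Absolute gate: the call through the seam must cost at most limit_ns."""
    return with_seam_ns <= limit_ns

def key_gate_ok(baseline_ns: float, with_seam_ns: float, limit: float = 0.05) -> bool:
    """Relative gate: the seam may add at most `limit` on top of signing cost."""
    return relative_overhead(baseline_ns, with_seam_ns) <= limit
```

Suppose a direct clock read takes 20ns and the read through the seam takes 45ns. The relative overhead is (45 - 20) / 20 = 125%, which looks alarming, yet 45ns passes the absolute gate comfortably. For signing, a 30,000ns baseline with 300ns of dispatch gives 1% relative overhead, well inside the 5% gate, and here the relative figure is the meaningful one.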
5. Discussion
5.1 What this gives that single-substrate VOPR does not
A heterogeneous-substrate system without composition is testable only at the substrate level: VOPR over the native substrate, language-native discipline over the managed substrate, no cross-layer guarantee. The fingerprint reduction adds a cross-layer guarantee at exactly the place that matters for audit applications: the trace layer, which is what auditors actually inspect. Determinism of the substrate primitives plus soundness of the trace plus cryptographic anchoring of the chain composes into a property the auditor can verify directly: the chain head is reproducible from the seed under VOPR conditions, and the chain itself is verifiable from its content under any conditions.
5.2 What this does not give
The technique does not make the managed-runtime side deterministic. A bug that manifests through managed-runtime scheduling, GC pause behavior, or concurrent message ordering is not reproducible from a VOPR seed. It is reproducible only from a recorded trace, and only to the extent that the trace captured enough information to drive replay. Deterministic replay of managed-runtime code is structurally a different problem (durable execution frameworks like Temporal solve it for restricted programming models, by recording the complete history of activity invocations and replaying against that history). The technique here does not attempt that and does not need to: language-native discipline plus sound trace plus chain anchoring is sufficient for the audit-grade property the system promises.
5.3 What this does not require
The technique does not require uniform discipline across substrates. The native side is deterministic by construction; the managed side is not; the composition still works because the determinism guarantee is reduced to a fingerprint computed at the trace boundary. This is the load-bearing observation: deterministic-simulation testing across heterogeneous substrates is tractable when the cross-substrate guarantee is reduced to comparison of cryptographic anchors rather than byte-exact replay of all substrate behavior.
5.4 Relationship to other deterministic-systems work
TigerBeetle’s VOPR, FoundationDB’s simulation testing, and Antithesis’s commercial offering all assume substrate uniformity at the level the technique operates on. TigerBeetle is single-language by design (Zig top to bottom); FoundationDB’s simulation testing operates within a controlled C++ runtime; Antithesis simulates an entire virtualized environment, which works but at significant operational and licensing cost. The composition technique described here is a cheaper path for systems whose substrate boundary is not virtualizable for cost reasons, or whose managed-runtime side has correctness disciplines that are appropriate to its substrate but not VOPR-shaped. Aceto et al.’s RIARC contributes the trace-soundness layer that makes the cross-substrate event stream coherent in the first place; without sound ordering across the boundary, the chain head’s determinism would be polluted by ordering noise rather than reflecting execution.
5.5 Limitations
The technique does not detect bugs that manifest only through managed-runtime scheduling pathologies (receive-queue ordering races, supervisor-restart timing, ETS access patterns under load). Such bugs are real, are categorically not VOPR-coverable, and require language-native chaos and property testing as a separate discipline. The composition does not subsume that discipline; it complements it.
The technique also does not reproduce outputs of non-deterministic managed-runtime executions. Two production runs of the same workload will produce different chain heads if the managed-runtime side contributes non-deterministic events, even though both runs are individually sound. The chain head is a deterministic fingerprint under VOPR control, not under arbitrary production conditions. The audit-grade guarantee customers receive is integrity (the chain itself is verifiable from its content) and replayability under controlled conditions (VOPR runs against a recorded seed), not byte-equality across production runs.
5.6 What audit customers actually receive
Three properties stack at the customer-facing surface. Soundness: the trace stream reflects an actual execution despite asynchronous loss and reordering, by RIARC. Integrity: any tampering with the trace is detectable, by Merkle chaining and Ed25519 signing of chain heads. Reproducibility under controlled conditions: the substrate primitives that underwrite the trace are deterministically replayable from a seed, by VOPR composition with the Clock and KeyProvider seams.
The third property is the contribution of this note. Soundness and integrity are achievable separately; reproducibility under VOPR is achievable for single-substrate systems; getting all three together across a heterogeneous substrate is what the fingerprint reduction enables.
6. Open questions
Several questions are deferred to a fuller treatment.
Quantitative bounds on substrate-coverage of the mimicry tests. Algorithm-shape mimicry on the native side validates substrate primitives the algorithm relies on, but the relationship between mimic coverage and algorithm correctness is informal. A precise characterization, namely what classes of algorithm bug are guaranteed to be substrate bugs given the mimic surface, would sharpen the guarantee considerably.
Selective extension to managed-runtime orchestration. The technique presented here treats the managed-runtime side as opaque to VOPR. A natural extension applies durable-execution-style controlled-scheduler determinism to specific orchestration code paths, not the full managed runtime, but a designated coordination subset. The combination of language-level testing, controlled-scheduler determinism on the coordination subset, and substrate-level VOPR on the native side would compose three independent determinism guarantees at three layers. We have not yet attempted this; it appears tractable.
Fingerprint stability across substrate evolution. The chain head is sensitive to event content and ordering; substrate changes that affect either will change the fingerprint even when the algorithm is unchanged. A versioning discipline that distinguishes algorithm-level fingerprints from substrate-level fingerprints would be useful for long-running deployments where the substrate evolves under a fixed algorithm.
Composition with adversarial models. Cryptographic anchoring provides tamper-evidence under standard cryptographic assumptions; VOPR provides bug-finding under controlled fault injection. The two threat models are disjoint and the technique inherits both. Whether composing them yields any cross-model guarantee, for instance detecting an adversary that subtly perturbs scheduling to influence outputs, is an open question.
7. Reference implementation
The technique is implemented in WunderOS, an agentic operating system from Pentad Labs. The native data plane is Zig with a VOPR-style harness covering substrate primitives, mode transitions, and crash-mode invariants. The managed control plane is Gleam on BEAM, with property tests, chaos tests, and Erlang QuickCheck/PropEr coverage at the language level. The trace recording layer implements RIARC with Merkle-chained seal records and Ed25519 chain-head signing. The Clock and KeyProvider seams gate the controlled non-determinism sources. The chain-head invariant oracle and five algorithm-shape mimicry scenarios run continuously under the VOPR harness; seam-overhead validation runs continuously against scoreboard gates.
8. Related work
Deterministic-simulation testing: TigerBeetle’s VOPR (in-tree, see TIGER_STYLE.md); FoundationDB’s simulation testing as described in Zhou et al.’s talks and the FDB whitepaper; Antithesis’s commercial deterministic-simulation platform. Sound trace recording: Aceto, Attard, Francalanza, Ingólfsdóttir, “Runtime Instrumentation for Reactive Components,” 2024, which introduces RIARC as a decentralized instrumentation algorithm for outline monitors. Cryptographic anchoring: standard Merkle-DAG and signed-chain constructions; Ed25519 per RFC 8032; BLAKE3 per the BLAKE3 specification. Heterogeneous-substrate testing: little prior art at the level the technique operates on; closest analogues are durable-execution frameworks (Temporal, Cadence) which solve a different problem (workflow replay) by recording activity histories rather than substrate-level events.
A note on method
Written in conversation with Claude Opus 4.7 (Anthropic) as structured interlocutor and prose editor. The ideas, claims, framing, and architectural commitments are mine.
Kendall Clark · k@pentad.ai
—Great Falls, Virginia
May 2026