The write-path tax¶
What you'll learn:

- why japes measurably loses the "naked writes" microbenchmarks to Dominion and Artemis,
- why that is an API trade-off rather than an engineering regression,
- what you pay per write and what you get in return, and
- what Valhalla could fix about it once the EA JIT stops boxing value records across the erased `Record` parameter boundary.
The benchmarks that show it¶
Three rows in the benchmark suite are dominated by the write-path tax (see iteration micros, N-Body, and the in-place writers in sparse delta):
- `iterateWithWrite` — tight loop writing back a mutated `Position` per entity.
- `NBodyBenchmark.simulateOneTick` — integrator system doing `pos = pos + vel * dt` on every body.
- `SparseDeltaBenchmark` driver-side writes — 100 `setComponent` calls per tick, each allocating a new component record.
On each of these, Dominion and Artemis come out 3–10× ahead of japes. Those libraries' components are mutable POJO classes, and the write is a single field store:
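A minimal sketch of what that single-field-store write looks like in a mutable-POJO ECS (the component classes are illustrative, not copied from Dominion or Artemis):

```java
// Mutable POJO components of the kind Dominion and Artemis iterate over
// (illustrative, not taken from either library's examples).
final class Position { float x, y; }
final class Velocity { float dx, dy; }

class Integrator {
    // The entire write path: two plain field stores, nothing else.
    void integrate(Position pos, Velocity vel, float dt) {
        pos.x += vel.dx * dt;
        pos.y += vel.dy * dt;
    }
}
```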
No allocation. No change tracking. Nothing else happens. The JIT keeps this loop in registers (and can often vectorise it) on every hot benchmark; the lead is a consequence of that architecture, not of better engineering.
japes's equivalent is:
```java
// japes — immutable record + Mut<T> write path
@System
void integrate(@Read Velocity v, @Write Mut<Position> p, Res<Dt> dt) {
    var pos = p.get();
    p.set(new Position(
        pos.x() + v.dx() * dt.get().value(),
        pos.y() + v.dy() * dt.get().value()
    ));
}
```
Every invocation allocates a fresh Position record, hands it to
Mut.set, and the flush path writes it back to the component
storage and marks the slot dirty on the component's
ChangeTracker. You pay:
- One `new Position(...)` — heap allocation (object header plus the two coordinate fields).
- One `Mut.set` field store.
- One `ComponentStorage.set(slot, value)` — an `aastore`.
- One `ChangeTracker.markChanged(slot, tick)` — one tick-array store, one dirty-bitmap test, and one dirty-list append (if the component is tracked).
That's 4–5 memory operations plus an allocation, versus Dominion's one field store. The ratio in the benchmark is the ratio of those costs, exactly as you would expect.
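A sketch of a `markChanged` with the cost profile itemised above (the fields and layout are illustrative, not the japes implementation; see the change-tracking page for the real structures):

```java
import java.util.ArrayList;
import java.util.BitSet;

// Illustrative tracking state for one component type; `capacity` = slot count.
class PositionChangeTracker {
    final long[] changedTick;
    final BitSet dirtyBitmap;
    final ArrayList<Integer> dirtyList = new ArrayList<>();

    PositionChangeTracker(int capacity) {
        changedTick = new long[capacity];
        dirtyBitmap = new BitSet(capacity);
    }

    void markChanged(int slot, long tick) {
        changedTick[slot] = tick;        // tick-array store
        if (!dirtyBitmap.get(slot)) {    // dirty-bitmap test
            dirtyBitmap.set(slot);
            dirtyList.add(slot);         // dirty-list append
        }
    }
}
```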
This is not the tier-1 generator being slow
The generated bytecode is already doing the minimum amount of
work consistent with the API contract: hoisting the Mut<T>
into a local, calling setContext(tracker, tick) once per
chunk, and calling resetValue(value, slot) per entity. There
is no reflection on the hot path, no boxing, no hash lookup.
The cost is the contract.
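Written out as plain Java, the generated per-chunk shape described above looks roughly like this (a sketch, not the literal generated code; the exact flush point is covered on the tier-1 generation page):

```java
// One chunk of the integrate system, as the tier-1 generator lays it out.
Mut<Position> p = mutPosition;               // Mut<T> hoisted into a local
p.setContext(tracker, tick);                 // once per chunk
for (int slot = 0; slot < chunkSize; slot++) {
    p.resetValue(positions[slot], slot);     // once per entity
    integrate(velocities[slot], p, dt);      // user system body; p.set() captures the write
    p.flush();                               // apply the captured write + markChanged
}
```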
What the contract gives you¶
Every cost item above corresponds to a semantic feature:
- Allocation of a new `Position` means components are immutable records. The benefits:
    - No accidental sharing — you cannot hand out a reference to a component from one system and have another system mutate it out from under you.
    - Value-based equality is free and correct. Records give you `equals` and `hashCode` keyed on the fields.
    - No thread-safety trap — a `Position` value pulled from storage in one thread cannot be observed mutating in another.
    - Pattern matching and destructuring work. `if (pos instanceof Position(var x, var y))` is free.
- `Mut.set` + flush means the write is captured in an intermediate object. Benefits:
    - The framework can decide whether the write actually happened. `@ValueTracked` records use `equals` to suppress a `markChanged` when the new value equals the old — e.g. if a damage system writes `Health(h, max)` but `h` is unchanged, no observer runs.
    - Writes are flushed at the end of the system body, not mid-iteration, so the user can read `p.get()` multiple times without triggering intermediate observer fanout.
- `markChanged` on the tracker means `@Filter(Changed)` observers work. This is the whole change-detection subsystem in one store. See change tracking for what runs on top of it.
- `ComponentStorage.set` writes the new immutable value into the parallel column, so subsequent reads in the same or later ticks see the new value. Same cost a mutable POJO would pay if it stored its component by reference.
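The record-side benefits in the first bullet are plain Java; a small self-contained example of that value semantics (field types are illustrative):

```java
public class ValueSemanticsDemo {
    record Position(double x, double y) {}

    public static void main(String[] args) {
        var a = new Position(1, 2);
        var b = new Position(1, 2);
        System.out.println(a.equals(b));              // true — equality keyed on the fields
        if (a instanceof Position(var x, var y)) {    // record pattern destructuring (Java 21+)
            System.out.println(x + ", " + y);
        }
        // There is no setter: handing `a` to another system cannot let that
        // system mutate the value out from under this one.
    }
}
```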
You cannot buy any of the top three items for free on a mutable-POJO API. The closest Dominion / Artemis users can come is:
- Hand-write a dirty list on every mutation site (see the "sparse delta" commentary in the benchmark — they append to an `ArrayList<Entity>` or `IntBag` per call).
- Hand-write observer loops that iterate that dirty list and call their reaction logic.
- Remember to maintain both on every code path that mutates the component.
At every code site. Forever. On every new system you add.
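What that maintenance burden looks like in code (an illustrative sketch; not Dominion or Artemis API):

```java
import java.util.ArrayList;
import java.util.List;

class HandRolledDirtyList {
    static final class Position { float x, y; }

    // One of these per observed component, maintained by hand.
    final List<Integer> dirtyPositions = new ArrayList<>();

    void move(int entity, Position pos, float dx, float dy) {
        pos.x += dx;
        pos.y += dy;
        dirtyPositions.add(entity);   // forget this line and the observer silently misses the change
    }

    void runPositionObserver() {
        for (int entity : dirtyPositions) {
            // react to the changed Position ...
        }
        dirtyPositions.clear();
    }
}
```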
The sparse-delta cell is the honest comparison¶
If you only look at the tightest write microbenchmark, Dominion
wins by ~8× and Artemis by ~13×. That's real for those micros. But
SparseDeltaBenchmark is the workload the library-change-detection
path is built for — 100 dirty entities out of 10 000 per tick —
and once you look there, the picture flips. From
DEEP_DIVE.md section 5:
- japes: 1.88 µs/op (library change detection, zero user bookkeeping)
- Bevy: 4.11 µs/op (native Rust change detection)
- Zay-ES: 4.67 µs/op (library change detection)
- Dominion: 0.37 µs/op (hand-rolled dirty list)
- Artemis: 0.27 µs/op (hand-rolled dirty list)
japes is 2.19× faster than Bevy on the library-change-detection workload. Dominion is 5× faster than japes — but only by hand-rolling the exact dirty-list machinery that japes ships in the box, only for this one component, only for this one observer, only as long as the user remembers to append to the dirty buffer at every mutation site for the lifetime of the project.
When you pair the per-write cost with the observer-side work, the arithmetic comes out in japes's favour the moment you have more than one observer on more than one component. The realistic multi-observer tick benchmark shows exactly this: japes is 1.50× faster than Bevy at 10k and 9.72× faster at 100k, because Bevy pays O(N) per observer per tick while japes pays O(K), where K is the number of entities that actually changed.
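To make that arithmetic concrete (the observer count is illustrative, not taken from the benchmark suite): with N = 10 000 entities, K = 100 changed per tick, and, say, four observers, scan-based detection performs 4 × 10 000 = 40 000 change checks every tick, while tracker-based detection visits 4 × 100 = 400 changed entries plus the 100 per-write markChanged stores.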
Composition, parallelism, and RemovedComponents¶
Three more things the tracked write path buys that a hand-rolled dirty list doesn't ship with.
Composition. N observers × M mutation sites = N×M append calls
that must stay in sync. The library indexes this once, centrally,
with @Filter annotations. Adding a new observer is one annotation
on one method; in hand-rolled land it's "find every mutation site
and add another append to the new observer's buffer."
Parallelism. japes's scheduler reads the @Read / @Write
annotations and runs disjoint systems in parallel for free. A
hand-rolled dirty list has no access metadata so the scheduler
can't help; if you want multi-core you wire up an
ExecutorService, a join barrier, and the three systems' disjoint-
component audit yourself.
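For illustration, two systems whose declared access sets are disjoint (the second system's components, `RegenRate` and `Health`, are invented for this sketch; the annotation shapes follow the earlier example):

```java
// The scheduler can see from the annotations alone that these two systems
// never write the same component, so they are safe to run in parallel.
@System
void integrate(@Read Velocity v, @Write Mut<Position> p, Res<Dt> dt) { /* ... */ }

@System
void regenerate(@Read RegenRate r, @Write Mut<Health> h, Res<Dt> dt) { /* ... */ }
```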
Added / Removed. @Filter(Added.class) and
RemovedComponents<T> are siblings of @Filter(Changed.class)
sharing most of the same machinery. In the hand-rolled approach
each one is a separate buffer you have to append to from every
create / destroy / remove site.
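On the japes side the shape is a single annotation on an ordinary system method; a sketch, assuming the signature mirrors the earlier example (only `@Filter(Added.class)` itself is named in the text, so the rest may differ):

```java
// Runs only for entities whose Position was added this tick.
@System
@Filter(Added.class)
void onPositionAdded(@Read Position p) {
    // spawn-time setup ...
}
```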
What Valhalla could fix (and what it currently doesn't)¶
The write-path tax's single biggest component is the per-write
new Position(...) allocation. Under JEP 401 (value records), a
value record Position on a Valhalla EA build should be flatten-
able through escape analysis: the JIT can prove the short-lived
Position instance doesn't escape the inner loop and fold it into
registers, at which point the allocation disappears and the write
becomes a plain store to the backing flat array.
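Concretely, the only source change JEP 401 asks of the component is the `value` modifier on the record declaration (early-access preview syntax, subject to change):

```java
// A value record has no identity, which is what lets the JIT scalarise the
// short-lived instance and flatten the backing storage.
value record Position(double x, double y) {}
```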
Measured on the iterateWithWrite row from
the Valhalla investigation page:
- Stock JDK 26: 38.5 µs at 10k, 377 µs at 100k.
- Valhalla EA with `value record`: 53.2 µs at 10k and 536 µs at 100k (0.72× and 0.70× of stock speed respectively).
Stock JDK 26 is now faster than Valhalla on writes. The blocker is that
World.setComponent takes Record as its declared parameter type,
which erases through the call site and forces the JVM to box the
value record into a heap wrapper crossing that boundary even
though the storage layer is value-aware. The dispatch chain can't
keep the value unboxed the whole way.
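A minimal, generic illustration of that boundary (not japes source): any path that funnels the component through a parameter declared as the reference supertype `Record` forces a heap box into existence at the call site, however value-aware the storage beneath it is.

```java
class ErasedBoundary {
    // The value record must be materialised as a heap Record reference here,
    // even if `column` could in principle hold it flat.
    static void store(Record[] column, int slot, Record value) {
        column[slot] = value;
    }
}
```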
The read side has no such boundary. iterateSingleComponent at
100k is ~37.5 µs stock and ~9.31 µs under Valhalla — 4.03×, the
biggest single gain in the whole benchmark suite. Reads go
chunk → storage → value without crossing any erased boundary, so
the flat-array layout delivers its full speedup.
The flat-array opt-in is a regression today
DefaultComponentStorage has a -Dzzuegg.ecs.useFlatStorage=true
opt-in that wires
jdk.internal.value.ValueClass.newNullRestrictedNonAtomicArray
into the storage constructor (see the static initialiser at
DefaultComponentStorage.java lines ~27–60). It makes the
backing array literally flat — ValueClass.isFlatArray returns
true. But the EA JIT has not yet optimised the flat-array
get/set code path, and an A/B comparison showed it ~3.5× slower
on iteration reads. The opt-in is kept because it'll become the
right default once the JIT catches up, but for now the
reference-array fallback is faster on every cell. See
valhalla investigation for the
table.
Recommendation¶
If you're writing a physics engine that needs every last nanosecond on the integration loop and you are willing to maintain your own dirty-list plumbing at every mutation site — Dominion or Artemis will win every tight-loop microbenchmark in this repo.
If you are writing a game loop with systems that compose,
parallelise, and react to mutations, and you want the compiler and
the library to carry the correctness burden — the write-path tax
is ~3 ns/entity on the tier-1 generator and you get change
detection, parallelism, and removed-component observability for
free. Every benchmark with more than one observer on more than one
component shows this paying off; the realistic-tick row is where
it shows most clearly.
The tax is the contract. The library is built around making it as
small as possible — tier-1 generation, chunk-hoisted setContext,
raw Record[] direct stores, untracked short-circuit for
unobserved components — but it cannot be driven to zero without
giving up the semantics it exists to provide.
Related¶
- Change tracking — the data structures on the receiving end of every `markChanged` call
- Architecture — what `ComponentStorage.set(slot, value)` is actually writing into
- Tier-1 bytecode generation — how `Mut<T>.setContext` + `resetValue` + `flush` become straight-line bytecode with no allocation on the hot path
- Valhalla investigation — the current state of play on whether JEP 401 can eliminate any of this
- Benchmarks — iteration micros — the tables this page is explaining
- Benchmarks — realistic tick — the workload where the trade-off flips in japes's favour