The write-path tax¶
What you'll learn:

- why japes measurably loses the "naked writes" microbenchmarks to Dominion and Artemis,
- why that is an API trade-off rather than an engineering regression,
- what you pay per write and what you get in return, and
- what Valhalla could fix about it once the EA JIT stops boxing value records across the erased `Record` parameter boundary.
The benchmarks that show it¶
Three rows in the benchmark suite are dominated by the write-path tax (see iteration micros, N-Body, and the in-place writers in sparse delta):
- `iterateWithWrite` — tight loop writing back a mutated `Position` per entity.
- `NBodyBenchmark.simulateOneTick` — integrator system doing `pos = pos + vel * dt` on every body.
- `SparseDeltaBenchmark` driver-side writes — 100 `setComponent` calls per tick, each allocating a new component record.
On each of these, Dominion and Artemis come out 3–10× ahead of japes. Those libraries' components are mutable POJO classes, and the write is a single field store:
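A minimal sketch of what that single-field-store write looks like in a mutable-POJO ECS (the component classes are illustrative, not copied from Dominion or Artemis):

```java
// Mutable POJO components of the kind Dominion and Artemis iterate over
// (illustrative, not taken from either library's examples).
final class Position { float x, y; }
final class Velocity { float dx, dy; }

class Integrator {
    // The entire write path: two plain field stores, nothing else.
    void integrate(Position pos, Velocity vel, float dt) {
        pos.x += vel.dx * dt;
        pos.y += vel.dy * dt;
    }
}
```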
No allocation. No change tracking. Nothing else happens. The JIT keeps this loop in registers (and can often vectorise it) on every hot benchmark; the lead is a consequence of that architecture, not of better engineering.
japes's equivalent is:
```java
// japes — immutable record + Mut<T> write path
@System
void integrate(@Read Velocity v, @Write Mut<Position> p, Res<Dt> dt) {
    var pos = p.get();
    p.set(new Position(
        pos.x() + v.dx() * dt.get().value(),
        pos.y() + v.dy() * dt.get().value()
    ));
}
```
Every invocation allocates a fresh Position record, hands it to
Mut.set, and the flush path writes it back to the component
storage and marks the slot dirty on the component's
ChangeTracker. You pay:
- One `new Position(...)` — heap allocation (object header plus the two coordinate fields).
- One `Mut.set` field store.
- One `ComponentStorage.set(slot, value)` — an `aastore`.
- One `ChangeTracker.markChanged(slot, tick)` — one tick-array store, one dirty-bitmap test, and one dirty-list append (if the component is tracked).
That's 4–5 memory operations plus an allocation, versus Dominion's one field store. The ratio in the benchmark is the ratio of those costs, exactly as you would expect.
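A sketch of a `markChanged` with the cost profile itemised above (the fields and layout are illustrative, not the japes implementation; see the change-tracking page for the real structures):

```java
import java.util.ArrayList;
import java.util.BitSet;

// Illustrative tracking state for one component type; `capacity` = slot count.
class PositionChangeTracker {
    final long[] changedTick;
    final BitSet dirtyBitmap;
    final ArrayList<Integer> dirtyList = new ArrayList<>();

    PositionChangeTracker(int capacity) {
        changedTick = new long[capacity];
        dirtyBitmap = new BitSet(capacity);
    }

    void markChanged(int slot, long tick) {
        changedTick[slot] = tick;        // tick-array store
        if (!dirtyBitmap.get(slot)) {    // dirty-bitmap test
            dirtyBitmap.set(slot);
            dirtyList.add(slot);         // dirty-list append
        }
    }
}
```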
This is not the tier-1 generator being slow
The generated bytecode is already doing the minimum amount of
work consistent with the API contract: hoisting the Mut<T>
into a local, calling setContext(tracker, tick) once per
chunk, and calling resetValue(value, slot) per entity. There
is no reflection on the hot path, no boxing, no hash lookup.
The cost is the contract.
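Written out as plain Java, the generated per-chunk shape described above looks roughly like this (a sketch, not the literal generated code; the exact flush point is covered on the tier-1 generation page):

```java
// One chunk of the integrate system, as the tier-1 generator lays it out.
Mut<Position> p = mutPosition;               // Mut<T> hoisted into a local
p.setContext(tracker, tick);                 // once per chunk
for (int slot = 0; slot < chunkSize; slot++) {
    p.resetValue(positions[slot], slot);     // once per entity
    integrate(velocities[slot], p, dt);      // user system body; p.set() captures the write
    p.flush();                               // apply the captured write + markChanged
}
```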
What the contract gives you¶
Every cost item above corresponds to a semantic feature:
- Allocation of a new `Position` means components are immutable records. The benefits:
    - No accidental sharing — you cannot hand out a reference to a component from one system and have another system mutate it out from under you.
    - Value-based equality is free and correct. Records give you `equals` and `hashCode` keyed on the fields.
    - No thread-safety trap — a `Position` value pulled from storage in one thread cannot be observed mutating in another.
    - Pattern matching and destructuring work. `if (pos instanceof Position(var x, var y))` is free.
- `Mut.set` + flush means the write is captured in an intermediate object. Benefits:
    - The framework can decide whether the write actually happened. `@ValueTracked` records use `equals` to suppress a `markChanged` when the new value equals the old — e.g. if a damage system writes `Health(h, max)` but `h` is unchanged, no observer runs.
    - Writes are flushed at the end of the system body, not mid-iteration, so the user can read `p.get()` multiple times without triggering intermediate observer fanout.
- `markChanged` on the tracker means `@Filter(Changed)` observers work. This is the whole change-detection subsystem in one store. See change tracking for what runs on top of it.
- `ComponentStorage.set` writes the new immutable value into the parallel column, so subsequent reads in the same or later ticks see the new value. Same cost a mutable POJO would pay if it stored its component by reference.
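The record-side benefits in the first bullet are plain Java; a small self-contained example of that value semantics (field types are illustrative):

```java
public class ValueSemanticsDemo {
    record Position(double x, double y) {}

    public static void main(String[] args) {
        var a = new Position(1, 2);
        var b = new Position(1, 2);
        System.out.println(a.equals(b));              // true — equality keyed on the fields
        if (a instanceof Position(var x, var y)) {    // record pattern destructuring (Java 21+)
            System.out.println(x + ", " + y);
        }
        // There is no setter: handing `a` to another system cannot let that
        // system mutate the value out from under this one.
    }
}
```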
You cannot buy any of the top three items for free on a mutable-POJO API. The closest Dominion / Artemis users can come is:
- Hand-write a dirty list on every mutation site (see the "sparse delta" commentary in the benchmark — they append to an `ArrayList<Entity>` or `IntBag` per call).
- Hand-write observer loops that iterate that dirty list and call their reaction logic.
- Remember to maintain both on every code path that mutates the component.
At every code site. Forever. On every new system you add.
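What that maintenance burden looks like in code (an illustrative sketch; not Dominion or Artemis API):

```java
import java.util.ArrayList;
import java.util.List;

class HandRolledDirtyList {
    static final class Position { float x, y; }

    // One of these per observed component, maintained by hand.
    final List<Integer> dirtyPositions = new ArrayList<>();

    void move(int entity, Position pos, float dx, float dy) {
        pos.x += dx;
        pos.y += dy;
        dirtyPositions.add(entity);   // forget this line and the observer silently misses the change
    }

    void runPositionObserver() {
        for (int entity : dirtyPositions) {
            // react to the changed Position ...
        }
        dirtyPositions.clear();
    }
}
```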
The sparse-delta cell is the honest comparison¶
If you only look at the tightest write microbenchmark, Dominion
wins by ~8× and Artemis by ~13×. That's real for those micros. But
SparseDeltaBenchmark is the workload the library-change-detection
path is built for — 100 dirty entities out of 10 000 per tick —
and once you look there, the picture flips. From
DEEP_DIVE.md section 5:
- japes: 1.88 µs/op (library change detection, zero user bookkeeping)
- Bevy: 4.11 µs/op (native Rust change detection)
- Zay-ES: 4.67 µs/op (library change detection)
- Dominion: 0.37 µs/op (hand-rolled dirty list)
- Artemis: 0.27 µs/op (hand-rolled dirty list)
japes is 2.19× faster than Bevy on the library-change-detection workload. Dominion is 5× faster than japes — but only by hand-rolling the exact dirty-list machinery that japes ships in the box, only for this one component, only for this one observer, only as long as the user remembers to append to the dirty buffer at every mutation site for the lifetime of the project.
When you pair the per-write cost with the observer-side work, the arithmetic comes out in japes's favour the moment you have more than one observer on more than one component. The realistic multi-observer tick benchmark shows exactly this: japes is 1.50× faster than Bevy at 10k and 9.72× faster at 100k, because Bevy pays O(N) per observer per tick while japes pays O(K), where K is the number of entities that actually changed.
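To make that arithmetic concrete (the observer count is illustrative, not taken from the benchmark suite): with N = 10 000 entities, K = 100 changed per tick, and, say, four observers, scan-based detection performs 4 × 10 000 = 40 000 change checks every tick, while tracker-based detection visits 4 × 100 = 400 changed entries plus the 100 per-write markChanged stores.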
Composition, parallelism, and RemovedComponents¶
Three more things the tracked write path buys that a hand-rolled dirty list doesn't ship with.
Composition. N observers × M mutation sites = N×M append calls
that must stay in sync. The library indexes this once, centrally,
with @Filter annotations. Adding a new observer is one annotation
on one method; in hand-rolled land it's "find every mutation site
and add another append to the new observer's buffer."
Parallelism. japes's scheduler reads the @Read / @Write
annotations and runs disjoint systems in parallel for free. A
hand-rolled dirty list has no access metadata so the scheduler
can't help; if you want multi-core you wire up an
ExecutorService, a join barrier, and the three systems' disjoint-
component audit yourself.
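For illustration, two systems whose declared access sets are disjoint (the second system's components, `RegenRate` and `Health`, are invented for this sketch; the annotation shapes follow the earlier example):

```java
// The scheduler can see from the annotations alone that these two systems
// never write the same component, so they are safe to run in parallel.
@System
void integrate(@Read Velocity v, @Write Mut<Position> p, Res<Dt> dt) { /* ... */ }

@System
void regenerate(@Read RegenRate r, @Write Mut<Health> h, Res<Dt> dt) { /* ... */ }
```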
Added / Removed. @Filter(Added.class) and
RemovedComponents<T> are siblings of @Filter(Changed.class)
sharing most of the same machinery. In the hand-rolled approach
each one is a separate buffer you have to append to from every
create / destroy / remove site.
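On the japes side the shape is a single annotation on an ordinary system method; a sketch, assuming the signature mirrors the earlier example (only `@Filter(Added.class)` itself is named in the text, so the rest may differ):

```java
// Runs only for entities whose Position was added this tick.
@System
@Filter(Added.class)
void onPositionAdded(@Read Position p) {
    // spawn-time setup ...
}
```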
What Valhalla could fix (and what it currently doesn't)¶
The write-path tax's single biggest component is the per-write
new Position(...) allocation. Under JEP 401 (value records), a
value record Position on a Valhalla EA build should be flatten-
able through escape analysis: the JIT can prove the short-lived
Position instance doesn't escape the inner loop and fold it into
registers, at which point the allocation disappears and the write
becomes a plain store to the backing flat array.
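Concretely, the only source change JEP 401 asks of the component is the `value` modifier on the record declaration (early-access preview syntax, subject to change):

```java
// A value record has no identity, which is what lets the JIT scalarise the
// short-lived instance and flatten the backing storage.
value record Position(double x, double y) {}
```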
Measured on the iterateWithWrite row from
the Valhalla investigation page:
- Stock JDK 26: 38.5 µs at 10k, 377 µs at 100k.
- Valhalla EA with `value record`: 53.2 µs at 10k and 536 µs at 100k (0.72× and 0.70× of stock speed respectively).
Stock JDK 26 is now faster than Valhalla on writes. The blocker is that
World.setComponent takes Record as its declared parameter type,
which erases through the call site and forces the JVM to box the
value record into a heap wrapper crossing that boundary even
though the storage layer is value-aware. The dispatch chain can't
keep the value unboxed the whole way.
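A minimal, generic illustration of that boundary (not japes source): any path that funnels the component through a parameter declared as the reference supertype `Record` forces a heap box into existence at the call site, however value-aware the storage beneath it is.

```java
class ErasedBoundary {
    // The value record must be materialised as a heap Record reference here,
    // even if `column` could in principle hold it flat.
    static void store(Record[] column, int slot, Record value) {
        column[slot] = value;
    }
}
```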
The read side has no such boundary. iterateSingleComponent at
100k is ~37.5 µs stock and ~9.31 µs under Valhalla — 4.03×, the
biggest single gain in the whole benchmark suite. Reads go
chunk → storage → value without crossing any erased boundary, so
the flat-array layout delivers its full speedup.
The flat-array opt-in is a regression today
DefaultComponentStorage has a -Dzzuegg.ecs.useFlatStorage=true
opt-in that wires
jdk.internal.value.ValueClass.newNullRestrictedNonAtomicArray
into the storage constructor (see the static initialiser at
DefaultComponentStorage.java lines ~27–60). It makes the
backing array literally flat — ValueClass.isFlatArray returns
true. But the EA JIT has not yet optimised the flat-array
get/set code path, and an A/B comparison showed it ~3.5× slower
on iteration reads. The opt-in is kept because it'll become the
right default once the JIT catches up, but for now the
reference-array fallback is faster on every cell. See
valhalla investigation for the
table.
Recommendation¶
If you're writing a physics engine that needs every last nanosecond on the integration loop and you are willing to maintain your own dirty-list plumbing at every mutation site — Dominion or Artemis will win every tight-loop microbenchmark in this repo.
If you are writing a game loop with systems that compose,
parallelise, and react to mutations, and you want the compiler and
the library to carry the correctness burden — the write-path tax
is ~3 ns/entity on the tier-1 generator and you get change
detection, parallelism, and removed-component observability for
free. Every benchmark with more than one observer on more than one
component shows this paying off; the realistic-tick row is where
it shows most clearly.
The tax is the contract. The library is built around making it as
small as possible — tier-1 generation, chunk-hoisted setContext,
raw Record[] direct stores, untracked short-circuit for
unobserved components — but it cannot be driven to zero without
giving up the semantics it exists to provide.
Related¶
- Change tracking — the data structures on the receiving end of every `markChanged` call
- Architecture — what `ComponentStorage.set(slot, value)` is actually writing into
- Tier-1 bytecode generation — how `Mut<T>.setContext` + `resetValue` + `flush` become straight-line bytecode with no allocation on the hot path
- Valhalla investigation — the current state of play on whether JEP 401 can eliminate any of this
- Benchmarks — iteration micros — the tables this page is explaining
- Benchmarks — realistic tick — the workload where the trade-off flips in japes's favour