Realistic multi-observer tick¶
RealisticTickBenchmark has the shape of a real game loop. It exists
to answer the obvious follow-up to the sparse-delta page: "but what
if I add more observers?"
What this workload measures¶
- Population. 10 000 (and, for the scaling cell, 100 000) entities carrying {Position, Velocity, Health, Mana}.
- Mutation. 1% turnover per tick — 100 sparse mutations per component, three rotating cursors so the three slices don't overlap.
- Observation. Three @Filter(Changed) observers, one per component, each accumulating a per-observer sum into a shared Stats resource.
- Executors. Two variants: st (single-threaded scheduler) and mt (japes MultiThreadedExecutor, ForkJoinPool-backed, parallelises disjoint systems automatically).
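The rotating-cursor mutation pattern can be sketched independently of any ECS. This is an illustrative model of the driver shape described above, not the actual benchmark code; the class name, `N`, and `DIRTY` are assumptions matching the stated workload (10k entities, 1% per component per tick):

```java
// Sketch of the 1% rotating-cursor mutation pattern (illustrative, not the
// real driver). Three cursors, one per mutated component, each advancing by
// DIRTY slots per tick and offset by N/3 so the three slices never overlap.
public class RotatingCursors {
    static final int N = 10_000;   // entity count (10k cell)
    static final int DIRTY = 100;  // 1% of N, per component, per tick

    final int[] cursors = {0, N / 3, 2 * N / 3};

    /** Returns the slot indices each of the three passes mutates this tick. */
    int[][] nextTick() {
        int[][] slices = new int[3][DIRTY];
        for (int c = 0; c < 3; c++) {
            for (int i = 0; i < DIRTY; i++) {
                slices[c][i] = (cursors[c] + i) % N;
            }
            cursors[c] = (cursors[c] + DIRTY) % N; // advance for next tick
        }
        return slices;
    }

    public static void main(String[] args) {
        int[][] t0 = new RotatingCursors().nextTick();
        // Slices start N/3 apart and advance in lockstep, so they stay disjoint.
        System.out.println(t0[0][0] + " " + t0[1][0] + " " + t0[2][0]); // 0 3333 6666
    }
}
```

Because all three cursors advance at the same rate, the 100-slot windows remain disjoint on every tick, which is what keeps the three observer passes conflict-free.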
Counterparts exist for every library in the comparison. Bevy and
Zay-ES use their native change-detection primitives (Changed<T>
query filter and EntitySet.getChangedEntities() respectively).
Dominion and Artemis have no built-in change detection, so their
observer passes do full iterations over every entity with the
component — the "lazy user" path. The mt variant in Dominion /
Artemis dispatches the three observer passes to a fixed
ExecutorService, exactly what japes does for you from the
declared system access metadata.
Results¶
100 dirty per component per tick, µs/op — lower is better. Copied
verbatim from DEEP_DIVE.md.
| library | 10k µs/op | 100k µs/op | scaling | cost model |
|---|---|---|---|---|
| japes st | 6.94 | 12.7 | 1.83× | dirty-list skip (scales with K) |
| zay-es | 15.4 | 19.6 | 1.27× | dirty-list skip (scales with K) |
| bevy (native Rust) | 8.81 | 76.9 | 8.73× | full archetype scan (scales w/ N) |
| artemis st | 24.5 | 279 | 11.4× | full archetype scan (no CD) |
| dominion st | 44.6 | 389 | 8.72× | full archetype scan (no CD) |
The libraries split into two cost-model camps, empirically:
- Dirty-list skip (japes, Zay-ES) — a per-archetype list of slot indices that were mutated since the last prune. @Filter(Changed) / EntitySet.getChangedEntities() walks only that list. Per-tick cost is O(K) where K is the dirty count, not O(N) where N is total entities. Scaling from 10k→100k costs ~83% more on japes (larger handle array for the driver's getComponent lookups) and ~27% more on Zay-ES.
- Full-archetype scan (Bevy, Dominion, Artemis) — observers iterate the full archetype and either tick-compare every entity (Bevy's Changed<T>) or walk every component regardless (Dominion's findEntitiesWith, Artemis's IteratingSystem with no filter). Per-tick cost is O(N) because that's the algorithmic shape. Scaling from 10k→100k costs ~8–11× more.
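The two camps reduce to a toy model. `dirtyListPass` and `tickScanPass` below are illustrative stand-ins for the two cost shapes, not the real japes or Bevy internals:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two observer cost shapes (not real library internals).
// dirtyListPass walks only the K mutated slots; tickScanPass compares a
// change-tick for all N slots and filters down to the same K.
public class CostModels {
    // Dirty-list skip: O(K) per tick, list pruned after the observer pass.
    static long dirtyListPass(double[] hp, List<Integer> dirty) {
        long sum = 0;
        for (int slot : dirty) sum += (long) hp[slot]; // K iterations
        return sum;
    }

    // Full-archetype tick scan: O(N) per tick regardless of K.
    static long tickScanPass(double[] hp, int[] changedTick, int lastRun) {
        long sum = 0;
        for (int i = 0; i < hp.length; i++) {          // N iterations
            if (changedTick[i] > lastRun) sum += (long) hp[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 100_000, k = 100;
        double[] hp = new double[n];
        int[] ticks = new int[n];
        List<Integer> dirty = new ArrayList<>();
        for (int i = 0; i < k; i++) { hp[i] = 5; ticks[i] = 1; dirty.add(i); }
        // Same answer, different cost shape: O(K) loop vs O(N) loop.
        System.out.println(dirtyListPass(hp, dirty) == tickScanPass(hp, ticks, 0));
    }
}
```

Both passes produce the same sum; the difference is only how many slots each must visit to find the K changed ones, which is exactly the split the table above measures.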
At 10k entities japes beats Bevy by 1.27×. The gap looks modest because 10k is small enough that Bevy's tight cache-friendly tick scan only pays ~3 µs of pure scan cost. At 100k entities the gap on the same workload is 6.06×: Bevy pays ~64 µs extra to scan 90 000 more tick words that japes never touches.
Worth calling out: Zay-ES beats Bevy at 100k (19.6 vs 76.9).
Zay-ES has higher per-mutation overhead than japes (more allocations
on the driver side, per-set applyChanges() calls), but its
EntitySet.getChangedEntities() is a dirty-list skip, so it scales
with the same shape as japes. The two dirty-list libraries stay in
the same cost bucket at any entity count; the three scan libraries
scale out of it past ~50k.
How the two cost models separate¶
The per-additional-entity cost at the 10k → 100k step tells the whole story:
| library | Δ µs for Δ 90k entities | per-entity overhead |
|---|---|---|
| japes st | +5.76 | 64 ps / entity |
| zay-es | +4.20 | 47 ps / entity |
| bevy | +68.1 | 757 ps / entity |
| artemis st | +254 | 2.83 ns / entity |
| dominion st | +344 | 3.83 ns / entity |
japes's ~64 ps/entity is driver-side cost (the handle list grows, the archetype's chunk list grows, SoA field-array resizing). The observer side is ~flat because the dirty list is still 300 slots.
Bevy's ~757 ps/entity breaks down as 3 observers × ~252 ps each: every observer does roughly one tick-word load + compare + branch per entity, and ~0.25 ns/check × 100k entities × 3 observers ≈ 76 µs. Matches.
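The per-entity figures fall straight out of the table deltas. This check just re-derives them from the measured values quoted above (the class name is illustrative):

```java
// Re-derive the per-entity overhead from the 10k→100k deltas quoted above.
public class PerEntity {
    public static void main(String[] args) {
        double deltaEntities = 90_000;
        double bevyDeltaUs = 68.1;                // 76.9 - 8.81 µs
        double perEntityNs = bevyDeltaUs * 1_000 / deltaEntities;
        System.out.printf("bevy: %.3f ns/entity%n", perEntityNs);  // ~0.757
        // Cross-check: ~0.25 ns per tick-word check × 100k entities × 3 observers,
        // expressed in µs.
        double scanUs = 0.25e-9 * 100_000 * 3 * 1e6;
        System.out.printf("predicted scan cost: %.0f us%n", scanUs); // ~75
    }
}
```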
Dominion / Artemis pay more per entity because their full scans
happen in the user-facing benchmark driver too (each observer calls
findEntitiesWith / IteratingSystem.process which rebuilds its
iterator state), not just inside a tight Bevy-style Changed<T>
filter.
Why Bevy doesn't ship a dirty-slot list for Changed<T>¶
It's a deliberate API trade-off, not a missed optimisation. Tick-per-slot is cheaper per mutation (one store, no dedup, no append), which matters for Bevy's target workload: dense simulation where most components are touched every tick and a dirty list would contain most of the world. The dirty-list trade-off points the other way: it wins on sparse deltas and loses on dense ones.
japes pays ~5–10 ns extra per mutation for the dirty-list
maintenance, which is invisible at 300 mutations/tick (total
~3 µs) but would start to hurt at millions of mutations/tick.
Run japes on iterateWithWrite (every entity touched every
tick, K = N) and Bevy wins by ~6× — the opposite direction,
same cost model.
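A back-of-the-envelope break-even makes the crossover concrete. The only inputs are the ~5–10 ns/mutation maintenance figure from above and the ~0.25 ns/entity tick-word check; both per-operation costs are rough assumptions, not measurements from this sketch:

```java
// Rough break-even model for the two cost shapes:
//   dirty-list observer tick ≈ K × (maintenance + visit)
//   tick-scan observer tick  ≈ N × check
// At K = 300 the dirty list wins comfortably; at K = N it loses.
public class BreakEven {
    public static void main(String[] args) {
        double maintainNs = 7.5;  // assumed ~5-10 ns/mutation dirty-list upkeep
        double checkNs = 0.25;    // assumed ~one tick-word compare per entity
        int n = 100_000;
        for (int k : new int[]{300, n}) {
            double dirtyUs = k * maintainNs / 1_000;
            double scanUs = n * checkNs / 1_000;
            System.out.printf("K=%d: dirty-list %.2f us vs tick-scan %.2f us%n",
                    k, dirtyUs, scanUs);
        }
    }
}
```

With these constants the sparse case (K = 300) costs ~2 µs against a ~25 µs scan, and the dense case (K = N) inverts the ordering, which is the direction the iterateWithWrite result above shows.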
DCE safety¶
Before trusting these numbers, the obvious question is "are we
hitting a dead-code-elimination trap anywhere?" The Bevy observer
body writes into ResMut<RtStats> which is never read outside the
benchmark closure — if the compiler can prove the writes have no
observable effect, it's allowed to delete the observer bodies
entirely.
Explicitly checked:
- japes. The @Benchmark body calls bh.consume(stats.sumX) / sumHp / sumMana at the end of every tick. JMH's Blackhole.consume is opaque to the JIT, so the accumulation chain is preserved.
- Bevy. The b.iter(||) closure calls world.resource::<RtStats>() + std::hint::black_box(stats.sum_x) / sum_hp / sum_mana after schedule.run. black_box is rustc's equivalent of Blackhole.consume.
Re-ran Bevy after adding the black_box guards: result 8.81 µs at
10k (was 8.80 µs without the guard). Delta is pure measurement
noise, which means DCE wasn't happening even without the guard —
the cross-crate call chain
schedule.run → system fn pointer → observer body already defeats
rustc's DCE at the default cargo bench opt level (opt-level = 3,
no LTO). The guard is there as insurance for future readers.
Same-work audit — driver parity¶
Each library's driver does 300 sparse mutations per tick via three rotating cursors. The operation shapes differ slightly:
| library | operation per mutation | per-mutation alloc |
|---|---|---|
| japes | world.setComponent(e, new Position(...)) (new record) | allocates |
| zay-es | data.setComponent(id, new Position(...)) | allocates |
| bevy | world.get_mut::<Position>(e).x += 1.0 | in-place |
| dominion | e.get(Position.class).x += 1 (mutable POJO) | in-place |
| artemis | pm.get(e).x += 1 (mutable Component subclass) | in-place |
This is an asymmetry on the driver side: japes and Zay-ES allocate 300 record instances per tick that Bevy / Dominion / Artemis don't. The direction of the asymmetry favours Bevy / Dominion / Artemis: japes is paying extra allocation cost its comparison peers aren't, and still winning. If we fixed the asymmetry (either by making japes's driver mutate in place somehow, or by making Bevy's driver allocate new records), the 6.06× gap at 100k would widen further, not shrink.
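The asymmetry is just the difference between replacing an immutable record and mutating a field in place. A minimal sketch, where `Position` and `MutablePosition` are stand-ins for the benchmark's components rather than actual driver code:

```java
// The two mutation shapes from the table above, in isolation.
// japes/Zay-ES drivers allocate a fresh immutable record per mutation;
// Bevy/Dominion/Artemis drivers mutate a field in place.
public class MutationShapes {
    record Position(double x, double y) {}          // japes-style component
    static class MutablePosition { double x, y; }   // Dominion/Artemis-style

    public static void main(String[] args) {
        Position p = new Position(1.0, 2.0);
        p = new Position(p.x() + 1.0, p.y());        // allocates a new record
        MutablePosition m = new MutablePosition();
        m.x += 1.0;                                  // in-place, no allocation
        System.out.println(p.x() + " " + m.x);       // 2.0 1.0
    }
}
```

The record replacement is what the change-detection machinery sees as a `setComponent` call; the in-place write is invisible to anything that isn't already scanning or tick-stamping.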
Code comparison (multi-threaded path)¶
The japes observer is nine lines including the class declaration:
public static final class HealthObserver {
final Stats stats;
HealthObserver(Stats stats) { this.stats = stats; }
@System(stage = "PostUpdate")
@Filter(value = Changed.class, target = Health.class)
void observe(@Read Health h) {
stats.sumHp += h.hp();
}
}
// ... and the builder:
World.builder()
.executor(Executors.multiThreaded()) // <-- parallelism opt-in
.addSystem(new PositionObserver(stats))
.addSystem(new HealthObserver(stats))
.addSystem(new ManaObserver(stats))
.build();
// scheduler knows these observers read disjoint components
// and fans them out across the ForkJoinPool automatically.
And the Dominion / Artemis counterpart (in the mt path — the st
path is similar but without the executor):
ExecutorService pool = Executors.newFixedThreadPool(3);
private long observePositions() {
long sum = 0;
var rs = world.findEntitiesWith(Position.class);
for (var r : rs) sum += (long) r.comp().x;
return sum;
}
// ... and two more observer functions ...
public void tick() throws Exception { // Future.get throws checked exceptions
// ... sparse mutations ...
var f1 = pool.submit(this::observePositions);
var f2 = pool.submit(this::observeHealths);
var f3 = pool.submit(this::observeManas);
sumX += f1.get();
sumHp += f2.get();
sumMana += f3.get();
}
The Dominion / Artemis version has to know a priori that the three
observers don't conflict, know to dispatch them in parallel, own
the thread-pool lifecycle, and do all of this over again every time
an observer is added or removed. japes knows all of that from
@Read / @Write / @Filter annotations and the scheduler's DAG
builder.
Valhalla delta¶
| benchmark | case | japes | japes-v | Δ |
|---|---|---|---|---|
| RealisticTick tick | 10k / st | 6.94 | 11.9 | — |
| RealisticTick tick | 10k / mt | — | 17.8 | — |
The Valhalla numbers are from the pre-SoA sweep and are not directly comparable to the new stock numbers. A fresh Valhalla sweep with SoA storage is pending. See the Valhalla page for the previous breakdown.