Speed up `CommandQueue` by storing commands more densely
Objective
- Speed up inserting and applying commands.
- Halve the stack size of
CommandQueueto 24 bytes. - Require fewer allocations.
Solution
Store commands and metadata densely within the same buffer. Each command takes up 1 usize of metadata, plus the bytes to store the command itself. Zero-sized types take up no space except for the metadata.
Benchmarks
All of the benchmarks related to commands.
| Bench | Time | % Change | p-value |
|---|---|---|---|
| empty_commands/0_entities | 4.7780 ns | -18.381% | 0.00 |
| spawn_commands/2000_entities | 233.11 us | -0.9961% | 0.00 |
| spawn_commands/4000_entities | 448.38 us | -3.1466% | 0.00 |
| spawn_commands/6000_entities | 693.12 us | -0.3978% | 0.52 |
| spawn_commands/8000_entities | 889.48 us | -2.8802% | 0.00 |
| insert_commands/insert | 609.95 us | -4.8604% | 0.00 |
| insert_commands/insert_batch | 355.54 us | -2.8165% | 0.00 |
| fake_commands/2000_commands | 4.8018 us | -17.802% | 0.00 |
| fake_commands/4000_commands | 9.5969 us | -17.337% | 0.00 |
| fake_commands/6000_commands | 14.421 us | -18.454% | 0.00 |
| fake_commands/8000_commands | 19.192 us | -18.261% | 0.00 |
| sized_commands_0_bytes/2000_commands | 4.0593 us | -4.7145% | 0.00 |
| sized_commands_0_bytes/4000_commands | 8.1541 us | -4.9470% | 0.00 |
| sized_commands_0_bytes/6000_commands | 12.806 us | -12.017% | 0.00 |
| sized_commands_0_bytes/8000_commands | 17.096 us | -14.070% | 0.00 |
| sized_commands_12_bytes/2000_commands | 5.3425 us | -27.632% | 0.00 |
| sized_commands_12_bytes/4000_commands | 10.283 us | -31.158% | 0.00 |
| sized_commands_12_bytes/6000_commands | 15.339 us | -31.418% | 0.00 |
| sized_commands_12_bytes/8000_commands | 20.206 us | -33.133% | 0.00 |
| sized_commands_512_bytes/2000_commands | 99.118 us | -9.9655% | 0.00 |
| sized_commands_512_bytes/4000_commands | 201.96 us | -8.8235% | 0.00 |
| sized_commands_512_bytes/6000_commands | 300.95 us | -9.2344% | 0.00 |
| sized_commands_512_bytes/8000_commands | 404.69 us | -8.4578% | 0.00 |
Impressive gains! I think these are worth the added complexity. This will need careful review due to the pointer arithmetic.
Really cool!
Worth mentioning that this change does seem to go in a different direction compared to the "command batching" ideas that have been discussed on Discord recently.
One way to "batch" commands at application time (batch meaning minimize the number of table moves) without new allocations (using only existing structures) would require more fields on CommandMeta + using CommandMeta to random access the stored commands, but these changes make it infeasible to do anything but "drain and apply the queued commands."
Nothing prohibits discord discussions into becoming PRs that may reverse this change. Until then, I think this is a nice, concrete and present progression forward.
Spoke too soon. Discussed with @JoJoJet and we think that style of batching can still be done without random access, so no real conflict.
Just did an extra pass to improve code quality -- mostly I just rewrote the safety comments. I also changed a type to OwningPtr, to encode its invariants in the type system instead of with comments.
Marking as a draft since I need to investigate some regressions.
For posterity's sake, I'd like to document some other optimizations I have attempted for CommandQueue
- https://github.com/JoJoJet/bevy/commits/command-alloc
- Uses a version of the approach used in this PR, modified to add padding which ensure that both
CommandMetaand stored commands are aligned.
- Uses a version of the approach used in this PR, modified to add padding which ensure that both
- https://github.com/JoJoJet/bevy/tree/command-queue-aligned-block
- Based on the
CommandQueueimplementation on main. Adds padding between commands to ensure that they are aligned. Metadata is stored separately, and is thus always aligned.
- Based on the
In both of these approaches, stored commands are aligned unless they have a very strict alignment requirement. In this case, a compile-time check switches to using unaligned reads and writes, which allows for maximum flexibility at zero runtime cost. Unfortunately, I have not measured any performance improvements on either branch. I even saw some small regressions. This may be due to the padding bytes making the queue less memory efficient.
I have rebased and rerun the benchmarks. % change is relative to main, as of today.
| Bench | Time | % Change |
|---|---|---|
| empty_commands/0_entities | 4.8304 ns | -24.227% |
| spawn_commands/2000_entities | 181.10 µs | -2.3501% |
| spawn_commands/4000_entities | 354.78 µs | -3.7890% |
| spawn_commands/6000_entities | 527.82 µs | -2.4764% |
| spawn_commands/8000_entities | 708.85 µs | -5.4541% |
| insert_commands/insert | 456.93 µs | -3.1870% |
| insert_commands/insert_batch | 325.57 µs | -0.9230% |
| fake_commands/2000_commands | 5.3598 µs | -9.0157% |
| fake_commands/4000_commands | 10.684 µs | -10.223% |
| fake_commands/6000_commands | 16.032 µs | -10.243% |
| fake_commands/8000_commands | 21.464 µs | -9.5206% |
| sized_commands_0_bytes/2000_commands | 3.2335 µs | -37.436% |
| sized_commands_0_bytes/4000_commands | 6.4670 µs | -37.474% |
| sized_commands_0_bytes/6000_commands | 10.352 µs | -33.003% |
| sized_commands_0_bytes/8000_commands | 13.856 µs | -32.966% |
| sized_commands_12_bytes/2000_commands | 4.8644 µs | -33.427% |
| sized_commands_12_bytes/4000_commands | 9.8333 µs | -32.372% |
| sized_commands_12_bytes/6000_commands | 14.754 µs | -33.928% |
| sized_commands_12_bytes/8000_commands | 19.743 µs | -33.360% |
| sized_commands_512_bytes/2000_commands | 59.321 µs | +3.5436% |
| sized_commands_512_bytes/4000_commands | 119.70 µs | +2.7667% |
| sized_commands_512_bytes/8000_commands | 243.87 µs | +2.7726% |
bors r+
Canceled.
Pull request successfully merged into main.
Build succeeded:
- build-and-install-on-iOS
- build-android
- build (macos-latest)
- build (ubuntu-latest)
- build-wasm
- build (windows-latest)
- build-without-default-features (bevy)
- build-without-default-features (bevy_ecs)
- build-without-default-features (bevy_reflect)
- check-compiles
- check-doc
- check-missing-examples-in-docs
- ci
- markdownlint
- msrv
- run-examples
- run-examples-on-wasm
- run-examples-on-windows-dx12