Objective

Speed up inserting and applying commands.
Halve the stack size of CommandQueue to 24 bytes.
Require fewer allocations.

Solution

Store commands and metadata densely within the same buffer. Each command takes up 1 usize of metadata, plus the bytes to store the command itself. Zero-sized types take up no space except for the metadata.

Benchmarks

All of the benchmarks related to commands.

Bench	Time	% Change	p-value
empty_commands/0_entities	4.7780 ns	-18.381%	0.00
spawn_commands/2000_entities	233.11 us	-0.9961%	0.00
spawn_commands/4000_entities	448.38 us	-3.1466%	0.00
spawn_commands/6000_entities	693.12 us	-0.3978%	0.52
spawn_commands/8000_entities	889.48 us	-2.8802%	0.00
insert_commands/insert	609.95 us	-4.8604%	0.00
insert_commands/insert_batch	355.54 us	-2.8165%	0.00
fake_commands/2000_commands	4.8018 us	-17.802%	0.00
fake_commands/4000_commands	9.5969 us	-17.337%	0.00
fake_commands/6000_commands	14.421 us	-18.454%	0.00
fake_commands/8000_commands	19.192 us	-18.261%	0.00
sized_commands_0_bytes/2000_commands	4.0593 us	-4.7145%	0.00
sized_commands_0_bytes/4000_commands	8.1541 us	-4.9470%	0.00
sized_commands_0_bytes/6000_commands	12.806 us	-12.017%	0.00
sized_commands_0_bytes/8000_commands	17.096 us	-14.070%	0.00
sized_commands_12_bytes/2000_commands	5.3425 us	-27.632%	0.00
sized_commands_12_bytes/4000_commands	10.283 us	-31.158%	0.00
sized_commands_12_bytes/6000_commands	15.339 us	-31.418%	0.00
sized_commands_12_bytes/8000_commands	20.206 us	-33.133%	0.00
sized_commands_512_bytes/2000_commands	99.118 us	-9.9655%	0.00
sized_commands_512_bytes/4000_commands	201.96 us	-8.8235%	0.00
sized_commands_512_bytes/6000_commands	300.95 us	-9.2344%	0.00
sized_commands_512_bytes/8000_commands	404.69 us	-8.4578%	0.00

Oct 27 '22 22:10 joseph-gio

Impressive gains! I think these are worth the added complexity. This will need careful review due to the pointer arithmetic.

Oct 27 '22 23:10 alice-i-cecile

Really cool!

Worth mentioning that this change does seem to go in a different direction compared to the "command batching" ideas that have been discussed on Discord recently.

One way to "batch" commands at application time (batch meaning minimize the number of table moves) without new allocations (using only existing structures) would require more fields on CommandMeta + using CommandMeta to random access the stored commands, but these changes make it infeasible to do anything but "drain and apply the queued commands."

Oct 28 '22 03:10 maniwani

Nothing prohibits discord discussions into becoming PRs that may reverse this change. Until then, I think this is a nice, concrete and present progression forward.

Oct 28 '22 10:10 CGMossa

Spoke too soon. Discussed with @JoJoJet and we think that style of batching can still be done without random access, so no real conflict.

Oct 28 '22 18:10 maniwani

Just did an extra pass to improve code quality -- mostly I just rewrote the safety comments. I also changed a type to OwningPtr, to encode its invariants in the type system instead of with comments.

Nov 10 '22 16:11 joseph-gio

Marking as a draft since I need to investigate some regressions.

For posterity's sake, I'd like to document some other optimizations I have attempted for CommandQueue

https://github.com/JoJoJet/bevy/commits/command-alloc
- Uses a version of the approach used in this PR, modified to add padding which ensure that both CommandMeta and stored commands are aligned.
https://github.com/JoJoJet/bevy/tree/command-queue-aligned-block
- Based on the CommandQueue implementation on main. Adds padding between commands to ensure that they are aligned. Metadata is stored separately, and is thus always aligned.

In both of these approaches, stored commands are aligned unless they have a very strict alignment requirement. In this case, a compile-time check switches to using unaligned reads and writes, which allows for maximum flexibility at zero runtime cost. Unfortunately, I have not measured any performance improvements on either branch. I even saw some small regressions. This may be due to the padding bytes making the queue less memory efficient.

Jan 06 '23 04:01 joseph-gio

I have rebased and rerun the benchmarks. % change is relative to main, as of today.

Bench	Time	% Change
empty_commands/0_entities	4.8304 ns	-24.227%
spawn_commands/2000_entities	181.10 µs	-2.3501%
spawn_commands/4000_entities	354.78 µs	-3.7890%
spawn_commands/6000_entities	527.82 µs	-2.4764%
spawn_commands/8000_entities	708.85 µs	-5.4541%
insert_commands/insert	456.93 µs	-3.1870%
insert_commands/insert_batch	325.57 µs	-0.9230%
fake_commands/2000_commands	5.3598 µs	-9.0157%
fake_commands/4000_commands	10.684 µs	-10.223%
fake_commands/6000_commands	16.032 µs	-10.243%
fake_commands/8000_commands	21.464 µs	-9.5206%
sized_commands_0_bytes/2000_commands	3.2335 µs	-37.436%
sized_commands_0_bytes/4000_commands	6.4670 µs	-37.474%
sized_commands_0_bytes/6000_commands	10.352 µs	-33.003%
sized_commands_0_bytes/8000_commands	13.856 µs	-32.966%
sized_commands_12_bytes/2000_commands	4.8644 µs	-33.427%
sized_commands_12_bytes/4000_commands	9.8333 µs	-32.372%
sized_commands_12_bytes/6000_commands	14.754 µs	-33.928%
sized_commands_12_bytes/8000_commands	19.743 µs	-33.360%
sized_commands_512_bytes/2000_commands	59.321 µs	+3.5436%
sized_commands_512_bytes/4000_commands	119.70 µs	+2.7667%
sized_commands_512_bytes/8000_commands	243.87 µs	+2.7726%

Jan 26 '23 21:01 joseph-gio

bors r+

Jan 27 '23 17:01 alice-i-cecile

Canceled.

Jan 27 '23 17:01 bors[bot]

Pull request successfully merged into main.

Build succeeded:

Jan 28 '23 01:01 bors[bot]

Speed up `CommandQueue` by storing commands more densely

Objective

Solution

Benchmarks