bevy icon indicating copy to clipboard operation
bevy copied to clipboard

Speed up `CommandQueue` by storing commands more densely

Open joseph-gio opened this issue 3 years ago • 5 comments

Objective

  • Speed up inserting and applying commands.
  • Halve the stack size of CommandQueue to 24 bytes.
  • Require fewer allocations.

Solution

Store commands and metadata densely within the same buffer. Each command takes up 1 usize of metadata, plus the bytes to store the command itself. Zero-sized types take up no space except for the metadata.

Benchmarks

All of the benchmarks related to commands.

Bench Time % Change p-value
empty_commands/0_entities 4.7780 ns -18.381% 0.00
spawn_commands/2000_entities 233.11 us -0.9961% 0.00
spawn_commands/4000_entities 448.38 us -3.1466% 0.00
spawn_commands/6000_entities 693.12 us -0.3978% 0.52
spawn_commands/8000_entities 889.48 us -2.8802% 0.00
insert_commands/insert 609.95 us -4.8604% 0.00
insert_commands/insert_batch 355.54 us -2.8165% 0.00
fake_commands/2000_commands 4.8018 us -17.802% 0.00
fake_commands/4000_commands 9.5969 us -17.337% 0.00
fake_commands/6000_commands 14.421 us -18.454% 0.00
fake_commands/8000_commands 19.192 us -18.261% 0.00
sized_commands_0_bytes/2000_commands 4.0593 us -4.7145% 0.00
sized_commands_0_bytes/4000_commands 8.1541 us -4.9470% 0.00
sized_commands_0_bytes/6000_commands 12.806 us -12.017% 0.00
sized_commands_0_bytes/8000_commands 17.096 us -14.070% 0.00
sized_commands_12_bytes/2000_commands 5.3425 us -27.632% 0.00
sized_commands_12_bytes/4000_commands 10.283 us -31.158% 0.00
sized_commands_12_bytes/6000_commands 15.339 us -31.418% 0.00
sized_commands_12_bytes/8000_commands 20.206 us -33.133% 0.00
sized_commands_512_bytes/2000_commands 99.118 us -9.9655% 0.00
sized_commands_512_bytes/4000_commands 201.96 us -8.8235% 0.00
sized_commands_512_bytes/6000_commands 300.95 us -9.2344% 0.00
sized_commands_512_bytes/8000_commands 404.69 us -8.4578% 0.00

joseph-gio avatar Oct 27 '22 22:10 joseph-gio

Impressive gains! I think these are worth the added complexity. This will need careful review due to the pointer arithmetic.

alice-i-cecile avatar Oct 27 '22 23:10 alice-i-cecile

Really cool!

Worth mentioning that this change does seem to go in a different direction compared to the "command batching" ideas that have been discussed on Discord recently.

One way to "batch" commands at application time (batch meaning minimize the number of table moves) without new allocations (using only existing structures) would require more fields on CommandMeta + using CommandMeta to random access the stored commands, but these changes make it infeasible to do anything but "drain and apply the queued commands."

maniwani avatar Oct 28 '22 03:10 maniwani

Nothing prohibits discord discussions into becoming PRs that may reverse this change. Until then, I think this is a nice, concrete and present progression forward.

CGMossa avatar Oct 28 '22 10:10 CGMossa

Spoke too soon. Discussed with @JoJoJet and we think that style of batching can still be done without random access, so no real conflict.

maniwani avatar Oct 28 '22 18:10 maniwani

Just did an extra pass to improve code quality -- mostly I just rewrote the safety comments. I also changed a type to OwningPtr, to encode its invariants in the type system instead of with comments.

joseph-gio avatar Nov 10 '22 16:11 joseph-gio

Marking as a draft since I need to investigate some regressions.

For posterity's sake, I'd like to document some other optimizations I have attempted for CommandQueue

  • https://github.com/JoJoJet/bevy/commits/command-alloc
    • Uses a version of the approach used in this PR, modified to add padding which ensure that both CommandMeta and stored commands are aligned.
  • https://github.com/JoJoJet/bevy/tree/command-queue-aligned-block
    • Based on the CommandQueue implementation on main. Adds padding between commands to ensure that they are aligned. Metadata is stored separately, and is thus always aligned.

In both of these approaches, stored commands are aligned unless they have a very strict alignment requirement. In this case, a compile-time check switches to using unaligned reads and writes, which allows for maximum flexibility at zero runtime cost. Unfortunately, I have not measured any performance improvements on either branch. I even saw some small regressions. This may be due to the padding bytes making the queue less memory efficient.

joseph-gio avatar Jan 06 '23 04:01 joseph-gio

I have rebased and rerun the benchmarks. % change is relative to main, as of today.

Bench Time % Change
empty_commands/0_entities 4.8304 ns -24.227%
spawn_commands/2000_entities 181.10 µs -2.3501%
spawn_commands/4000_entities 354.78 µs -3.7890%
spawn_commands/6000_entities 527.82 µs -2.4764%
spawn_commands/8000_entities 708.85 µs -5.4541%
insert_commands/insert 456.93 µs -3.1870%
insert_commands/insert_batch 325.57 µs -0.9230%
fake_commands/2000_commands 5.3598 µs -9.0157%
fake_commands/4000_commands 10.684 µs -10.223%
fake_commands/6000_commands 16.032 µs -10.243%
fake_commands/8000_commands 21.464 µs -9.5206%
sized_commands_0_bytes/2000_commands 3.2335 µs -37.436%
sized_commands_0_bytes/4000_commands 6.4670 µs -37.474%
sized_commands_0_bytes/6000_commands 10.352 µs -33.003%
sized_commands_0_bytes/8000_commands 13.856 µs -32.966%
sized_commands_12_bytes/2000_commands 4.8644 µs -33.427%
sized_commands_12_bytes/4000_commands 9.8333 µs -32.372%
sized_commands_12_bytes/6000_commands 14.754 µs -33.928%
sized_commands_12_bytes/8000_commands 19.743 µs -33.360%
sized_commands_512_bytes/2000_commands 59.321 µs +3.5436%
sized_commands_512_bytes/4000_commands 119.70 µs +2.7667%
sized_commands_512_bytes/8000_commands 243.87 µs +2.7726%

joseph-gio avatar Jan 26 '23 21:01 joseph-gio

bors r+

alice-i-cecile avatar Jan 27 '23 17:01 alice-i-cecile

Canceled.

bors[bot] avatar Jan 27 '23 17:01 bors[bot]