bevy icon indicating copy to clipboard operation
bevy copied to clipboard

Replace BlobVec's swap_scratch with a swap_nonoverlapping

Open james7132 opened this issue 3 years ago • 19 comments

Objective

BlobVec currently relies on a scratch piece of memory allocated at initialization to make a temporary copy of a component when using swap_remove_and_{forget/drop}. This is potentially suboptimal as it writes to a, well-known, but random part of memory instead of using the stack.

Solution

As the FIXME in the file states, replace swap_scratch with a call to swap_nonoverlapping::<u8>. The swapped last entry is returned as a OwnedPtr.

In theory, this should be faster as the temporary swap is allocated on the stack, swap_nonoverlapping allows for easier vectorization for bigger types, and the same memory is used between the swap and the returned OwnedPtr.

james7132 avatar May 27 '22 02:05 james7132

hmm I was imagining just using swap_nonoverlapping and checking if len == 1 then we dont have to store a fn ptr :thinking:

BoxyUwU avatar May 28 '22 13:05 BoxyUwU

swap_nonoverlapping needs a concrete type as generic parameter. As BlobVec is untyped, that still needs a function pointer.

bjorn3 avatar May 28 '22 13:05 bjorn3

we can use MaybeUninit<u8> and give it a length in bytes. Probably sus provenance wise but its no worse than what bevy is already doing

BoxyUwU avatar May 28 '22 13:05 BoxyUwU

Without something like stackalloc or unsized locals, runtime length determined and stack allocated buffers are going to be difficult. Unfortunately that crate adds a C job to the dependency tree, which might not be desirable.

james7132 avatar May 28 '22 16:05 james7132

We could simply allocate a [MaybeUninit<u8>; N] on the stack and copy over multiple iterations.

TheRawMeatball avatar May 29 '22 12:05 TheRawMeatball

What has the stack got to do with this, we can just use the memory in the blobvec cant we ?

BoxyUwU avatar May 29 '22 13:05 BoxyUwU

we can just use the memory in the blobvec cant we ?

That is what was already happening before this PR. The motivation of this PR is that using the stack may be slightly faster due to cache locality reasons. I have no clue if this would indeed be the case. I think benchmarking will be needed.

bjorn3 avatar May 29 '22 15:05 bjorn3

Before this PR the swap scratch was a separate allocation, wheras with this PR it's part of the same buffer as the rest of the blobvec. The part where the stack comes in is how to swap two different elements on the blobvec, and for that I think a simple swap_nonoverlapping with u8 type should work.

TheRawMeatball avatar May 29 '22 17:05 TheRawMeatball

Updated the PR to use std::ptr::swap_non_overlapping. Ran some benchmarks. Looks to be non-regressing for the affected benchmarks with a significant improvement for removing bigger components. This should also carry over to moving entities between tables for archetype moves as well.

group                                                    main                                     yeet-swap-scratch
-----                                                    ----                                     -----------------
add_remove_component/sparse_set                          1.01  1326.2±83.48µs        ? ?/sec      1.00  1310.9±75.15µs        ? ?/sec
add_remove_component/table                               1.05  1667.2±95.69µs        ? ?/sec      1.00  1586.3±14.45µs        ? ?/sec
add_remove_component_big/sparse_set                      1.03  1533.4±307.70µs        ? ?/sec     1.00  1493.4±376.44µs        ? ?/sec
add_remove_component_big/table                           1.24      3.5±0.45ms        ? ?/sec      1.00      2.8±0.21ms        ? ?/sec
fake_commands/2000_commands                              1.02      7.3±0.04µs        ? ?/sec      1.00      7.2±0.05µs        ? ?/sec
fake_commands/4000_commands                              1.00     14.3±0.15µs        ? ?/sec      1.02     14.5±0.17µs        ? ?/sec
fake_commands/6000_commands                              1.01     21.7±0.11µs        ? ?/sec      1.00     21.5±0.11µs        ? ?/sec
fake_commands/8000_commands                              1.01     29.1±0.15µs        ? ?/sec      1.00     28.9±0.14µs        ? ?/sec
get_or_spawn/batched                                     1.00   408.2±20.63µs        ? ?/sec      1.02   415.3±46.97µs        ? ?/sec
get_or_spawn/individual                                  1.03   931.3±81.29µs        ? ?/sec      1.00   904.5±91.18µs        ? ?/sec
insert_commands/insert                                   1.02   791.9±34.36µs        ? ?/sec      1.00   780.2±34.76µs        ? ?/sec
insert_commands/insert_batch                             1.00   403.2±48.73µs        ? ?/sec      1.03   413.4±45.14µs        ? ?/sec
simple_insert/base                                       1.00  616.8±107.65µs        ? ?/sec      1.02  628.0±108.75µs        ? ?/sec
simple_insert/unbatched                                  1.03  1367.2±80.74µs        ? ?/sec      1.00  1321.9±65.69µs        ? ?/sec
sized_commands_0_bytes/2000_commands                     1.00      4.5±0.04µs        ? ?/sec      1.00      4.5±0.07µs        ? ?/sec
sized_commands_0_bytes/4000_commands                     1.00      9.0±0.03µs        ? ?/sec      1.00      9.1±0.04µs        ? ?/sec
sized_commands_0_bytes/6000_commands                     1.01     13.6±0.11µs        ? ?/sec      1.00     13.5±0.08µs        ? ?/sec
sized_commands_0_bytes/8000_commands                     1.00     18.2±0.22µs        ? ?/sec      1.00     18.2±0.48µs        ? ?/sec
sized_commands_12_bytes/2000_commands                    1.04      7.3±0.03µs        ? ?/sec      1.00      7.0±0.19µs        ? ?/sec
sized_commands_12_bytes/4000_commands                    1.04     14.5±0.08µs        ? ?/sec      1.00     13.9±0.52µs        ? ?/sec
sized_commands_12_bytes/6000_commands                    1.04     21.7±0.09µs        ? ?/sec      1.00     20.9±0.10µs        ? ?/sec
sized_commands_12_bytes/8000_commands                    1.05     29.0±0.25µs        ? ?/sec      1.00     27.6±0.23µs        ? ?/sec
sized_commands_512_bytes/2000_commands                   1.01    164.4±5.20µs        ? ?/sec      1.00    162.6±7.88µs        ? ?/sec
sized_commands_512_bytes/4000_commands                   1.02   335.7±22.85µs        ? ?/sec      1.00   328.9±20.53µs        ? ?/sec
sized_commands_512_bytes/6000_commands                   1.01   515.2±60.79µs        ? ?/sec      1.00   512.1±74.67µs        ? ?/sec
sized_commands_512_bytes/8000_commands                   1.02   682.4±73.22µs        ? ?/sec      1.00   670.4±63.50µs        ? ?/sec
spawn_commands/2000_entities                             1.01   269.6±26.87µs        ? ?/sec      1.00   267.2±25.52µs        ? ?/sec
spawn_commands/4000_entities                             1.03   535.3±46.75µs        ? ?/sec      1.00   521.6±28.16µs        ? ?/sec
spawn_commands/6000_entities                             1.02   782.7±70.84µs        ? ?/sec      1.00  766.4±126.23µs        ? ?/sec
spawn_commands/8000_entities                             1.03  1020.1±109.07µs        ? ?/sec     1.00   992.9±51.14µs        ? ?/sec

james7132 avatar Jun 01 '22 11:06 james7132

On a deeper inspection, it seems like we're doing one extra copy on BlobVec moves:

  • Initial swap_remove_and_forget_unchecked swaps with the last element int the BlobVec
  • BlobVec::push then copies that last element into the target.

Ideally, we could just allocate in the target BlobVec first, get a OwningPtr targeting that element, and pass that pointer into swap_remove as a copy destination instead, eliminating the swap altogether.

james7132 avatar Jun 03 '22 08:06 james7132

bors try

james7132 avatar Jun 04 '22 13:06 james7132

try

Build failed:

bors[bot] avatar Jun 04 '22 13:06 bors[bot]

I really don't want us to be adding miri exceptions this deep in the stack. Marking this as blocked on https://github.com/rust-lang/miri/issues/2181

james7132 avatar Jun 04 '22 14:06 james7132

Caved in: disabling the pointer based types in miiri tests for now via #[cfg(not(miri))], until the upstream issue is fixed with miri itself, as this issue seems to be pretty deep and likely might take a while to resolve.

james7132 avatar Jun 06 '22 07:06 james7132

Nevermind, this seems to impact pretty much every test moving data into or out of BlobVec, which involves pretty much every add/remove component test. Going to have to shelve this until that issue is resolved at least in nightly.

james7132 avatar Jun 06 '22 07:06 james7132

bors try

james7132 avatar Jun 11 '22 22:06 james7132

try

Build failed:

bors[bot] avatar Jun 11 '22 22:06 bors[bot]

Revived this since there was an interest in optimizing the commands used in rendering. The components added/removed there can be quite big (140+ bytes), which lines up with some of the add/remove component benchmarks that saw a notable improvement here. Profiled this with tracy on many_cubes and saw a ~100us improvement in Prepare stage times and it directly corresponds to the improvement in commands perf.

This becomes a non-issue if #5037 is merged, as there are no heavy commands in that stage after that, but this is still a significant perf win we shouldn't ignore.

Stage timings:

image

Command timings:

image

james7132 avatar Jun 18 '22 02:06 james7132

rust-lang/rust#104054 should fix the miri issue. Once this is ready in nightly, this PR should be mergable.

james7132 avatar Nov 14 '22 11:11 james7132

bors r+

cart avatar Nov 16 '22 20:11 cart