Replace BlobVec's swap_scratch with a swap_nonoverlapping
Objective
BlobVec currently relies on a scratch piece of memory, allocated at initialization, to make a temporary copy of a component when using swap_remove_and_{forget/drop}. This is potentially suboptimal, as it writes to a fixed but otherwise arbitrary heap allocation instead of using the stack.
Solution
As the FIXME in the file states, replace swap_scratch with a call to swap_nonoverlapping::<u8>. The swapped last entry is returned as an OwningPtr.
In theory, this should be faster: the temporary swap lives on the stack, swap_nonoverlapping allows for easier vectorization for bigger types, and the same memory is used between the swap and the returned OwningPtr.
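For illustration, a minimal sketch of the untyped swap-remove idea (not the actual bevy_ecs code; the function name and signature here are made up):

```rust
use std::alloc::Layout;
use std::ptr;

/// Sketch only: swap the removed element with the last one byte-wise,
/// shrink the length, and return a pointer to the removed value, which now
/// sits just past the new length (the caller takes ownership of it, e.g. via OwningPtr).
unsafe fn swap_remove_and_forget_sketch(
    data: *mut u8,       // base pointer of the dense buffer
    item_layout: Layout, // layout of a single element
    index: usize,        // index of the element being removed
    len: &mut usize,     // current length of the vec
) -> *mut u8 {
    debug_assert!(index < *len);
    let size = item_layout.size();
    let last = *len - 1;
    if index != last {
        // A byte-wise swap works without knowing the concrete component type.
        ptr::swap_nonoverlapping::<u8>(data.add(index * size), data.add(last * size), size);
    }
    *len = last;
    data.add(last * size)
}
```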
Hmm, I was imagining just using swap_nonoverlapping and checking if len == 1, then we don't have to store a fn ptr :thinking:
swap_nonoverlapping needs a concrete type as a generic parameter. As BlobVec is untyped, that still needs a function pointer.
We can use MaybeUninit<u8> and give it a length in bytes. Probably sus provenance-wise, but it's no worse than what Bevy is already doing.
Without something like stackalloc or unsized locals, stack-allocated buffers with a runtime-determined length are going to be difficult. Unfortunately, that crate adds a C build job to the dependency tree, which might not be desirable.
We could simply allocate a [MaybeUninit<u8>; N] on the stack and copy over multiple iterations.
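For what it's worth, that fixed-size-chunk approach might look roughly like this (a sketch only; the function name and chunk size are arbitrary, and it is essentially what std::ptr::swap_nonoverlapping already does internally):

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Sketch: swap `size` bytes between `a` and `b` by bouncing them through a
/// small on-stack scratch buffer, one chunk at a time.
unsafe fn swap_via_stack_chunks(a: *mut u8, b: *mut u8, size: usize) {
    const CHUNK: usize = 64;
    let mut scratch = [MaybeUninit::<u8>::uninit(); CHUNK];
    let mut offset = 0;
    while offset < size {
        let n = CHUNK.min(size - offset);
        let pa = a.add(offset);
        let pb = b.add(offset);
        // a -> scratch, b -> a, scratch -> b
        ptr::copy_nonoverlapping(pa, scratch.as_mut_ptr() as *mut u8, n);
        ptr::copy_nonoverlapping(pb, pa, n);
        ptr::copy_nonoverlapping(scratch.as_ptr() as *const u8, pb, n);
        offset += n;
    }
}
```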
What has the stack got to do with this? We can just use the memory in the BlobVec, can't we?
That is what was already happening before this PR. The motivation of this PR is that using the stack may be slightly faster due to cache locality reasons. I have no clue if this would indeed be the case. I think benchmarking will be needed.
Before this PR the swap scratch was a separate allocation, whereas with this PR it's part of the same buffer as the rest of the BlobVec. The part where the stack comes in is how to swap two different elements in the BlobVec, and for that I think a simple swap_nonoverlapping with the u8 type should work.
Updated the PR to use std::ptr::swap_nonoverlapping. Ran some benchmarks: looks to be non-regressing for the affected benchmarks, with a significant improvement for removing bigger components. This should also carry over to moving entities between tables for archetype moves.
group main yeet-swap-scratch
----- ---- -----------------
add_remove_component/sparse_set 1.01 1326.2±83.48µs ? ?/sec 1.00 1310.9±75.15µs ? ?/sec
add_remove_component/table 1.05 1667.2±95.69µs ? ?/sec 1.00 1586.3±14.45µs ? ?/sec
add_remove_component_big/sparse_set 1.03 1533.4±307.70µs ? ?/sec 1.00 1493.4±376.44µs ? ?/sec
add_remove_component_big/table 1.24 3.5±0.45ms ? ?/sec 1.00 2.8±0.21ms ? ?/sec
fake_commands/2000_commands 1.02 7.3±0.04µs ? ?/sec 1.00 7.2±0.05µs ? ?/sec
fake_commands/4000_commands 1.00 14.3±0.15µs ? ?/sec 1.02 14.5±0.17µs ? ?/sec
fake_commands/6000_commands 1.01 21.7±0.11µs ? ?/sec 1.00 21.5±0.11µs ? ?/sec
fake_commands/8000_commands 1.01 29.1±0.15µs ? ?/sec 1.00 28.9±0.14µs ? ?/sec
get_or_spawn/batched 1.00 408.2±20.63µs ? ?/sec 1.02 415.3±46.97µs ? ?/sec
get_or_spawn/individual 1.03 931.3±81.29µs ? ?/sec 1.00 904.5±91.18µs ? ?/sec
insert_commands/insert 1.02 791.9±34.36µs ? ?/sec 1.00 780.2±34.76µs ? ?/sec
insert_commands/insert_batch 1.00 403.2±48.73µs ? ?/sec 1.03 413.4±45.14µs ? ?/sec
simple_insert/base 1.00 616.8±107.65µs ? ?/sec 1.02 628.0±108.75µs ? ?/sec
simple_insert/unbatched 1.03 1367.2±80.74µs ? ?/sec 1.00 1321.9±65.69µs ? ?/sec
sized_commands_0_bytes/2000_commands 1.00 4.5±0.04µs ? ?/sec 1.00 4.5±0.07µs ? ?/sec
sized_commands_0_bytes/4000_commands 1.00 9.0±0.03µs ? ?/sec 1.00 9.1±0.04µs ? ?/sec
sized_commands_0_bytes/6000_commands 1.01 13.6±0.11µs ? ?/sec 1.00 13.5±0.08µs ? ?/sec
sized_commands_0_bytes/8000_commands 1.00 18.2±0.22µs ? ?/sec 1.00 18.2±0.48µs ? ?/sec
sized_commands_12_bytes/2000_commands 1.04 7.3±0.03µs ? ?/sec 1.00 7.0±0.19µs ? ?/sec
sized_commands_12_bytes/4000_commands 1.04 14.5±0.08µs ? ?/sec 1.00 13.9±0.52µs ? ?/sec
sized_commands_12_bytes/6000_commands 1.04 21.7±0.09µs ? ?/sec 1.00 20.9±0.10µs ? ?/sec
sized_commands_12_bytes/8000_commands 1.05 29.0±0.25µs ? ?/sec 1.00 27.6±0.23µs ? ?/sec
sized_commands_512_bytes/2000_commands 1.01 164.4±5.20µs ? ?/sec 1.00 162.6±7.88µs ? ?/sec
sized_commands_512_bytes/4000_commands 1.02 335.7±22.85µs ? ?/sec 1.00 328.9±20.53µs ? ?/sec
sized_commands_512_bytes/6000_commands 1.01 515.2±60.79µs ? ?/sec 1.00 512.1±74.67µs ? ?/sec
sized_commands_512_bytes/8000_commands 1.02 682.4±73.22µs ? ?/sec 1.00 670.4±63.50µs ? ?/sec
spawn_commands/2000_entities 1.01 269.6±26.87µs ? ?/sec 1.00 267.2±25.52µs ? ?/sec
spawn_commands/4000_entities 1.03 535.3±46.75µs ? ?/sec 1.00 521.6±28.16µs ? ?/sec
spawn_commands/6000_entities 1.02 782.7±70.84µs ? ?/sec 1.00 766.4±126.23µs ? ?/sec
spawn_commands/8000_entities 1.03 1020.1±109.07µs ? ?/sec 1.00 992.9±51.14µs ? ?/sec
On deeper inspection, it seems like we're doing one extra copy on BlobVec moves:
- Initial: swap_remove_and_forget_unchecked swaps the removed element with the last element in the BlobVec; BlobVec::push then copies that last element into the target.
Ideally, we could just allocate in the target BlobVec first, get an OwningPtr targeting that element, and pass that pointer into swap_remove as a copy destination instead, eliminating the swap altogether.
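A sketch of what that single-copy flow could look like at the raw-pointer level (illustrative names only, not an existing bevy_ecs API):

```rust
use std::ptr;

/// Sketch: instead of swapping inside the source buffer and then pushing into
/// the target, copy the removed element straight into a pre-reserved slot in
/// the target and backfill the hole with the source's last element.
unsafe fn swap_remove_into(
    src: *mut u8,      // base pointer of the source buffer
    size: usize,       // size of one element in bytes
    index: usize,      // index of the element being removed
    last: usize,       // index of the last element in the source
    dst_slot: *mut u8, // uninitialized destination slot in the target buffer
) {
    // Move the removed element directly to its final destination (one copy).
    ptr::copy_nonoverlapping(src.add(index * size), dst_slot, size);
    // Backfill the hole with the last element; no temporary copy needed.
    if index != last {
        ptr::copy_nonoverlapping(src.add(last * size), src.add(index * size), size);
    }
}
```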
bors try
I really don't want us to be adding miri exceptions this deep in the stack. Marking this as blocked on https://github.com/rust-lang/miri/issues/2181
Caved in: disabling the pointer-based type tests under Miri for now via #[cfg(not(miri))], until the upstream issue is fixed in Miri itself, as this issue seems to be pretty deep and will likely take a while to resolve.
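For reference, the gating looks like this (test name purely illustrative):

```rust
// Skip this test when running under Miri until the upstream issue is fixed.
#[cfg(not(miri))]
#[test]
fn blob_vec_swap_remove() {
    // ...
}
```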
Nevermind, this seems to impact pretty much every test moving data into or out of BlobVec, which involves pretty much every add/remove component test. Going to have to shelve this until that issue is resolved at least in nightly.
bors try
Revived this since there was interest in optimizing the commands used in rendering. The components added/removed there can be quite big (140+ bytes), which lines up with the add/remove component benchmarks that saw a notable improvement here. Profiled this with Tracy on many_cubes and saw a ~100µs improvement in Prepare stage times, which directly corresponds to the improvement in commands perf.
This becomes a non-issue if #5037 is merged, as there are no heavy commands in that stage after that, but this is still a significant perf win we shouldn't ignore.
Stage timings:

Command timings:

rust-lang/rust#104054 should fix the Miri issue. Once this is ready in nightly, this PR should be mergeable.
bors r+
Pull request successfully merged into main.
Build succeeded:
- build-and-install-on-iOS
- build-android
- build (macos-latest)
- build (ubuntu-latest)
- build-wasm
- build (windows-latest)
- build-without-default-features (bevy)
- build-without-default-features (bevy_ecs)
- build-without-default-features (bevy_reflect)
- check-compiles
- check-doc
- check-missing-examples-in-docs
- ci
- markdownlint
- run-examples
- run-examples-on-wasm
- run-examples-on-windows-dx12