bevy icon indicating copy to clipboard operation
bevy copied to clipboard

Fix query transmute from table to archetype iteration unsoundness

Open SkiFire13 opened this issue 1 year ago • 4 comments

Objective

  • Fixes #14348
  • Fixes #14528
  • Less complex (but also likely less performant) alternative to #14611

Solution

  • Add a is_dense field flag to QueryIter indicating whether it is dense or not, that is whether it can perform dense iteration or not;
  • Check this flag any time iteration over a query is performed.

It would be nice if someone could try benching this change to see if it actually matters.

~Note that this not 100% ready for mergin, since there are a bunch of safety comments on the use of the various IS_DENSE for checks that still need to be updated.~ This is ready modulo benchmarks

SkiFire13 avatar Aug 04 '24 20:08 SkiFire13

I finally took the time to do a couple of benchmarks. My machine is pretty noisy, so don't expect them to be very accurate. It does show some small percent slowdown here and there, though they don't seem too bad, especially considering these are synthetic benchmarks.

group                                         main                                   pr
-----                                         ----                                   --
heavy_compute/base                            1.00    248.9±4.47µs        ? ?/sec    1.01    251.2±6.62µs        ? ?/sec
iter_fragmented(4096)_empty/foreach_sparse    1.00     12.6±0.33µs        ? ?/sec    1.00     12.6±0.22µs        ? ?/sec
iter_fragmented(4096)_empty/foreach_table     1.00      3.9±0.11µs        ? ?/sec    1.01      3.9±0.12µs        ? ?/sec
iter_fragmented/base                          1.00    430.1±7.49ns        ? ?/sec    1.00    428.1±4.77ns        ? ?/sec
iter_fragmented/foreach                       1.00    352.3±9.03ns        ? ?/sec    1.00    352.9±8.20ns        ? ?/sec
iter_fragmented/foreach_wide                  1.00      3.5±0.06µs        ? ?/sec    1.00      3.5±0.10µs        ? ?/sec
iter_fragmented/wide                          1.00      3.6±0.07µs        ? ?/sec    1.00      3.6±0.09µs        ? ?/sec
iter_fragmented_sparse/base                   1.00     11.7±0.16ns        ? ?/sec    1.00     11.6±0.17ns        ? ?/sec
iter_fragmented_sparse/foreach                1.00     11.7±0.23ns        ? ?/sec    1.00     11.6±0.28ns        ? ?/sec
iter_fragmented_sparse/foreach_wide           1.01     69.2±0.97ns        ? ?/sec    1.00     68.8±0.57ns        ? ?/sec
iter_fragmented_sparse/wide                   1.00     46.5±1.69ns        ? ?/sec    1.00     46.5±1.47ns        ? ?/sec
iter_simple/base                              1.00     10.0±0.14µs        ? ?/sec    1.00      9.9±0.13µs        ? ?/sec
iter_simple/foreach                           1.00      2.3±0.04µs        ? ?/sec    1.08      2.4±0.27µs        ? ?/sec
iter_simple/foreach_hybrid                    1.01     11.4±0.25µs        ? ?/sec    1.00     11.3±0.23µs        ? ?/sec
iter_simple/foreach_sparse_set                1.00     22.4±3.90µs        ? ?/sec    1.05     23.6±5.20µs        ? ?/sec
iter_simple/foreach_wide                      1.00     11.9±0.64µs        ? ?/sec    1.05     12.5±0.50µs        ? ?/sec
iter_simple/foreach_wide_sparse_set           1.00    113.8±1.59µs        ? ?/sec    1.00    114.1±2.20µs        ? ?/sec
iter_simple/sparse_set                        1.00     24.2±4.03µs        ? ?/sec    1.05     25.3±5.05µs        ? ?/sec
iter_simple/system                            1.01      9.9±0.18µs        ? ?/sec    1.00      9.9±0.13µs        ? ?/sec
iter_simple/wide                              1.00     45.7±0.63µs        ? ?/sec    1.00     45.5±0.86µs        ? ?/sec
iter_simple/wide_sparse_set                   1.01    121.1±2.03µs        ? ?/sec    1.00    120.4±3.74µs        ? ?/sec
par_iter_simple/with_0_fragment               1.00     35.2±3.12µs        ? ?/sec    1.01     35.7±3.74µs        ? ?/sec
par_iter_simple/with_1000_fragment            1.00     49.6±5.58µs        ? ?/sec    1.02     50.8±5.19µs        ? ?/sec
par_iter_simple/with_100_fragment             1.01     36.6±4.09µs        ? ?/sec    1.00     36.3±3.03µs        ? ?/sec
par_iter_simple/with_10_fragment              1.00     36.2±3.47µs        ? ?/sec    1.00     36.1±3.48µs        ? ?/sec

SkiFire13 avatar Aug 19 '24 16:08 SkiFire13

The for_each slowdowns are concerning. Might not be autovectorizing anymore.

hymm avatar Aug 19 '24 16:08 hymm

The for_each slowdowns are concerning. Might not be autovectorizing anymore.

10% more like a noise, if Auto-vectorization no longer works ,the loss may reach more than 50% ,I did some profiling ,it seems still works(windows).

maybe rust compiler already gets the ability to optimize this case.

here is my benchmark image

re0312 avatar Aug 19 '24 18:08 re0312

The for_each slowdowns are concerning. Might not be autovectorizing anymore.

I would expect them to be mainly noise since the changes don't really affect the inner loop over tables, which is the one that should be autovectorized. If there's any slowdown I would expect it to be for benchmarks that use plain for loops.

SkiFire13 avatar Aug 20 '24 11:08 SkiFire13