Fix query transmute from table to archetype iteration unsoundness
Objective
- Fixes #14348
- Fixes #14528
- Less complex (but also likely less performant) alternative to #14611
Solution
- Add a
is_densefield flag toQueryIterindicating whether it is dense or not, that is whether it can perform dense iteration or not; - Check this flag any time iteration over a query is performed.
It would be nice if someone could try benching this change to see if it actually matters.
~Note that this not 100% ready for mergin, since there are a bunch of safety comments on the use of the various IS_DENSE for checks that still need to be updated.~ This is ready modulo benchmarks
I finally took the time to do a couple of benchmarks. My machine is pretty noisy, so don't expect them to be very accurate. It does show some small percent slowdown here and there, though they don't seem too bad, especially considering these are synthetic benchmarks.
group main pr
----- ---- --
heavy_compute/base 1.00 248.9±4.47µs ? ?/sec 1.01 251.2±6.62µs ? ?/sec
iter_fragmented(4096)_empty/foreach_sparse 1.00 12.6±0.33µs ? ?/sec 1.00 12.6±0.22µs ? ?/sec
iter_fragmented(4096)_empty/foreach_table 1.00 3.9±0.11µs ? ?/sec 1.01 3.9±0.12µs ? ?/sec
iter_fragmented/base 1.00 430.1±7.49ns ? ?/sec 1.00 428.1±4.77ns ? ?/sec
iter_fragmented/foreach 1.00 352.3±9.03ns ? ?/sec 1.00 352.9±8.20ns ? ?/sec
iter_fragmented/foreach_wide 1.00 3.5±0.06µs ? ?/sec 1.00 3.5±0.10µs ? ?/sec
iter_fragmented/wide 1.00 3.6±0.07µs ? ?/sec 1.00 3.6±0.09µs ? ?/sec
iter_fragmented_sparse/base 1.00 11.7±0.16ns ? ?/sec 1.00 11.6±0.17ns ? ?/sec
iter_fragmented_sparse/foreach 1.00 11.7±0.23ns ? ?/sec 1.00 11.6±0.28ns ? ?/sec
iter_fragmented_sparse/foreach_wide 1.01 69.2±0.97ns ? ?/sec 1.00 68.8±0.57ns ? ?/sec
iter_fragmented_sparse/wide 1.00 46.5±1.69ns ? ?/sec 1.00 46.5±1.47ns ? ?/sec
iter_simple/base 1.00 10.0±0.14µs ? ?/sec 1.00 9.9±0.13µs ? ?/sec
iter_simple/foreach 1.00 2.3±0.04µs ? ?/sec 1.08 2.4±0.27µs ? ?/sec
iter_simple/foreach_hybrid 1.01 11.4±0.25µs ? ?/sec 1.00 11.3±0.23µs ? ?/sec
iter_simple/foreach_sparse_set 1.00 22.4±3.90µs ? ?/sec 1.05 23.6±5.20µs ? ?/sec
iter_simple/foreach_wide 1.00 11.9±0.64µs ? ?/sec 1.05 12.5±0.50µs ? ?/sec
iter_simple/foreach_wide_sparse_set 1.00 113.8±1.59µs ? ?/sec 1.00 114.1±2.20µs ? ?/sec
iter_simple/sparse_set 1.00 24.2±4.03µs ? ?/sec 1.05 25.3±5.05µs ? ?/sec
iter_simple/system 1.01 9.9±0.18µs ? ?/sec 1.00 9.9±0.13µs ? ?/sec
iter_simple/wide 1.00 45.7±0.63µs ? ?/sec 1.00 45.5±0.86µs ? ?/sec
iter_simple/wide_sparse_set 1.01 121.1±2.03µs ? ?/sec 1.00 120.4±3.74µs ? ?/sec
par_iter_simple/with_0_fragment 1.00 35.2±3.12µs ? ?/sec 1.01 35.7±3.74µs ? ?/sec
par_iter_simple/with_1000_fragment 1.00 49.6±5.58µs ? ?/sec 1.02 50.8±5.19µs ? ?/sec
par_iter_simple/with_100_fragment 1.01 36.6±4.09µs ? ?/sec 1.00 36.3±3.03µs ? ?/sec
par_iter_simple/with_10_fragment 1.00 36.2±3.47µs ? ?/sec 1.00 36.1±3.48µs ? ?/sec
The for_each slowdowns are concerning. Might not be autovectorizing anymore.
The for_each slowdowns are concerning. Might not be autovectorizing anymore.
10% more like a noise, if Auto-vectorization no longer works ,the loss may reach more than 50% ,I did some profiling ,it seems still works(windows).
maybe rust compiler already gets the ability to optimize this case.
here is my benchmark
The for_each slowdowns are concerning. Might not be autovectorizing anymore.
I would expect them to be mainly noise since the changes don't really affect the inner loop over tables, which is the one that should be autovectorized. If there's any slowdown I would expect it to be for benchmarks that use plain for loops.