bevy icon indicating copy to clipboard operation
bevy copied to clipboard

Skip empty archetypes and tables when iterating over queries

Open james7132 opened this issue 3 years ago • 12 comments

Objective

Speed up queries that are fragmented over many empty archetypes and tables.

Solution

Add a early-out to check if the table or archetype is empty before iterating over it. This adds an extra branch for every archetype matched, but skips setting the archetype/table to the underlying state and any iteration over it.

This may not be worth it for the default Query::iter and maybe even the Query::for_each implementations, but this definitely avoids scheduling unnecessary tasks in the Query::par_for_each case.

Ideally, matched_archetypes should only contain archetypes where there's actually work to do, but this would add a O(n) flat cost to every call to update_archetypes that scales with the number of matched archetypes.

TODO: Benchmark

james7132 avatar May 11 '22 20:05 james7132

Code changes look fine to me. The proof, like always, is in the benchmark :) I'd like to see a new benchmark that looks at iteration performance vs the number of empty archetypes (0, 1k, 5k, 10k maybe?), to see where the break-even point is for this change.

alice-i-cecile avatar May 11 '22 20:05 alice-i-cecile

QueryIterationCursor needs to be updated

BoxyUwU avatar May 11 '22 20:05 BoxyUwU

QueryIterationCursor needs to be updated

Done.

james7132 avatar May 11 '22 20:05 james7132

Actual benchmark results here, and the differences here make plenty of sense. Tl;dr: it's really only beneficial to par_for_each and it's variants in the general case, otherwise the total number of empty archetypes aren't really realistic.

group                                       main                                    skip-empty
-----                                       ----                                    ----------
empty_archetypes/for_each/10                1.00      4.2±0.18µs        ? ?/sec     1.47      6.1±0.28µs        ? ?/sec
empty_archetypes/for_each/100               1.00      4.6±0.16µs        ? ?/sec     1.34      6.2±0.97µs        ? ?/sec
empty_archetypes/for_each/1000              1.00      5.6±0.12µs        ? ?/sec     1.19      6.6±0.13µs        ? ?/sec
empty_archetypes/for_each/10000             1.92     20.7±0.47µs        ? ?/sec     1.00     10.8±0.56µs        ? ?/sec
empty_archetypes/for_each/2000              1.18      8.8±0.40µs        ? ?/sec     1.00      7.5±0.55µs        ? ?/sec
empty_archetypes/for_each/500               1.00      5.0±0.19µs        ? ?/sec     1.27      6.3±0.14µs        ? ?/sec
empty_archetypes/for_each/5000              1.59     12.4±0.33µs        ? ?/sec     1.00      7.8±0.75µs        ? ?/sec
empty_archetypes/iter/10                    1.00      4.3±0.21µs        ? ?/sec     1.47      6.4±0.13µs        ? ?/sec
empty_archetypes/iter/100                   1.00      4.4±0.15µs        ? ?/sec     1.44      6.3±0.37µs        ? ?/sec
empty_archetypes/iter/1000                  1.00      5.6±0.18µs        ? ?/sec     1.19      6.6±0.07µs        ? ?/sec
empty_archetypes/iter/10000                 1.74     22.9±0.66µs        ? ?/sec     1.00     13.1±0.83µs        ? ?/sec
empty_archetypes/iter/2000                  1.22      9.1±0.40µs        ? ?/sec     1.00      7.5±0.09µs        ? ?/sec
empty_archetypes/iter/500                   1.00      5.4±0.25µs        ? ?/sec     1.20      6.5±0.13µs        ? ?/sec
empty_archetypes/iter/5000                  1.61     13.4±0.20µs        ? ?/sec     1.00      8.4±0.71µs        ? ?/sec
empty_archetypes/par_for_each/10            1.37      7.7±0.39µs        ? ?/sec     1.00      5.6±0.45µs        ? ?/sec
empty_archetypes/par_for_each/100           1.31      7.7±0.31µs        ? ?/sec     1.00      5.8±0.36µs        ? ?/sec
empty_archetypes/par_for_each/1000          1.29      8.0±0.38µs        ? ?/sec     1.00      6.2±0.71µs        ? ?/sec
empty_archetypes/par_for_each/10000         1.11     15.2±0.47µs        ? ?/sec     1.00     13.7±0.96µs        ? ?/sec
empty_archetypes/par_for_each/2000          1.38      8.5±0.37µs        ? ?/sec     1.00      6.2±0.38µs        ? ?/sec
empty_archetypes/par_for_each/500           1.39      8.4±0.33µs        ? ?/sec     1.00      6.0±0.31µs        ? ?/sec
empty_archetypes/par_for_each/5000          1.25     10.8±0.33µs        ? ?/sec     1.00      8.6±0.48µs        ? ?/sec

james7132 avatar May 12 '22 03:05 james7132

Solid work benchmarking. Those numbers make sense to me. Can we do a more targeted change here then?

alice-i-cecile avatar May 12 '22 03:05 alice-i-cecile

There were a few errors with how the prior benchmarks were run. There were 10, 100, 1000, 2000, 5000, and 10000 archetypes made, but not all of them matched the query used in the system. Once all of the created archetypes matched the given query, it seems to show that all of the iteration methods benefit from this in a tangible way. The exception here seems to be with par_for_each, where it seems like the overhead of scheduling the tasks dominates the benchmark, and are otherwise within the margin of error.

group                                  main                                   skip-empty
-----                                  ----                                   ----------
empty_archetypes/for_each/10           1.00      4.6±0.14µs        ? ?/sec    1.05      4.8±0.25µs        ? ?/sec
empty_archetypes/for_each/100          1.00      4.8±0.24µs        ? ?/sec    1.02      4.9±0.20µs        ? ?/sec
empty_archetypes/for_each/1000         1.08      5.7±0.26µs        ? ?/sec    1.00      5.3±0.21µs        ? ?/sec
empty_archetypes/for_each/10000        1.26     15.5±0.16µs        ? ?/sec    1.00     12.3±0.34µs        ? ?/sec
empty_archetypes/for_each/2000         1.20      7.3±0.33µs        ? ?/sec    1.00      6.1±0.27µs        ? ?/sec
empty_archetypes/for_each/500          1.05      5.4±0.20µs        ? ?/sec    1.00      5.1±0.20µs        ? ?/sec
empty_archetypes/for_each/5000         1.40     11.8±0.19µs        ? ?/sec    1.00      8.5±0.26µs        ? ?/sec
empty_archetypes/iter/10               1.08      4.9±0.76µs        ? ?/sec    1.00      4.5±0.18µs        ? ?/sec
empty_archetypes/iter/100              1.18      5.6±0.65µs        ? ?/sec    1.00      4.8±0.33µs        ? ?/sec
empty_archetypes/iter/1000             1.33      7.2±0.30µs        ? ?/sec    1.00      5.5±0.20µs        ? ?/sec
empty_archetypes/iter/10000            1.46     17.7±0.41µs        ? ?/sec    1.00     12.1±0.26µs        ? ?/sec
empty_archetypes/iter/2000             1.35      8.1±0.42µs        ? ?/sec    1.00      6.0±0.29µs        ? ?/sec
empty_archetypes/iter/500              1.26      6.4±0.28µs        ? ?/sec    1.00      5.1±0.22µs        ? ?/sec
empty_archetypes/iter/5000             1.42     11.9±0.15µs        ? ?/sec    1.00      8.4±0.24µs        ? ?/sec
empty_archetypes/par_for_each/10       1.00      8.0±0.38µs        ? ?/sec    1.04      8.3±0.37µs        ? ?/sec
empty_archetypes/par_for_each/100      1.00      8.0±0.32µs        ? ?/sec    1.02      8.2±0.38µs        ? ?/sec
empty_archetypes/par_for_each/1000     1.00      8.3±0.35µs        ? ?/sec    1.11      9.2±0.28µs        ? ?/sec
empty_archetypes/par_for_each/10000    1.00     14.1±0.50µs        ? ?/sec    1.00     14.1±0.47µs        ? ?/sec
empty_archetypes/par_for_each/2000     1.00      8.5±0.37µs        ? ?/sec    1.04      8.9±0.30µs        ? ?/sec

james7132 avatar May 12 '22 05:05 james7132

I dont think its right to conclude this has no effect on normal for_each. The benchmark uses Query<&A> which is going to be doing barely any work on empty archetypes. Something like Query<(&A, &A, &A, &A, &A, &A, &A, &A, &A, &A)> might make this show up in benches

BoxyUwU avatar May 12 '22 13:05 BoxyUwU

I dont think its right to conclude this has no effect on normal for_each. The benchmark uses Query<&A> which is going to be doing barely any work on empty archetypes. Something like Query<(&A, &A, &A, &A, &A, &A, &A, &A, &A, &A)> might make this show up in benches

Tried this and the results are much more pronounced. ~~I also tried filtering during update_archetypes and saw the same overhead at the low levels, but it does scale a lot more easily with higher counts.~~ Nevermind, that's unsound and we might need to find a better way to move the check out of the hot loop.

group                                  main                                   skip-empty
-----                                  ----                                   ----------
empty_archetypes/for_each/10           1.00      4.4±0.14µs        ? ?/sec    1.36      6.0±0.22µs        ? ?/sec
empty_archetypes/for_each/100          1.00      5.0±0.28µs        ? ?/sec    1.38      6.9±0.14µs        ? ?/sec
empty_archetypes/for_each/1000         1.45     10.6±0.16µs        ? ?/sec    1.00      7.3±0.07µs        ? ?/sec
empty_archetypes/for_each/10000        5.21     66.1±1.07µs        ? ?/sec    1.00     12.7±0.62µs        ? ?/sec
empty_archetypes/for_each/2000         2.03     15.4±0.25µs        ? ?/sec    1.00      7.6±0.11µs        ? ?/sec
empty_archetypes/for_each/500          1.21      7.6±0.27µs        ? ?/sec    1.00      6.3±0.17µs        ? ?/sec
empty_archetypes/for_each/5000         3.66     32.5±0.40µs        ? ?/sec    1.00      8.9±0.79µs        ? ?/sec
empty_archetypes/iter/10               1.00      4.8±0.21µs        ? ?/sec    1.32      6.3±0.12µs        ? ?/sec
empty_archetypes/iter/100              1.00      5.0±0.37µs        ? ?/sec    1.24      6.2±0.37µs        ? ?/sec
empty_archetypes/iter/1000             1.56     11.2±0.23µs        ? ?/sec    1.00      7.2±0.10µs        ? ?/sec
empty_archetypes/iter/10000            5.28     66.2±1.52µs        ? ?/sec    1.00     12.5±0.98µs        ? ?/sec
empty_archetypes/iter/2000             2.00     15.0±0.24µs        ? ?/sec    1.00      7.5±0.20µs        ? ?/sec
empty_archetypes/iter/500              1.17      7.7±0.29µs        ? ?/sec    1.00      6.6±0.12µs        ? ?/sec
empty_archetypes/iter/5000             3.87     31.9±0.24µs        ? ?/sec    1.00      8.3±0.41µs        ? ?/sec
empty_archetypes/par_for_each/10       1.36      7.9±0.52µs        ? ?/sec    1.00      5.8±0.86µs        ? ?/sec
empty_archetypes/par_for_each/100      1.31      8.0±0.43µs        ? ?/sec    1.00      6.1±0.33µs        ? ?/sec
empty_archetypes/par_for_each/1000     1.45      8.6±0.43µs        ? ?/sec    1.00      5.9±0.35µs        ? ?/sec
empty_archetypes/par_for_each/10000    1.07     15.2±0.44µs        ? ?/sec    1.00     14.2±0.89µs        ? ?/sec
empty_archetypes/par_for_each/2000     1.46      8.6±0.42µs        ? ?/sec    1.00      5.9±0.32µs        ? ?/sec
empty_archetypes/par_for_each/500      1.36      8.1±0.24µs        ? ?/sec    1.00      6.0±0.62µs        ? ?/sec
empty_archetypes/par_for_each/5000     1.25     11.5±0.32µs        ? ?/sec    1.00      9.1±0.68µs        ? ?/sec

james7132 avatar May 14 '22 07:05 james7132

Is it feasible to cull empty archetypes, or re-order archetypes so that the full ones are always at the end (i.e. so we can make the check for emptyness just be 'reached the end of the archetypes list'?

I imagine the challenge there is in ensuring that all the IDs are up-to-date, since they 'need' to be used for indexing - we'd probably need a re-mapping phase? I'm not sure how well it would even out.

DJMcNab avatar May 14 '22 15:05 DJMcNab

Is it feasible to cull empty archetypes, or re-order archetypes so that the full ones are always at the end (i.e. so we can make the check for emptyness just be 'reached the end of the archetypes list'?

This is what the "valid_archetype_ids" naively attempted to do. We'd need to find an opportune time to inject this, since it seems like only new_archetype is called on queries.

james7132 avatar May 15 '22 21:05 james7132

Revisiting this, it may make sense to add this only for the parallel case, where the work per non-empty archetype is definitively non-trivial since it needs to allocate for each task that is spawned and the overhead is significantly heavier when the task pools are under contention.

james7132 avatar Jun 16 '22 14:06 james7132

Revisiting this, it may make sense to add this only for the parallel case

I think this should be done before merging, which would make this non-controversial IMO.

alice-i-cecile avatar Sep 14 '22 03:09 alice-i-cecile

I think this should be done before merging, which would make this non-controversial IMO.

This was done before this comment. Should be ready to go.

james7132 avatar Oct 19 '22 07:10 james7132

Description could use updating now, but I'm not going to block on that.

bors r+

alice-i-cecile avatar Oct 24 '22 13:10 alice-i-cecile