bevy
bevy copied to clipboard
Skip empty archetypes and tables when iterating over queries
Objective
Speed up queries that are fragmented over many empty archetypes and tables.
Solution
Add a early-out to check if the table or archetype is empty before iterating over it. This adds an extra branch for every archetype matched, but skips setting the archetype/table to the underlying state and any iteration over it.
This may not be worth it for the default Query::iter and maybe even the Query::for_each implementations, but this definitely avoids scheduling unnecessary tasks in the Query::par_for_each case.
Ideally, matched_archetypes should only contain archetypes where there's actually work to do, but this would add a O(n) flat cost to every call to update_archetypes that scales with the number of matched archetypes.
TODO: Benchmark
Code changes look fine to me. The proof, like always, is in the benchmark :) I'd like to see a new benchmark that looks at iteration performance vs the number of empty archetypes (0, 1k, 5k, 10k maybe?), to see where the break-even point is for this change.
QueryIterationCursor needs to be updated
QueryIterationCursorneeds to be updated
Done.
Actual benchmark results here, and the differences here make plenty of sense. Tl;dr: it's really only beneficial to par_for_each and it's variants in the general case, otherwise the total number of empty archetypes aren't really realistic.
group main skip-empty
----- ---- ----------
empty_archetypes/for_each/10 1.00 4.2±0.18µs ? ?/sec 1.47 6.1±0.28µs ? ?/sec
empty_archetypes/for_each/100 1.00 4.6±0.16µs ? ?/sec 1.34 6.2±0.97µs ? ?/sec
empty_archetypes/for_each/1000 1.00 5.6±0.12µs ? ?/sec 1.19 6.6±0.13µs ? ?/sec
empty_archetypes/for_each/10000 1.92 20.7±0.47µs ? ?/sec 1.00 10.8±0.56µs ? ?/sec
empty_archetypes/for_each/2000 1.18 8.8±0.40µs ? ?/sec 1.00 7.5±0.55µs ? ?/sec
empty_archetypes/for_each/500 1.00 5.0±0.19µs ? ?/sec 1.27 6.3±0.14µs ? ?/sec
empty_archetypes/for_each/5000 1.59 12.4±0.33µs ? ?/sec 1.00 7.8±0.75µs ? ?/sec
empty_archetypes/iter/10 1.00 4.3±0.21µs ? ?/sec 1.47 6.4±0.13µs ? ?/sec
empty_archetypes/iter/100 1.00 4.4±0.15µs ? ?/sec 1.44 6.3±0.37µs ? ?/sec
empty_archetypes/iter/1000 1.00 5.6±0.18µs ? ?/sec 1.19 6.6±0.07µs ? ?/sec
empty_archetypes/iter/10000 1.74 22.9±0.66µs ? ?/sec 1.00 13.1±0.83µs ? ?/sec
empty_archetypes/iter/2000 1.22 9.1±0.40µs ? ?/sec 1.00 7.5±0.09µs ? ?/sec
empty_archetypes/iter/500 1.00 5.4±0.25µs ? ?/sec 1.20 6.5±0.13µs ? ?/sec
empty_archetypes/iter/5000 1.61 13.4±0.20µs ? ?/sec 1.00 8.4±0.71µs ? ?/sec
empty_archetypes/par_for_each/10 1.37 7.7±0.39µs ? ?/sec 1.00 5.6±0.45µs ? ?/sec
empty_archetypes/par_for_each/100 1.31 7.7±0.31µs ? ?/sec 1.00 5.8±0.36µs ? ?/sec
empty_archetypes/par_for_each/1000 1.29 8.0±0.38µs ? ?/sec 1.00 6.2±0.71µs ? ?/sec
empty_archetypes/par_for_each/10000 1.11 15.2±0.47µs ? ?/sec 1.00 13.7±0.96µs ? ?/sec
empty_archetypes/par_for_each/2000 1.38 8.5±0.37µs ? ?/sec 1.00 6.2±0.38µs ? ?/sec
empty_archetypes/par_for_each/500 1.39 8.4±0.33µs ? ?/sec 1.00 6.0±0.31µs ? ?/sec
empty_archetypes/par_for_each/5000 1.25 10.8±0.33µs ? ?/sec 1.00 8.6±0.48µs ? ?/sec
Solid work benchmarking. Those numbers make sense to me. Can we do a more targeted change here then?
There were a few errors with how the prior benchmarks were run. There were 10, 100, 1000, 2000, 5000, and 10000 archetypes made, but not all of them matched the query used in the system. Once all of the created archetypes matched the given query, it seems to show that all of the iteration methods benefit from this in a tangible way. The exception here seems to be with par_for_each, where it seems like the overhead of scheduling the tasks dominates the benchmark, and are otherwise within the margin of error.
group main skip-empty
----- ---- ----------
empty_archetypes/for_each/10 1.00 4.6±0.14µs ? ?/sec 1.05 4.8±0.25µs ? ?/sec
empty_archetypes/for_each/100 1.00 4.8±0.24µs ? ?/sec 1.02 4.9±0.20µs ? ?/sec
empty_archetypes/for_each/1000 1.08 5.7±0.26µs ? ?/sec 1.00 5.3±0.21µs ? ?/sec
empty_archetypes/for_each/10000 1.26 15.5±0.16µs ? ?/sec 1.00 12.3±0.34µs ? ?/sec
empty_archetypes/for_each/2000 1.20 7.3±0.33µs ? ?/sec 1.00 6.1±0.27µs ? ?/sec
empty_archetypes/for_each/500 1.05 5.4±0.20µs ? ?/sec 1.00 5.1±0.20µs ? ?/sec
empty_archetypes/for_each/5000 1.40 11.8±0.19µs ? ?/sec 1.00 8.5±0.26µs ? ?/sec
empty_archetypes/iter/10 1.08 4.9±0.76µs ? ?/sec 1.00 4.5±0.18µs ? ?/sec
empty_archetypes/iter/100 1.18 5.6±0.65µs ? ?/sec 1.00 4.8±0.33µs ? ?/sec
empty_archetypes/iter/1000 1.33 7.2±0.30µs ? ?/sec 1.00 5.5±0.20µs ? ?/sec
empty_archetypes/iter/10000 1.46 17.7±0.41µs ? ?/sec 1.00 12.1±0.26µs ? ?/sec
empty_archetypes/iter/2000 1.35 8.1±0.42µs ? ?/sec 1.00 6.0±0.29µs ? ?/sec
empty_archetypes/iter/500 1.26 6.4±0.28µs ? ?/sec 1.00 5.1±0.22µs ? ?/sec
empty_archetypes/iter/5000 1.42 11.9±0.15µs ? ?/sec 1.00 8.4±0.24µs ? ?/sec
empty_archetypes/par_for_each/10 1.00 8.0±0.38µs ? ?/sec 1.04 8.3±0.37µs ? ?/sec
empty_archetypes/par_for_each/100 1.00 8.0±0.32µs ? ?/sec 1.02 8.2±0.38µs ? ?/sec
empty_archetypes/par_for_each/1000 1.00 8.3±0.35µs ? ?/sec 1.11 9.2±0.28µs ? ?/sec
empty_archetypes/par_for_each/10000 1.00 14.1±0.50µs ? ?/sec 1.00 14.1±0.47µs ? ?/sec
empty_archetypes/par_for_each/2000 1.00 8.5±0.37µs ? ?/sec 1.04 8.9±0.30µs ? ?/sec
I dont think its right to conclude this has no effect on normal for_each. The benchmark uses Query<&A> which is going to be doing barely any work on empty archetypes. Something like Query<(&A, &A, &A, &A, &A, &A, &A, &A, &A, &A)> might make this show up in benches
I dont think its right to conclude this has no effect on normal
for_each. The benchmark usesQuery<&A>which is going to be doing barely any work on empty archetypes. Something likeQuery<(&A, &A, &A, &A, &A, &A, &A, &A, &A, &A)>might make this show up in benches
Tried this and the results are much more pronounced. ~~I also tried filtering during update_archetypes and saw the same overhead at the low levels, but it does scale a lot more easily with higher counts.~~ Nevermind, that's unsound and we might need to find a better way to move the check out of the hot loop.
group main skip-empty
----- ---- ----------
empty_archetypes/for_each/10 1.00 4.4±0.14µs ? ?/sec 1.36 6.0±0.22µs ? ?/sec
empty_archetypes/for_each/100 1.00 5.0±0.28µs ? ?/sec 1.38 6.9±0.14µs ? ?/sec
empty_archetypes/for_each/1000 1.45 10.6±0.16µs ? ?/sec 1.00 7.3±0.07µs ? ?/sec
empty_archetypes/for_each/10000 5.21 66.1±1.07µs ? ?/sec 1.00 12.7±0.62µs ? ?/sec
empty_archetypes/for_each/2000 2.03 15.4±0.25µs ? ?/sec 1.00 7.6±0.11µs ? ?/sec
empty_archetypes/for_each/500 1.21 7.6±0.27µs ? ?/sec 1.00 6.3±0.17µs ? ?/sec
empty_archetypes/for_each/5000 3.66 32.5±0.40µs ? ?/sec 1.00 8.9±0.79µs ? ?/sec
empty_archetypes/iter/10 1.00 4.8±0.21µs ? ?/sec 1.32 6.3±0.12µs ? ?/sec
empty_archetypes/iter/100 1.00 5.0±0.37µs ? ?/sec 1.24 6.2±0.37µs ? ?/sec
empty_archetypes/iter/1000 1.56 11.2±0.23µs ? ?/sec 1.00 7.2±0.10µs ? ?/sec
empty_archetypes/iter/10000 5.28 66.2±1.52µs ? ?/sec 1.00 12.5±0.98µs ? ?/sec
empty_archetypes/iter/2000 2.00 15.0±0.24µs ? ?/sec 1.00 7.5±0.20µs ? ?/sec
empty_archetypes/iter/500 1.17 7.7±0.29µs ? ?/sec 1.00 6.6±0.12µs ? ?/sec
empty_archetypes/iter/5000 3.87 31.9±0.24µs ? ?/sec 1.00 8.3±0.41µs ? ?/sec
empty_archetypes/par_for_each/10 1.36 7.9±0.52µs ? ?/sec 1.00 5.8±0.86µs ? ?/sec
empty_archetypes/par_for_each/100 1.31 8.0±0.43µs ? ?/sec 1.00 6.1±0.33µs ? ?/sec
empty_archetypes/par_for_each/1000 1.45 8.6±0.43µs ? ?/sec 1.00 5.9±0.35µs ? ?/sec
empty_archetypes/par_for_each/10000 1.07 15.2±0.44µs ? ?/sec 1.00 14.2±0.89µs ? ?/sec
empty_archetypes/par_for_each/2000 1.46 8.6±0.42µs ? ?/sec 1.00 5.9±0.32µs ? ?/sec
empty_archetypes/par_for_each/500 1.36 8.1±0.24µs ? ?/sec 1.00 6.0±0.62µs ? ?/sec
empty_archetypes/par_for_each/5000 1.25 11.5±0.32µs ? ?/sec 1.00 9.1±0.68µs ? ?/sec
Is it feasible to cull empty archetypes, or re-order archetypes so that the full ones are always at the end (i.e. so we can make the check for emptyness just be 'reached the end of the archetypes list'?
I imagine the challenge there is in ensuring that all the IDs are up-to-date, since they 'need' to be used for indexing - we'd probably need a re-mapping phase? I'm not sure how well it would even out.
Is it feasible to cull empty archetypes, or re-order archetypes so that the full ones are always at the end (i.e. so we can make the check for emptyness just be 'reached the end of the archetypes list'?
This is what the "valid_archetype_ids" naively attempted to do. We'd need to find an opportune time to inject this, since it seems like only new_archetype is called on queries.
Revisiting this, it may make sense to add this only for the parallel case, where the work per non-empty archetype is definitively non-trivial since it needs to allocate for each task that is spawned and the overhead is significantly heavier when the task pools are under contention.
Revisiting this, it may make sense to add this only for the parallel case
I think this should be done before merging, which would make this non-controversial IMO.
I think this should be done before merging, which would make this non-controversial IMO.
This was done before this comment. Should be ready to go.
Description could use updating now, but I'm not going to block on that.
bors r+
Pull request successfully merged into main.
Build succeeded: