Split Component Ticks

Open james7132 opened this issue 3 years ago • 7 comments

Objective

Fixes #4884. ComponentTicks stores both added and changed ticks contiguously in the same 8 bytes. This is convenient when passing around both together, but causes half the bytes fetched from memory for the purposes of change detection to effectively go unused. This is inefficient when most queries (no filter, mutating something) only write out to the changed ticks.

Solution

Split the storage for change detection ticks into two separate Vecs inside Column. Fetch only what is needed during iteration.

This also potentially removes one blocker to autovectorization of dense queries.

EDIT: This is confirmed to enable autovectorization of dense queries in for_each and par_for_each where possible. Unfortunately iter has other blockers that prevent it.
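As a rough illustration of the layout change described above, here is a minimal sketch assuming a column of `f32` components. `PackedTicks`, `SplitColumn`, and `add_to_all` are hypothetical stand-ins invented for this example, not Bevy's actual types or methods. The point is only that a mutating loop over the split layout touches the data and the changed ticks, and never reads or writes the added ticks.

```rust
// Hypothetical, simplified stand-ins; these are not Bevy's actual types.

/// A single change detection tick.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Tick(u32);

/// Before: both ticks interleaved, 8 bytes per component.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct PackedTicks {
    added: u32,
    changed: u32,
}

/// After: each kind of tick lives in its own dense Vec.
struct SplitColumn {
    data: Vec<f32>,
    added_ticks: Vec<Tick>,
    changed_ticks: Vec<Tick>,
}

impl SplitColumn {
    fn new(values: &[f32], spawn_tick: u32) -> Self {
        SplitColumn {
            data: values.to_vec(),
            added_ticks: vec![Tick(spawn_tick); values.len()],
            changed_ticks: vec![Tick(spawn_tick); values.len()],
        }
    }

    /// Mutating iteration: only `data` and `changed_ticks` are accessed.
    fn add_to_all(&mut self, delta: f32, current_tick: u32) {
        for (value, changed) in self.data.iter_mut().zip(self.changed_ticks.iter_mut()) {
            *value += delta;
            *changed = Tick(current_tick);
        }
    }
}

fn main() {
    let mut column = SplitColumn::new(&[1.0, 2.0, 3.0], 1);
    column.add_to_all(0.5, 7);
    println!("packed ticks size: {} bytes", std::mem::size_of::<PackedTicks>());
    println!("data: {:?}", column.data);
    println!("added: {:?}", column.added_ticks);
    println!("changed: {:?}", column.changed_ticks);
}
```

With the packed layout, the same loop would pull every `added` value into cache alongside the `changed` value it actually writes.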

TODO

  • [x] Microbenchmark
  • [x] Check if this allows query iteration to autovectorize simple loops.
  • [x] Clean up all of the spurious tuples now littered throughout the API

Open Questions

  • ~~Is Mut::is_added absolutely necessary? Can we not just use Added or ChangeTrackers?~~ It's optimized out if unused.
  • ~~Does the fetch of the added ticks get optimized out if not used?~~ Yes it is.

Changelog

  • Added: Tick, a wrapper around a single change detection tick.
  • Added: Column::get_added_ticks
  • Added: Column::get_column_ticks
  • Added: SparseSet::get_added_ticks
  • Added: SparseSet::get_column_ticks
  • Changed: Column now stores added and changed ticks separately internally.
  • Changed: Most APIs returning &UnsafeCell<ComponentTicks> now return TickCells instead, which contains two separate &UnsafeCell<Tick>, one for each component tick.
  • Changed: Query::for_each(_mut) and Query::par_for_each(_mut) will now leverage autovectorization to speed up query iteration where possible.
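To make the shape of the new types concrete, here is a minimal sketch of what a `Tick` wrapper and a `TickCells` pair of references could look like. The field names (`tick`, `added`, `changed`) and the `demo` function are assumptions made for illustration. Only the type names and the "two separate &UnsafeCell<Tick>" description come from the changelog, and the real definitions may differ.

```rust
use std::cell::UnsafeCell;

// Hypothetical shapes for illustration; Bevy's real definitions may differ.

/// Wrapper around a single change detection tick.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Tick {
    tick: u32,
}

/// Two independent references, one per kind of tick, replacing a single
/// reference to a combined added + changed struct.
#[derive(Clone, Copy)]
struct TickCells<'a> {
    added: &'a UnsafeCell<Tick>,
    changed: &'a UnsafeCell<Tick>,
}

/// Writes only the changed tick, then reads both back.
fn demo() -> (Tick, Tick) {
    let added = UnsafeCell::new(Tick { tick: 1 });
    let changed = UnsafeCell::new(Tick { tick: 1 });
    let cells = TickCells { added: &added, changed: &changed };

    // SAFETY: single-threaded, and no other references to the cell contents
    // exist while we read and write through the raw pointers.
    unsafe {
        *cells.changed.get() = Tick { tick: 5 };
        (*cells.added.get(), *cells.changed.get())
    }
}

fn main() {
    let (added, changed) = demo();
    println!("added = {:?}, changed = {:?}", added, changed);
}
```

Because the two cells are separate references, they can point into two different backing arrays, which is what allows a fetch to skip the added ticks entirely.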

Migration Guide

TODO

james7132 avatar Nov 11 '22 09:11 james7132

Quick answers:

  • Bad news: No, this does not allow auto-vectorization.
  • Good news: the added ticks are not included in the final output if unused.

Confirmed by checking the same test code's output used in #6461. It's identical save for the offset calculation on the movl instruction:

This PR

.LBB1_5:
	movss	(%rdi,%rbx,4), %xmm0
	movl	%r8d, (%rsi,%rbx,4)
	addss	(%r11,%rbx,4), %xmm0
	movss	%xmm0, (%r11,%rbx,4)
	incq	%rbx
.LBB1_1:
	cmpq	%rax, %rbx
	jne	.LBB1_5
	.p2align	4, 0x90

The output from #6461

.LBB5_5:
	movss	(%rbx,%rbp,4), %xmm0
	movl	%r14d, 4(%r8,%rbp,8)
	addss	(%rdi,%rbp,4), %xmm0
	movss	%xmm0, (%rdi,%rbp,4)
	incq	%rbp
.LBB5_1:
	cmpq	%rdx, %rbp
	jne	.LBB5_5
	.p2align	4, 0x90
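
The difference between the two movl operands can be read as plain address arithmetic. In AT&T syntax, `4(%r8,%rbp,8)` means base + 8 * index + 4, while `(%rsi,%rbx,4)` means base + 4 * index. The sketch below reproduces those two byte offsets, assuming the movl is the store of the changed tick and that the interleaved struct has `added` first and `changed` second. The struct and function names are hypothetical.

```rust
// Hypothetical layout mirroring the PR description; not Bevy's real type.
#[allow(dead_code)]
#[repr(C)]
struct PackedTicks {
    added: u32,
    changed: u32,
}

/// Byte offset of the `changed` field inside one `PackedTicks`.
fn changed_field_offset() -> usize {
    let ticks = PackedTicks { added: 0, changed: 0 };
    let base = &ticks as *const PackedTicks as usize;
    let field = &ticks.changed as *const u32 as usize;
    field - base
}

/// Interleaved layout: stride of 8 bytes plus the field offset,
/// corresponding to an operand of the form `4(base,index,8)`.
fn packed_changed_offset(index: usize) -> usize {
    index * std::mem::size_of::<PackedTicks>() + changed_field_offset()
}

/// Split layout: a dense array of 4-byte ticks,
/// corresponding to an operand of the form `(base,index,4)`.
fn split_changed_offset(index: usize) -> usize {
    index * std::mem::size_of::<u32>()
}

fn main() {
    for index in 0..4 {
        println!(
            "index {}: packed offset {}, split offset {}",
            index,
            packed_changed_offset(index),
            split_changed_offset(index)
        );
    }
}
```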

james7132 avatar Nov 11 '22 10:11 james7132

Finalized a set of microbenchmarks. Looks like there's a massive improvement in iteration times where there are mutable queries, slight improvements in overall change detection, and small regressions in commands.

group                                                             main                                     split-ticks
-----                                                             ----                                     -----------
add_remove/sparse_set                                             1.00   995.6±74.34µs        ? ?/sec      1.00   998.1±62.41µs        ? ?/sec
add_remove/table                                                  1.00  1260.4±10.96µs        ? ?/sec      1.08  1367.5±26.72µs        ? ?/sec
add_remove_big/sparse_set                                         1.02  1181.0±319.71µs        ? ?/sec     1.00  1158.4±281.59µs        ? ?/sec
add_remove_big/table                                              1.00      2.4±0.03ms        ? ?/sec      1.12      2.7±0.03ms        ? ?/sec
all_added_detection/50000_entities_change_detection::Sparse       1.00   129.8±15.35µs        ? ?/sec      1.00   129.4±22.14µs        ? ?/sec
all_added_detection/50000_entities_change_detection::Table        1.00   101.4±18.29µs        ? ?/sec      1.13   114.1±18.04µs        ? ?/sec
all_added_detection/60000_entities_change_detection::Sparse       1.23   237.5±29.61µs        ? ?/sec      1.00   193.6±46.10µs        ? ?/sec
all_added_detection/60000_entities_change_detection::Table        1.00    144.1±8.31µs        ? ?/sec      1.06    152.8±4.73µs        ? ?/sec
all_changed_detection/50000_entities_change_detection::Sparse     1.01   135.0±31.82µs        ? ?/sec      1.00   133.8±30.77µs        ? ?/sec
all_changed_detection/50000_entities_change_detection::Table      1.03     97.1±4.01µs        ? ?/sec      1.00     93.9±4.29µs        ? ?/sec
all_changed_detection/60000_entities_change_detection::Sparse     1.06   187.0±35.59µs        ? ?/sec      1.00   177.1±31.78µs        ? ?/sec
all_changed_detection/60000_entities_change_detection::Table      1.00    138.4±3.56µs        ? ?/sec      1.00    138.5±3.66µs        ? ?/sec
busy_systems/01x_entities_03_systems                              1.32     31.6±0.88µs        ? ?/sec      1.00     24.0±0.91µs        ? ?/sec
busy_systems/01x_entities_06_systems                              1.25     70.6±3.67µs        ? ?/sec      1.00     56.6±4.68µs        ? ?/sec
busy_systems/01x_entities_09_systems                              1.56    127.0±7.21µs        ? ?/sec      1.00     81.5±2.37µs        ? ?/sec
busy_systems/01x_entities_12_systems                              1.57    153.1±6.28µs        ? ?/sec      1.00     97.7±2.90µs        ? ?/sec
busy_systems/01x_entities_15_systems                              1.60   186.4±12.49µs        ? ?/sec      1.00    116.8±3.90µs        ? ?/sec
busy_systems/02x_entities_03_systems                              1.53     73.2±4.99µs        ? ?/sec      1.00     47.8±2.15µs        ? ?/sec
busy_systems/02x_entities_06_systems                              1.95   163.1±14.54µs        ? ?/sec      1.00     83.6±4.61µs        ? ?/sec
busy_systems/02x_entities_09_systems                              1.53   207.6±16.86µs        ? ?/sec      1.00    135.9±6.00µs        ? ?/sec
busy_systems/02x_entities_12_systems                              1.74   287.4±16.27µs        ? ?/sec      1.00    165.1±4.98µs        ? ?/sec
busy_systems/02x_entities_15_systems                              1.69   329.7±14.64µs        ? ?/sec      1.00    195.3±5.06µs        ? ?/sec
busy_systems/03x_entities_03_systems                              1.48    101.5±6.28µs        ? ?/sec      1.00     68.7±2.51µs        ? ?/sec
busy_systems/03x_entities_06_systems                              1.59   205.8±15.66µs        ? ?/sec      1.00    129.4±8.53µs        ? ?/sec
busy_systems/03x_entities_09_systems                              1.49   308.8±15.80µs        ? ?/sec      1.00   207.7±11.45µs        ? ?/sec
busy_systems/03x_entities_12_systems                              1.68   405.0±16.62µs        ? ?/sec      1.00   240.8±10.02µs        ? ?/sec
busy_systems/03x_entities_15_systems                              1.75   514.0±15.58µs        ? ?/sec      1.00    294.0±9.37µs        ? ?/sec
busy_systems/04x_entities_03_systems                              1.69    122.3±6.27µs        ? ?/sec      1.00     72.2±3.99µs        ? ?/sec
busy_systems/04x_entities_06_systems                              1.49   305.5±18.09µs        ? ?/sec      1.00   205.1±13.88µs        ? ?/sec
busy_systems/04x_entities_09_systems                              2.06   423.4±20.03µs        ? ?/sec      1.00    205.1±7.81µs        ? ?/sec
busy_systems/04x_entities_12_systems                              1.67   537.6±23.43µs        ? ?/sec      1.00   321.1±16.07µs        ? ?/sec
busy_systems/04x_entities_15_systems                              1.85   669.0±21.67µs        ? ?/sec      1.00   360.7±13.11µs        ? ?/sec
busy_systems/05x_entities_03_systems                              1.36   147.0±10.16µs        ? ?/sec      1.00    107.9±6.56µs        ? ?/sec
busy_systems/05x_entities_06_systems                              1.57   374.1±23.16µs        ? ?/sec      1.00   238.2±14.32µs        ? ?/sec
busy_systems/05x_entities_09_systems                              1.63   507.2±32.37µs        ? ?/sec      1.00   312.0±11.55µs        ? ?/sec
busy_systems/05x_entities_12_systems                              2.07   709.1±32.12µs        ? ?/sec      1.00   342.8±12.35µs        ? ?/sec
busy_systems/05x_entities_15_systems                              1.97   810.0±41.95µs        ? ?/sec      1.00   410.4±10.81µs        ? ?/sec
contrived/01x_entities_03_systems                                 1.37     27.4±0.55µs        ? ?/sec      1.00     19.9±0.55µs        ? ?/sec
contrived/01x_entities_06_systems                                 1.37     47.3±1.35µs        ? ?/sec      1.00     34.6±1.24µs        ? ?/sec
contrived/01x_entities_09_systems                                 1.43     71.5±1.43µs        ? ?/sec      1.00     50.1±1.07µs        ? ?/sec
contrived/01x_entities_12_systems                                 1.40     93.1±1.60µs        ? ?/sec      1.00     66.6±1.62µs        ? ?/sec
contrived/01x_entities_15_systems                                 1.44    117.1±1.94µs        ? ?/sec      1.00     81.1±2.39µs        ? ?/sec
contrived/02x_entities_03_systems                                 1.47     40.4±1.50µs        ? ?/sec      1.00     27.5±0.73µs        ? ?/sec
contrived/02x_entities_06_systems                                 1.31     79.2±2.25µs        ? ?/sec      1.00     60.6±1.45µs        ? ?/sec
contrived/02x_entities_09_systems                                 1.29    109.7±2.42µs        ? ?/sec      1.00     85.2±2.15µs        ? ?/sec
contrived/02x_entities_12_systems                                 1.28    145.2±3.18µs        ? ?/sec      1.00    113.2±1.40µs        ? ?/sec
contrived/02x_entities_15_systems                                 1.29    180.4±3.33µs        ? ?/sec      1.00    139.7±2.35µs        ? ?/sec
contrived/03x_entities_03_systems                                 1.19     56.7±2.67µs        ? ?/sec      1.00     47.7±0.88µs        ? ?/sec
contrived/03x_entities_06_systems                                 1.12    108.3±6.38µs        ? ?/sec      1.00     97.0±1.81µs        ? ?/sec
contrived/03x_entities_09_systems                                 1.12    157.2±3.58µs        ? ?/sec      1.00    139.7±3.22µs        ? ?/sec
contrived/03x_entities_12_systems                                 1.08    203.1±7.54µs        ? ?/sec      1.00    188.1±3.33µs        ? ?/sec
contrived/03x_entities_15_systems                                 1.12    261.2±4.55µs        ? ?/sec      1.00    233.2±4.01µs        ? ?/sec
contrived/04x_entities_03_systems                                 1.08     67.3±1.03µs        ? ?/sec      1.00     62.6±0.89µs        ? ?/sec
contrived/04x_entities_06_systems                                 1.11    140.5±2.89µs        ? ?/sec      1.00    126.2±2.04µs        ? ?/sec
contrived/04x_entities_09_systems                                 1.13    196.7±5.23µs        ? ?/sec      1.00    174.0±2.10µs        ? ?/sec
contrived/04x_entities_12_systems                                 1.08    249.2±4.50µs        ? ?/sec      1.00    231.5±4.47µs        ? ?/sec
contrived/04x_entities_15_systems                                 1.07    306.2±6.26µs        ? ?/sec      1.00    284.8±4.49µs        ? ?/sec
contrived/05x_entities_03_systems                                 1.43     80.3±2.71µs        ? ?/sec      1.00     56.3±2.39µs        ? ?/sec
contrived/05x_entities_06_systems                                 1.53    162.0±5.56µs        ? ?/sec      1.00    105.6±3.98µs        ? ?/sec
contrived/05x_entities_09_systems                                 1.43    220.8±5.49µs        ? ?/sec      1.00    154.8±4.08µs        ? ?/sec
contrived/05x_entities_12_systems                                 1.40    269.8±9.32µs        ? ?/sec      1.00    192.7±4.09µs        ? ?/sec
contrived/05x_entities_15_systems                                 1.43   337.3±14.15µs        ? ?/sec      1.00    236.4±7.95µs        ? ?/sec
empty_commands/0_entities                                         1.01      5.3±0.05ns        ? ?/sec      1.00      5.2±0.07ns        ? ?/sec
fake_commands/2000_commands                                       1.02      7.3±0.06µs        ? ?/sec      1.00      7.2±0.08µs        ? ?/sec
fake_commands/4000_commands                                       1.01     14.5±0.09µs        ? ?/sec      1.00     14.3±0.11µs        ? ?/sec
fake_commands/6000_commands                                       1.01     21.7±0.12µs        ? ?/sec      1.00     21.5±0.08µs        ? ?/sec
fake_commands/8000_commands                                       1.00     28.7±0.15µs        ? ?/sec      1.01     29.0±0.18µs        ? ?/sec
few_changed_detection/50000_entities_change_detection::Sparse     1.49   261.7±35.23µs        ? ?/sec      1.00   175.9±36.43µs        ? ?/sec
few_changed_detection/50000_entities_change_detection::Table      1.00    129.0±4.86µs        ? ?/sec      1.03   132.7±31.12µs        ? ?/sec
few_changed_detection/60000_entities_change_detection::Sparse     1.56   301.5±44.18µs        ? ?/sec      1.00   193.1±15.92µs        ? ?/sec
few_changed_detection/60000_entities_change_detection::Table      1.27   198.2±11.89µs        ? ?/sec      1.00    155.8±3.68µs        ? ?/sec
get_or_spawn/batched                                              1.01   413.4±13.50µs        ? ?/sec      1.00   411.2±18.85µs        ? ?/sec
get_or_spawn/individual                                           1.02   737.8±63.64µs        ? ?/sec      1.00   721.7±41.04µs        ? ?/sec
heavy_compute/base                                                1.01    295.4±2.49µs        ? ?/sec      1.00    292.2±1.85µs        ? ?/sec
insert_commands/insert                                            1.00   616.8±30.12µs        ? ?/sec      1.03   635.5±35.98µs        ? ?/sec
insert_commands/insert_batch                                      1.01   418.4±24.28µs        ? ?/sec      1.00   412.5±25.14µs        ? ?/sec
insert_simple/base                                                1.00    359.5±2.71µs        ? ?/sec      1.18    425.3±5.51µs        ? ?/sec
insert_simple/unbatched                                           1.00   902.9±13.76µs        ? ?/sec      1.08   973.8±21.91µs        ? ?/sec
iter_fragmented/base                                              1.00    349.6±8.91ns        ? ?/sec      1.02    357.4±5.34ns        ? ?/sec
iter_fragmented/foreach                                           1.49   240.9±23.68ns        ? ?/sec      1.00   161.7±19.36ns        ? ?/sec
iter_fragmented/foreach_wide                                      1.00      4.0±0.12µs        ? ?/sec      1.02      4.0±0.49µs        ? ?/sec
iter_fragmented/wide                                              1.02      4.0±0.15µs        ? ?/sec      1.00      3.9±0.13µs        ? ?/sec
iter_fragmented_sparse/base                                       1.03      9.0±0.91ns        ? ?/sec      1.00      8.7±0.49ns        ? ?/sec
iter_fragmented_sparse/foreach                                    1.00      7.7±0.13ns        ? ?/sec      1.01      7.8±0.29ns        ? ?/sec
iter_fragmented_sparse/foreach_wide                               1.00     41.2±3.56ns        ? ?/sec      1.08     44.5±0.49ns        ? ?/sec
iter_fragmented_sparse/wide                                       1.04    45.8±11.87ns        ? ?/sec      1.00     44.1±1.01ns        ? ?/sec
iter_simple/base                                                  1.00      8.4±0.24µs        ? ?/sec      1.00      8.4±0.11µs        ? ?/sec
iter_simple/foreach                                               1.00      8.3±0.06µs        ? ?/sec      1.03      8.5±0.11µs        ? ?/sec
iter_simple/foreach_sparse_set                                    1.01     26.1±0.17µs        ? ?/sec      1.00     25.8±0.25µs        ? ?/sec
iter_simple/foreach_wide                                          1.03     40.0±0.28µs        ? ?/sec      1.00     38.7±1.11µs        ? ?/sec
iter_simple/foreach_wide_sparse_set                               1.02    117.0±1.69µs        ? ?/sec      1.00    115.0±0.72µs        ? ?/sec
iter_simple/sparse_set                                            1.01     28.7±0.22µs        ? ?/sec      1.00     28.5±0.18µs        ? ?/sec
iter_simple/system                                                1.00      8.3±0.13µs        ? ?/sec      1.01      8.4±0.07µs        ? ?/sec
iter_simple/wide                                                  1.06     41.6±0.74µs        ? ?/sec      1.00     39.3±0.98µs        ? ?/sec
iter_simple/wide_sparse_set                                       1.00    125.1±1.18µs        ? ?/sec      1.01    126.4±1.67µs        ? ?/sec
none_changed_detection/50000_entities_change_detection::Sparse    1.05   100.1±22.18µs        ? ?/sec      1.00     95.7±9.15µs        ? ?/sec
none_changed_detection/50000_entities_change_detection::Table     1.02     78.7±4.28µs        ? ?/sec      1.00     77.2±3.74µs        ? ?/sec
none_changed_detection/60000_entities_change_detection::Sparse    1.06   154.4±34.35µs        ? ?/sec      1.00   145.0±23.25µs        ? ?/sec
none_changed_detection/60000_entities_change_detection::Table     1.02   119.2±12.97µs        ? ?/sec      1.00    116.7±3.00µs        ? ?/sec
query_get/50000_entities_sparse                                   1.00    318.2±4.09µs        ? ?/sec      1.03   327.5±20.06µs        ? ?/sec
query_get/50000_entities_table                                    1.00    306.3±3.95µs        ? ?/sec      1.02    311.6±5.34µs        ? ?/sec
query_get_component/50000_entities_sparse                         1.00   975.1±40.73µs        ? ?/sec      1.03  1000.2±40.68µs        ? ?/sec
query_get_component/50000_entities_table                          1.05   1079.7±7.35µs        ? ?/sec      1.00  1029.7±13.87µs        ? ?/sec
query_get_component_simple/system                                 1.00    747.2±5.61µs        ? ?/sec      1.02   762.1±13.26µs        ? ?/sec
query_get_component_simple/unchecked                              1.00   862.2±15.25µs        ? ?/sec      1.13   977.4±16.50µs        ? ?/sec
query_get_many_10/50000_calls_sparse                              1.00      4.1±0.33ms        ? ?/sec      1.08      4.5±0.44ms        ? ?/sec
query_get_many_10/50000_calls_table                               1.02      4.2±0.15ms        ? ?/sec      1.00      4.1±0.16ms        ? ?/sec
query_get_many_2/50000_calls_sparse                               1.00   649.1±51.71µs        ? ?/sec      1.01   655.5±71.27µs        ? ?/sec
query_get_many_2/50000_calls_table                                1.00   707.5±40.72µs        ? ?/sec      1.00   704.4±51.65µs        ? ?/sec
query_get_many_5/50000_calls_sparse                               1.01  1964.7±99.43µs        ? ?/sec      1.00  1953.2±110.73µs        ? ?/sec
query_get_many_5/50000_calls_table                                1.01  1940.1±94.22µs        ? ?/sec      1.00  1915.9±88.00µs        ? ?/sec
run_criteria/yes_using_query/001_systems                          1.05      3.9±0.14µs        ? ?/sec      1.00      3.7±0.17µs        ? ?/sec
run_criteria/yes_using_query/006_systems                          1.00      8.5±0.31µs        ? ?/sec      1.04      8.9±0.29µs        ? ?/sec
run_criteria/yes_using_query/011_systems                          1.00     13.1±0.61µs        ? ?/sec      1.04     13.7±0.44µs        ? ?/sec
run_criteria/yes_using_query/016_systems                          1.00     18.4±0.75µs        ? ?/sec      1.03     19.0±0.75µs        ? ?/sec
run_criteria/yes_using_query/021_systems                          1.00     23.4±1.01µs        ? ?/sec      1.04     24.2±0.47µs        ? ?/sec
run_criteria/yes_using_query/026_systems                          1.00     28.5±0.80µs        ? ?/sec      1.02     29.1±0.65µs        ? ?/sec
run_criteria/yes_using_query/031_systems                          1.00     32.6±1.00µs        ? ?/sec      1.04     33.9±0.85µs        ? ?/sec
run_criteria/yes_using_query/036_systems                          1.00     37.2±1.41µs        ? ?/sec      1.05     39.2±1.12µs        ? ?/sec
run_criteria/yes_using_query/041_systems                          1.00     42.4±1.04µs        ? ?/sec      1.04     44.0±1.04µs        ? ?/sec
run_criteria/yes_using_query/046_systems                          1.00     46.1±1.34µs        ? ?/sec      1.06     48.8±1.47µs        ? ?/sec
run_criteria/yes_using_query/051_systems                          1.00     49.9±1.84µs        ? ?/sec      1.07     53.5±1.48µs        ? ?/sec
run_criteria/yes_using_query/056_systems                          1.00     54.4±2.28µs        ? ?/sec      1.07     58.4±1.72µs        ? ?/sec
run_criteria/yes_using_query/061_systems                          1.00     61.0±4.03µs        ? ?/sec      1.04     63.3±3.09µs        ? ?/sec
run_criteria/yes_using_query/066_systems                          1.00     66.7±2.69µs        ? ?/sec      1.06     71.0±2.63µs        ? ?/sec
run_criteria/yes_using_query/071_systems                          1.00     70.4±3.09µs        ? ?/sec      1.09     76.8±1.88µs        ? ?/sec
run_criteria/yes_using_query/076_systems                          1.00     76.3±2.99µs        ? ?/sec      1.07     81.7±3.53µs        ? ?/sec
run_criteria/yes_using_query/081_systems                          1.00     82.6±5.01µs        ? ?/sec      1.07     88.1±3.29µs        ? ?/sec
run_criteria/yes_using_query/086_systems                          1.00     88.1±4.59µs        ? ?/sec      1.07     94.3±4.95µs        ? ?/sec
run_criteria/yes_using_query/091_systems                          1.00     92.2±3.34µs        ? ?/sec      1.12    103.2±3.86µs        ? ?/sec
run_criteria/yes_using_query/096_systems                          1.00     96.3±5.89µs        ? ?/sec      1.13    108.8±5.45µs        ? ?/sec
run_criteria/yes_using_query/101_systems                          1.00    107.5±5.16µs        ? ?/sec      1.08    116.3±4.08µs        ? ?/sec
run_criteria/yes_using_resource/001_systems                       1.00      3.4±0.16µs        ? ?/sec      1.13      3.8±0.20µs        ? ?/sec
run_criteria/yes_using_resource/006_systems                       1.00      8.3±0.33µs        ? ?/sec      1.07      8.9±0.30µs        ? ?/sec
run_criteria/yes_using_resource/011_systems                       1.00     13.7±0.55µs        ? ?/sec      1.01     13.8±0.61µs        ? ?/sec
run_criteria/yes_using_resource/016_systems                       1.00     18.5±0.71µs        ? ?/sec      1.05     19.4±0.73µs        ? ?/sec
run_criteria/yes_using_resource/021_systems                       1.00     23.2±0.91µs        ? ?/sec      1.04     24.2±0.94µs        ? ?/sec
run_criteria/yes_using_resource/026_systems                       1.00     28.2±0.97µs        ? ?/sec      1.03     28.9±0.88µs        ? ?/sec
run_criteria/yes_using_resource/031_systems                       1.00     33.5±0.62µs        ? ?/sec      1.02     34.1±0.98µs        ? ?/sec
run_criteria/yes_using_resource/036_systems                       1.00     37.9±0.89µs        ? ?/sec      1.02     38.8±1.27µs        ? ?/sec
run_criteria/yes_using_resource/041_systems                       1.00     42.3±0.94µs        ? ?/sec      1.03     43.7±1.38µs        ? ?/sec
run_criteria/yes_using_resource/046_systems                       1.00     47.3±0.92µs        ? ?/sec      1.00     47.4±3.94µs        ? ?/sec
run_criteria/yes_using_resource/051_systems                       1.00     51.8±1.61µs        ? ?/sec      1.03     53.6±2.16µs        ? ?/sec
run_criteria/yes_using_resource/056_systems                       1.00     56.5±1.75µs        ? ?/sec      1.04     58.8±2.68µs        ? ?/sec
run_criteria/yes_using_resource/061_systems                       1.00     61.5±1.63µs        ? ?/sec      1.07     65.7±1.96µs        ? ?/sec
run_criteria/yes_using_resource/066_systems                       1.00     68.2±1.95µs        ? ?/sec      1.04     70.9±2.98µs        ? ?/sec
run_criteria/yes_using_resource/071_systems                       1.00     72.9±2.42µs        ? ?/sec      1.05     76.3±2.99µs        ? ?/sec
run_criteria/yes_using_resource/076_systems                       1.00     76.9±2.67µs        ? ?/sec      1.06     81.2±4.19µs        ? ?/sec
run_criteria/yes_using_resource/081_systems                       1.00     80.8±4.26µs        ? ?/sec      1.09     88.3±3.82µs        ? ?/sec
run_criteria/yes_using_resource/086_systems                       1.00     88.4±4.22µs        ? ?/sec      1.08     95.5±3.37µs        ? ?/sec
run_criteria/yes_using_resource/091_systems                       1.00     95.9±3.41µs        ? ?/sec      1.05    100.6±4.76µs        ? ?/sec
run_criteria/yes_using_resource/096_systems                       1.00    102.5±4.60µs        ? ?/sec      1.05    108.0±3.88µs        ? ?/sec
run_criteria/yes_using_resource/101_systems                       1.00    106.8±4.89µs        ? ?/sec      1.11    118.0±4.52µs        ? ?/sec
sized_commands_0_bytes/2000_commands                              1.00      5.1±0.03µs        ? ?/sec      1.09      5.6±0.07µs        ? ?/sec
sized_commands_0_bytes/4000_commands                              1.00     10.2±0.10µs        ? ?/sec      1.10     11.3±0.10µs        ? ?/sec
sized_commands_0_bytes/6000_commands                              1.00     15.2±0.11µs        ? ?/sec      1.10     16.8±0.10µs        ? ?/sec
sized_commands_0_bytes/8000_commands                              1.00     20.3±0.21µs        ? ?/sec      1.12     22.7±0.29µs        ? ?/sec
sized_commands_12_bytes/2000_commands                             1.02      7.4±0.09µs        ? ?/sec      1.00      7.2±0.08µs        ? ?/sec
sized_commands_12_bytes/4000_commands                             1.00     14.7±0.08µs        ? ?/sec      1.00     14.6±0.12µs        ? ?/sec
sized_commands_12_bytes/6000_commands                             1.00     22.1±0.10µs        ? ?/sec      1.00     22.0±0.18µs        ? ?/sec
sized_commands_12_bytes/8000_commands                             1.00     29.5±0.14µs        ? ?/sec      1.00     29.4±0.20µs        ? ?/sec
sized_commands_512_bytes/2000_commands                            1.00     51.7±1.68µs        ? ?/sec      1.05     54.1±2.70µs        ? ?/sec
sized_commands_512_bytes/4000_commands                            1.00    105.9±9.28µs        ? ?/sec      1.05    110.8±7.73µs        ? ?/sec
sized_commands_512_bytes/6000_commands                            1.00   162.1±20.46µs        ? ?/sec      1.04   169.2±22.51µs        ? ?/sec
sized_commands_512_bytes/8000_commands                            1.00   219.2±36.11µs        ? ?/sec      1.04   228.5±33.18µs        ? ?/sec
spawn_commands/2000_entities                                      1.00    184.7±5.82µs        ? ?/sec      1.05    193.2±5.30µs        ? ?/sec
spawn_commands/4000_entities                                      1.00   368.7±12.10µs        ? ?/sec      1.05   385.7±13.76µs        ? ?/sec
spawn_commands/8000_entities                                      1.00   754.7±26.15µs        ? ?/sec      1.03   777.9±21.58µs        ? ?/sec
spawn_world/10000_entities                                        1.00  1023.4±80.40µs        ? ?/sec      1.02  1040.3±84.50µs        ? ?/sec
spawn_world/1000_entities                                         1.00    101.8±8.01µs        ? ?/sec      1.04    106.4±9.10µs        ? ?/sec
spawn_world/100_entities                                          1.00     10.3±0.92µs        ? ?/sec      1.02     10.5±0.92µs        ? ?/sec
spawn_world/10_entities                                           1.00  1020.8±85.81ns        ? ?/sec      1.02  1043.8±99.74ns        ? ?/sec
world_entity/50000_entities                                       1.00     94.8±0.87µs        ? ?/sec      1.00     95.1±0.86µs        ? ?/sec
world_get/50000_entities_sparse                                   1.00    353.7±1.78µs        ? ?/sec      1.03    363.7±7.35µs        ? ?/sec
world_get/50000_entities_table                                    1.03    386.8±7.23µs        ? ?/sec      1.00    376.8±4.97µs        ? ?/sec
world_query_for_each/50000_entities_sparse                        1.01     47.9±0.63µs        ? ?/sec      1.00     47.6±0.27µs        ? ?/sec
world_query_for_each/50000_entities_table                         1.00     27.2±0.30µs        ? ?/sec      1.00     27.3±0.18µs        ? ?/sec
world_query_get/50000_entities_sparse_wide                        1.00    192.7±1.03µs        ? ?/sec      1.01    195.5±1.24µs        ? ?/sec
world_query_get/50000_entities_table                              1.00    137.2±3.03µs        ? ?/sec      1.00    137.6±0.85µs        ? ?/sec
world_query_get/50000_entities_table_wide                         1.00    242.7±1.90µs        ? ?/sec      1.01    245.0±4.77µs        ? ?/sec
world_query_iter/50000_entities_sparse                            1.00     54.1±0.34µs        ? ?/sec      1.01     54.5±0.33µs        ? ?/sec
world_query_iter/50000_entities_table                             1.00     27.3±0.19µs        ? ?/sec      1.00     27.3±0.79µs        ? ?/sec

james7132 avatar Nov 11 '22 13:11 james7132

Sans the API surface, this is ready for review.

james7132 avatar Nov 11 '22 14:11 james7132

Did some further investigation on why the perf difference grew bigger between for_each and iter. for_each will autovectorize if possible, while something else in iter is blocking the same optimization.

The following is the same hot section of code as in the original post, but in for_each instead of iter. Note its use of packed operations on the %xmm* (SSE) registers, including addps, a 4xf32 SIMD instruction.

.LBB2_13:
	movups	(%rdi,%rax,4), %xmm1
	movups	16(%rdi,%rax,4), %xmm2
	movdqu	%xmm0, (%rsi,%rax,4)
	movdqu	%xmm0, 16(%rsi,%rax,4)
	movups	(%rdx,%rax,4), %xmm3
	addps	%xmm1, %xmm3
	movups	16(%rdx,%rax,4), %xmm1
	addps	%xmm2, %xmm1
	movups	%xmm3, (%rdx,%rax,4)
	movups	%xmm1, 16(%rdx,%rax,4)
	movups	32(%rdi,%rax,4), %xmm1
	movups	48(%rdi,%rax,4), %xmm2
	movdqu	%xmm0, 32(%rsi,%rax,4)
	movdqu	%xmm0, 48(%rsi,%rax,4)
	movups	32(%rdx,%rax,4), %xmm3
	addps	%xmm1, %xmm3
	movups	48(%rdx,%rax,4), %xmm1
	addps	%xmm2, %xmm1
	movups	%xmm3, 32(%rdx,%rax,4)
	movups	%xmm1, 48(%rdx,%rax,4)
	addq	$16, %rax
	addq	$-2, %rcx
	jne	.LBB2_13
	testb	$1, %r15b
	je	.LBB2_16
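
For reference, the listing above is consistent with a dense loop of roughly the following shape: load a float from one array, add it into a second array, and store the same tick value into a third. This is a hypothetical reconstruction, since the actual test code from #6461 is not shown in this thread, and `integrate` and its parameters are invented names. Whether a given compiler vectorizes it depends on the optimization level and target.

```rust
/// Hypothetical dense loop: add each velocity into the matching position
/// and stamp the changed tick for every element that was written.
fn integrate(positions: &mut [f32], velocities: &[f32], changed_ticks: &mut [u32], tick: u32) {
    // Trim all three slices to a common length up front so the loop body
    // is a straight-line sequence over equally sized dense arrays.
    let len = positions.len().min(velocities.len()).min(changed_ticks.len());
    let positions = &mut positions[..len];
    let velocities = &velocities[..len];
    let changed_ticks = &mut changed_ticks[..len];

    for i in 0..len {
        positions[i] += velocities[i];
        changed_ticks[i] = tick;
    }
}

fn main() {
    let mut positions = vec![1.0_f32, 2.0, 3.0, 4.0];
    let velocities = vec![0.5_f32; 4];
    let mut changed_ticks = vec![0_u32; 4];

    integrate(&mut positions, &velocities, &mut changed_ticks, 9);

    println!("positions: {:?}", positions);
    println!("changed ticks: {:?}", changed_ticks);
}
```

With the ticks in their own 4-byte dense array, the tick store becomes a broadcast of one value across contiguous memory, which matches the movdqu stores of %xmm0 in the listing.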

james7132 avatar Nov 12 '22 12:11 james7132

Much nicer! You have my approval now. Still curious about the gap between for_each and iter, but that's a question for another day.

I'm almost 100% sure it's the autovectorization gap mentioned above. for_each autovectorizes much more readily than iter does.

james7132 avatar Nov 12 '22 14:11 james7132

bors try

james7132 avatar Nov 14 '22 08:11 james7132

bors r+

alice-i-cecile avatar Nov 21 '22 12:11 alice-i-cecile