
Heavy compute does not give a good comparison of parallel iter

Open · ElliotB256 opened this issue · 2 comments

Hi,

I think there should be a benchmark that compares how the libraries handle parallel iteration. Currently, the closest test for this would be heavy_compute, but the task (inverting a matrix 100 times) is not fine-grained enough to make a comparison of the parallel overhead (there is too much work per item).
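For context, the per-item work in heavy_compute is roughly the following (a sketch using glam, not the suite's exact code; the function and component names here are illustrative):

```rust
use glam::{Mat4, Vec3};

// Sketch of heavy_compute's per-item task: ~100 matrix inversions per
// entity, so scheduling/batching overhead is lost in the noise.
fn heavy_task(mat: &mut Mat4, pos: &mut Vec3) {
    for _ in 0..100 {
        *mat = mat.inverse();
        *pos = mat.transform_point3(*pos);
    }
}
```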

I propose either:

  • reducing the task in the parallel loop of heavy_compute (e.g., to inverting the matrix once, or multiplying a single float value; something very small), or
  • introducing a new parallel_light_compute benchmark.

An example of option two is here: https://github.com/ElliotB256/ecs_bench_suite/tree/parallel_light_compute

Further discussion can be found here: https://github.com/bevyengine/bevy/issues/2173
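To make the shape of the proposal concrete, here is a minimal sketch of a light per-item task using specs' par_join (the component names and workload are hypothetical, not the exact code on the branch; requires specs' `parallel` feature):

```rust
use rayon::prelude::*;
use specs::prelude::*;

// Hypothetical components; the point is that the per-entity work is tiny.
struct Position(f32);
impl Component for Position {
    type Storage = VecStorage<Self>;
}

struct Velocity(f32);
impl Component for Velocity {
    type Storage = VecStorage<Self>;
}

struct LightCompute;
impl<'a> System<'a> for LightCompute {
    type SystemData = (WriteStorage<'a, Position>, ReadStorage<'a, Velocity>);

    fn run(&mut self, (mut pos, vel): Self::SystemData) {
        // One multiply-add per entity: the benchmark then measures the
        // parallel-iteration overhead rather than the workload itself.
        (&mut pos, &vel).par_join().for_each(|(p, v)| {
            p.0 += v.0 * 0.016;
        });
    }
}
```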

The current heavy_compute shows bevy as roughly 2x slower than specs. However, parallel_light_compute (see the discussion linked above) shows that bevy is very sensitive to batch size and can be anywhere up to 10x slower than, e.g., specs.

ElliotB256 commented on Jan 20, 2022

> The current heavy_compute shows bevy as roughly 2x slower than specs. However, parallel_light_compute (see the discussion linked above) shows that bevy is very sensitive to batch size and can be anywhere up to 10x slower than, e.g., specs.

In my results (where I merged your fork and others, and updated all libraries along with some other adjustments), bevy is only 2x slower than specs in parallel_light_compute, and is actually faster than the other libraries. It might be sensitive to thread count as well (I ran it on a 16c/32t system), or the situation improved drastically between bevy 0.5 and 0.6.
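If thread count is a factor, one way to probe it would be to pin the task pool size when benchmarking (a methodological sketch, not what my runs did):

```rust
use bevy::tasks::TaskPoolBuilder;

fn main() {
    // Build a task pool with an explicit thread count to compare runs
    // at, say, 4, 8, 16, and 32 threads.
    let pool = TaskPoolBuilder::new().num_threads(8).build();
    println!("threads: {}", pool.thread_num());
}
```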

Systemcluster commented on Jan 24, 2022

Thanks for looking!

However, a note: bevy is extremely sensitive to batch size, while the other libraries don't need a batch size to be set at all. Your file shows the batch size set to 1024. In the discussion I linked above, you'll find the following table, which shows how bevy's timing scales with batch size:

| Batch Size | Time      |
|-----------:|----------:|
| 8          | 1.177 ms  |
| 64         | 234.13 µs |
| 256        | 149.48 µs |
| 1024       | 130.48 µs |
| 4096       | 207.13 µs |
| 10,000     | 485.55 µs |
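For reference, the batch size in question is an explicit argument to bevy's parallel query API (a 0.5-era sketch with a hypothetical component; the exact signature may differ between versions):

```rust
use bevy::prelude::*;
use bevy::tasks::ComputeTaskPool;

// Hypothetical component for a light workload.
struct Value(f32);

// The second argument to par_for_each_mut is the batch size varied in
// the table above; specs exposes no equivalent knob.
fn light_compute(pool: Res<ComputeTaskPool>, mut query: Query<&mut Value>) {
    query.par_for_each_mut(&pool, 1024, |mut v| {
        v.0 *= 2.0; // tiny per-entity task
    });
}
```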

On my PC, 1024 was the optimal batch size for bevy. For comparison, specs took 108.00 µs, so bevy was roughly 2x slower than specs. However, in the worst case of an unoptimised batch size, bevy remains >10x slower (hence my numbers in the first post). I expect the 'ideal' batch size is both hardware- and system-dependent, and the optimum will rarely be achieved.
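Absent a better answer, the batch size would have to be picked by some heuristic, e.g. (purely illustrative, not something the suite or bevy implements):

```rust
// Illustrative heuristic: split entities evenly across threads, with a
// floor so tiny batches don't drown in per-batch scheduling overhead.
fn pick_batch_size(entity_count: usize, thread_count: usize) -> usize {
    (entity_count / thread_count.max(1)).max(64)
}
```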

(Disclaimer: my tests are still on bevy 0.5 and I haven't had time to run comparisons for 0.6 yet, but from other discussions my understanding is that the parallel performance did not change.)

ElliotB256 commented on Jan 24, 2022