ecs_bench_suite
Heavy compute does not give a good comparison of parallel iteration
Hi,
I think there should be a benchmark that compares how the libraries handle parallel iteration. Currently, the closest test for this would be heavy_compute, but the task (inverting a matrix 100 times) is not fine-grained enough to make a comparison of the parallel overhead (there is too much work per item).
I propose either:
- reducing the task in the parallel loop of `heavy_compute` (e.g., inverting the matrix once, or multiplying a float value, something very small), or
- introducing a new `parallel_light_compute` benchmark.
An example of option two is here: https://github.com/ElliotB256/ecs_bench_suite/tree/parallel_light_compute
Further discussion can be found here: https://github.com/bevyengine/bevy/issues/2173
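To make the proposal concrete, here is a minimal sketch of the difference in per-item work. Everything here is illustrative, not the suite's actual code: the `Transform` type is a stand-in, and the loop body is a placeholder for a real 4x4 matrix inversion.

```rust
#[derive(Clone, Copy)]
struct Transform([[f32; 4]; 4]); // stand-in for the benchmark's component type

// heavy_compute-style work: enough per-item cost that scheduling
// overhead is hidden by the computation itself.
fn heavy_item(t: &mut Transform) {
    for _ in 0..100 {
        t.0[0][0] += 1.0; // placeholder for one 4x4 matrix inversion
    }
}

// parallel_light_compute-style work: a single multiply, so almost all
// of the measured time is parallel-iteration overhead.
fn light_item(v: &mut f32) {
    *v *= 2.0;
}

fn main() {
    let mut t = Transform([[1.0; 4]; 4]);
    heavy_item(&mut t);
    assert!((t.0[0][0] - 101.0).abs() < 1e-3);

    let mut v = 3.0f32;
    light_item(&mut v);
    assert!(v == 6.0);
    println!("ok");
}
```

The point of the light variant is that the per-item closure is so cheap that the benchmark measures the scheduler, not the workload.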
The current heavy_compute shows bevy as about 2x slower than specs. However, parallel_light_compute (see the discussion linked above) shows that bevy is very sensitive to batch size and can be anywhere up to 10x slower than, e.g., specs.
In my results (where I merged your fork and others, updated all libraries, and made some adjustments), bevy is only 2x slower than specs in parallel_light_compute and is actually faster than the other libraries. It might be sensitive to thread count as well (I ran it on a 16c/32t system), or the situation may have improved drastically between bevy 0.5 and 0.6.
Thanks for looking!
However, a note: bevy is extremely sensitive to batch size, while the other libraries don't require a batch size to be set. Your file shows a batch size of 1024. In the discussion I posted above, you'll find the following table, which shows how bevy's timing scales with batch size:
| Batch Size | Time |
|---|---|
| 8 | 1.177 ms |
| 64 | 234.13 µs |
| 256 | 149.48 µs |
| 1024 | 130.48 µs |
| 4096 | 207.13 µs |
| 10,000 | 485.55 µs |
On my PC, 1024 was the optimal batch size for bevy. For comparison, specs took 108.00 µs, so bevy was about 2x slower than specs. However, in the worst case of an unoptimised batch size, bevy can be more than 10x slower (hence my numbers in the first post). I expect the 'ideal' batch size is both hardware- and System-dependent, and the optimum will rarely be achieved.
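For context on why batch size matters so much, here is a simplified stand-alone model of batched parallel iteration using plain `std::thread` (this is not bevy's task-pool implementation; the function name and workload are illustrative). Small batches mean many tasks and high per-task scheduling overhead; large batches mean few tasks and idle cores; which is consistent with the U-shaped timings in the table.

```rust
use std::thread;

// Simplified model of batched parallel iteration: the slice is split
// into chunks of `batch_size`, with one spawned task per chunk.
// batch_size = 8 on 100k items => ~12,500 tasks (overhead-dominated);
// batch_size = 100k => 1 task (no parallelism at all).
fn parallel_light_compute(data: &mut [f32], batch_size: usize) {
    thread::scope(|s| {
        for chunk in data.chunks_mut(batch_size) {
            s.spawn(move || {
                for v in chunk {
                    *v *= 2.0; // one multiply: very light per-item work
                }
            });
        }
    });
}

fn main() {
    let mut data = vec![1.0f32; 4096];
    parallel_light_compute(&mut data, 1024); // 4 tasks
    assert!(data.iter().all(|&v| v == 2.0));
    println!("ok");
}
```

Libraries built on work-stealing schedulers (e.g. rayon, which specs uses) pick split points adaptively, which is why they don't expose a batch-size knob to the user.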
(Disclaimer: my tests are still on bevy 0.5 and I haven't had time to run comparisons on 0.6 yet, but from other discussions my understanding is that the parallel performance did not change.)