Performance on Different Systems
The current benchmark on my Mac differs wildly from that on Linux... not much more to say. A lot of users are on Mac, so it would be great to understand this.
Below are some benchmarks on my system. The memory usage of zarrs-python is curious.
**Read all**

**Chunk by chunk**
|   | Image | Concurrency | Time (s) zarrs (rust) | Time (s) tensorstore (python) | Time (s) zarr (python) | Time (s) zarrs (python) | Memory (GB) zarrs (rust) | Memory (GB) tensorstore (python) | Memory (GB) zarr (python) | Memory (GB) zarrs (python) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | data/benchmark.zarr | 1 | 28.98 | 53.38 | 88.01 | 59.04 | 0.03 | 0.10 | 0.10 | 0.12 |
| 1 | data/benchmark.zarr | 2 | 14.77 | 30.28 | 75.53 | 46.94 | 0.03 | 0.31 | 0.31 | 8.72 |
| 2 | data/benchmark.zarr | 4 | 8.19 | 23.61 | 73.30 | 47.39 | 0.03 | 0.31 | 0.31 | 8.72 |
| 3 | data/benchmark.zarr | 8 | 4.35 | 22.59 | 70.71 | 49.45 | 0.03 | 0.31 | 0.32 | 8.72 |
| 4 | data/benchmark.zarr | 16 | 2.82 | 21.37 | 62.97 | 48.89 | 0.03 | 0.34 | 0.32 | 8.72 |
| 5 | data/benchmark.zarr | 32 | 2.48 | 19.34 | 58.23 | 47.42 | 0.03 | 0.34 | 0.32 | 8.72 |
| 6 | data/benchmark_compress.zarr | 1 | 22.88 | 47.06 | 101.05 | 51.15 | 0.03 | 0.10 | 0.13 | 0.12 |
| 7 | data/benchmark_compress.zarr | 2 | 12.57 | 28.03 | 94.76 | 38.57 | 0.03 | 0.32 | 0.34 | 8.72 |
| 8 | data/benchmark_compress.zarr | 4 | 7.03 | 23.53 | 95.64 | 38.65 | 0.03 | 0.32 | 0.34 | 8.71 |
| 9 | data/benchmark_compress.zarr | 8 | 3.92 | 21.15 | 84.48 | 37.77 | 0.03 | 0.32 | 0.34 | 8.72 |
| 10 | data/benchmark_compress.zarr | 16 | 2.26 | 19.33 | 77.08 | 39.26 | 0.04 | 0.34 | 0.34 | 8.72 |
| 11 | data/benchmark_compress.zarr | 32 | 2.05 | 17.30 | 70.61 | 38.28 | 0.04 | 0.35 | 0.35 | 8.71 |
| 12 | data/benchmark_compress_shard.zarr | 1 | 2.17 | 2.73 | 33.60 | 3.37 | 0.37 | 0.60 | 0.89 | 0.68 |
| 13 | data/benchmark_compress_shard.zarr | 2 | 1.62 | 2.26 | 28.78 | 3.67 | 0.70 | 0.90 | 1.40 | 8.81 |
| 14 | data/benchmark_compress_shard.zarr | 4 | 1.39 | 2.04 | 28.45 | 3.71 | 1.30 | 1.07 | 2.43 | 8.80 |
| 15 | data/benchmark_compress_shard.zarr | 8 | 1.35 | 1.93 | 27.81 | 3.60 | 2.36 | 1.43 | 4.72 | 8.81 |
| 16 | data/benchmark_compress_shard.zarr | 16 | 1.44 | 2.69 | 27.68 | 3.42 | 4.43 | 1.74 | 9.27 | 8.80 |
| 17 | data/benchmark_compress_shard.zarr | 32 | 2.07 | 2.20 | 31.37 | 3.41 | 6.66 | 2.94 | 18.41 | 8.81 |
@LDeakin that's quite in line with what I saw pre-security shutdown on my Mac. The read-all results made sense intuitively (Rust plus a bit of overhead); the chunk-by-chunk results were tougher to pin down, so it's good to see they're reproducible. Thanks so much for this. I think the memory usage/flat performance is accounted for by the fact that I'm hoovering up all available threads by default. I think there's an issue for making this configurable? I'm not sure what the best approach is, though: an env variable or an API? The API route is tough because the current Rust + Python bridge doesn't have a public API.
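For what it's worth, a minimal sketch of the env-variable route, assuming zarrs parallelises over Rayon's global thread pool (Rayon is mentioned later in this thread). The variable name `ZARRS_PYTHON_NUM_THREADS` is made up for illustration; zarrs-python does not currently expose anything like this:

```rust
use std::env;

// Hypothetical: cap the global Rayon pool from an env variable at module
// init, before any parallel work runs. The variable name is illustrative only.
fn init_thread_pool_from_env() {
    if let Some(n) = env::var("ZARRS_PYTHON_NUM_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
    {
        // build_global() errors if a global pool already exists; ignore that here.
        let _ = rayon::ThreadPoolBuilder::new()
            .num_threads(n)
            .build_global();
    }
}
```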
@LDeakin looking into the parallelism a bit on my end. We are basically following their directions to a T, at least on the Rust side, if performance is our concern: https://pyo3.rs/v0.22.2/parallelism
Something that pops out to me: any thoughts on why sharding might be so performant across the board for the compiled stuff?
Re: the above link, there could also be some overhead from async plus careless holding of the GIL, as Phil pointed out. Maybe we could release the GIL and allow "true" Python-level threading as the example shows? That doesn't really account for the concurrent_chunks=1 difference though... so I'd guess our overhead is in the extraction of the Python types. I read that declaring types ahead of time can be a boost, so it might be worth trying that.
That still doesn't account for the sharding, although sharding might not be working at all... I don't think I accounted for it, so it's possible my code is erroring silently, unless the code I have somehow handles it already.
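To make the GIL-release idea concrete, here is a minimal sketch along the lines of the pyo3 parallelism guide linked above; `retrieve_chunk_bytes` and `decode_chunk` are stand-in names, not the actual zarrs-python bindings:

```rust
use pyo3::prelude::*;

// Stand-in for the real CPU-bound zarrs decode work.
fn decode_chunk(_chunk_index: u64) -> Vec<u8> {
    vec![]
}

#[pyfunction]
fn retrieve_chunk_bytes(py: Python<'_>, chunk_index: u64) -> PyResult<Vec<u8>> {
    // Release the GIL for the duration of the decode so other Python-level
    // threads can make progress ("true" Python-level threading).
    let bytes = py.allow_threads(move || decode_chunk(chunk_index));
    Ok(bytes)
}
```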
> Something that pops out to me: any thoughts on why sharding might be so performant across the board for the compiled stuff?
There are two areas where parallelism can be applied, in codecs and across chunks. Both zarrs and tensorstore use all available cores (where possible/efficient) by default. That chunk-by-chunk benchmark limits the number of chunks decoded concurrently, but still uses all available threads for decoding.
Sharding is an example of a codec extremely well-suited to parallel encoding/decoding. That benchmark has many "inner chunks" per "shard" (chunk), so the cores are getting well utilised by the compiled implementations even if only decoding 1 chunk at a time. I'd assume zarr-python sharding is single-threaded.
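To illustrate: each shard holds many independently encoded inner chunks, so decoding even a single shard can fan out across cores. A toy Rayon sketch of that shape (not the actual zarrs sharding codec) might look like:

```rust
use rayon::prelude::*;

// Stand-in for the per-inner-chunk codec pipeline (e.g. decompression).
fn decode_inner_chunk(encoded: Vec<u8>) -> Vec<u8> {
    encoded
}

// Decoding one shard still saturates the cores because every inner chunk is
// an independent unit of work that Rayon can steal across threads.
fn decode_shard(inner_chunks: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    inner_chunks
        .into_par_iter()
        .map(decode_inner_chunk)
        .collect()
}
```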
Relevant documentation for zarrs:
- https://docs.rs/zarrs/latest/zarrs/array/struct.Array.html#parallelism-and-concurrency
- https://docs.rs/zarrs/latest/zarrs/config/struct.Config.html#codec-concurrent-target
- https://docs.rs/zarrs/latest/zarrs/config/struct.Config.html#chunk-concurrent-minimum
If parallelism is external to zarrs (e.g. multiple concurrent Array::retrieve_ ops), it would be preferable to reduce the concurrent target to avoid potential thrashing. This can be done for individual retrieve/store operations with CodecOptions or by setting the global configuration.
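As a rough sketch of the global-configuration option (the setter names are assumed to match the `codec_concurrent_target` / `chunk_concurrent_minimum` options in the Config docs linked above; worth double-checking against the zarrs version in use):

```rust
// When concurrency is driven externally (e.g. N retrieve ops in flight at
// once), cap zarrs' internal codec concurrency so the total stays near the
// core count rather than N * cores.
fn limit_internal_parallelism(external_concurrency: usize) {
    let cores = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // Roughly split the cores across the externally driven operations.
    let per_op = (cores / external_concurrency.max(1)).max(1);

    // Setter names assumed from the documented config options.
    let mut config = zarrs::config::global_config_mut();
    config.set_codec_concurrent_target(per_op);
    config.set_chunk_concurrent_minimum(1);
}
```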
Also looking at that benchmark again, do you have an idea of where the large allocation (8.7GB) is occurring in zarrs-python when concurrent chunks > 1?
> it would be preferable to reduce the concurrent target to avoid potential thrashing
Quoting myself... but thrashing is not the right term here. That does not really happen with Rayon work stealing. It is more just a suboptimal work distribution. Defaults might be perfectly okay!