
Performance on Different Systems

Open · ilan-gold opened this issue · 8 comments

The current benchmark results on my Mac differ wildly from those on Linux... not much more to say. A lot of users are on Mac, so it would be great to understand this.

ilan-gold avatar Sep 26 '24 08:09 ilan-gold

Below are some benchmarks on my system. The memory usage of zarrs-python is curious.

Read all

[image]

Chunk by chunk

[image]

| # | Image | Concurrency | Time (s), zarrs (rust) | Time (s), tensorstore (python) | Time (s), zarr (python) | Time (s), zarrs (python) | Memory (GB), zarrs (rust) | Memory (GB), tensorstore (python) | Memory (GB), zarr (python) | Memory (GB), zarrs (python) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | data/benchmark.zarr | 1 | 28.98 | 53.38 | 88.01 | 59.04 | 0.03 | 0.10 | 0.10 | 0.12 |
| 1 | data/benchmark.zarr | 2 | 14.77 | 30.28 | 75.53 | 46.94 | 0.03 | 0.31 | 0.31 | 8.72 |
| 2 | data/benchmark.zarr | 4 | 8.19 | 23.61 | 73.30 | 47.39 | 0.03 | 0.31 | 0.31 | 8.72 |
| 3 | data/benchmark.zarr | 8 | 4.35 | 22.59 | 70.71 | 49.45 | 0.03 | 0.31 | 0.32 | 8.72 |
| 4 | data/benchmark.zarr | 16 | 2.82 | 21.37 | 62.97 | 48.89 | 0.03 | 0.34 | 0.32 | 8.72 |
| 5 | data/benchmark.zarr | 32 | 2.48 | 19.34 | 58.23 | 47.42 | 0.03 | 0.34 | 0.32 | 8.72 |
| 6 | data/benchmark_compress.zarr | 1 | 22.88 | 47.06 | 101.05 | 51.15 | 0.03 | 0.10 | 0.13 | 0.12 |
| 7 | data/benchmark_compress.zarr | 2 | 12.57 | 28.03 | 94.76 | 38.57 | 0.03 | 0.32 | 0.34 | 8.72 |
| 8 | data/benchmark_compress.zarr | 4 | 7.03 | 23.53 | 95.64 | 38.65 | 0.03 | 0.32 | 0.34 | 8.71 |
| 9 | data/benchmark_compress.zarr | 8 | 3.92 | 21.15 | 84.48 | 37.77 | 0.03 | 0.32 | 0.34 | 8.72 |
| 10 | data/benchmark_compress.zarr | 16 | 2.26 | 19.33 | 77.08 | 39.26 | 0.04 | 0.34 | 0.34 | 8.72 |
| 11 | data/benchmark_compress.zarr | 32 | 2.05 | 17.30 | 70.61 | 38.28 | 0.04 | 0.35 | 0.35 | 8.71 |
| 12 | data/benchmark_compress_shard.zarr | 1 | 2.17 | 2.73 | 33.60 | 3.37 | 0.37 | 0.60 | 0.89 | 0.68 |
| 13 | data/benchmark_compress_shard.zarr | 2 | 1.62 | 2.26 | 28.78 | 3.67 | 0.70 | 0.90 | 1.40 | 8.81 |
| 14 | data/benchmark_compress_shard.zarr | 4 | 1.39 | 2.04 | 28.45 | 3.71 | 1.30 | 1.07 | 2.43 | 8.80 |
| 15 | data/benchmark_compress_shard.zarr | 8 | 1.35 | 1.93 | 27.81 | 3.60 | 2.36 | 1.43 | 4.72 | 8.81 |
| 16 | data/benchmark_compress_shard.zarr | 16 | 1.44 | 2.69 | 27.68 | 3.42 | 4.43 | 1.74 | 9.27 | 8.80 |
| 17 | data/benchmark_compress_shard.zarr | 32 | 2.07 | 2.20 | 31.37 | 3.41 | 6.66 | 2.94 | 18.41 | 8.81 |

LDeakin avatar Sep 28 '24 07:09 LDeakin

@LDeakin that's quite in line with what I saw on my Mac pre-security-shutdown. The read-all results made sense intuitively (Rust plus a bit of overhead); the chunk-by-chunk results were tougher to pin down, so it's good to see they're reproducible. Thanks so much for this. I think the memory usage / flat performance is accounted for by the fact that I'm hoovering up all available threads by default. I think there's an issue for making this configurable? I'm not sure what the best way is, though: an env variable or an API. An API is tough because the current Rust + Python bridge doesn't have a public one.
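
For what it's worth, here is a minimal sketch of the env-variable route; the variable name (`ZARRS_PYTHON_NUM_THREADS`) and the use of a global rayon pool are just assumptions for illustration, not anything in the crate today:

```rust
use rayon::ThreadPoolBuilder;

/// Size the global rayon pool from a (hypothetical) environment variable.
fn init_thread_pool_from_env() {
    let num_threads = std::env::var("ZARRS_PYTHON_NUM_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(0); // 0 lets rayon pick the number of logical cores

    // build_global can only succeed once per process; ignore the error if a
    // pool has already been initialised.
    let _ = ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .build_global();
}
```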

ilan-gold avatar Sep 28 '24 08:09 ilan-gold

@LDeakin looking into the parallelism a bit on my end. We are basically following their directions to a T, at least on the Rust side, if performance is our concern: https://pyo3.rs/v0.22.2/parallelism

Something that pops out to me: any thoughts on why sharding might be so performant across the board for the compiled implementations?

ilan-gold avatar Oct 02 '24 20:10 ilan-gold

Re: the above link, it could also be some overhead from async plus careless holding of the GIL, as Phil pointed out. Maybe we could release the GIL and allow "true" Python-level threading, as the example shows? That doesn't really account for the concurrent_chunks=1 difference, though... so I'd guess our overhead is in the extraction of the Python types. I've read that declaring types ahead of time can be a boost, so it might be worth trying that.
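
For reference, a minimal sketch of the GIL-release pattern from that pyo3 guide; the function and its arguments here are made up for illustration:

```rust
use pyo3::prelude::*;

/// Hypothetical CPU-bound entry point: release the GIL while the Rust work
/// runs so other Python threads can make progress in the meantime.
#[pyfunction]
fn decode_chunks(py: Python<'_>, chunks: Vec<Vec<u8>>) -> PyResult<usize> {
    // allow_threads releases the GIL for the duration of the closure.
    let total_bytes = py.allow_threads(move || {
        chunks.iter().map(|chunk| chunk.len()).sum::<usize>()
    });
    Ok(total_bytes)
}
```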

ilan-gold avatar Oct 02 '24 20:10 ilan-gold

Although that still doesn't account for the sharding. Then again, sharding might not be working at all in my benchmark... I don't think I explicitly accounted for it, so it's possible my code is erroring silently, unless the code I have somehow handles it.

ilan-gold avatar Oct 02 '24 20:10 ilan-gold

> Something that pops out to me: any thoughts on why sharding might be so performant across the board for the compiled implementations?

There are two areas where parallelism can be applied, in codecs and across chunks. Both zarrs and tensorstore use all available cores (where possible/efficient) by default. That chunk-by-chunk benchmark limits the number of chunks decoded concurrently, but still uses all available threads for decoding.

Sharding is an example of a codec extremely well-suited to parallel encoding/decoding. That benchmark has many "inner chunks" per "shard" (chunk), so the cores are getting well utilised by the compiled implementations even if only decoding 1 chunk at a time. I'd assume zarr-python sharding is single-threaded.
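
For intuition, a conceptual sketch (not the actual zarrs code) of why a sharded chunk parallelises well even at a chunk concurrency of 1: the inner chunks are independent, so a compiled implementation can fan them out across cores, e.g. with rayon:

```rust
use rayon::prelude::*;

/// Stand-in for the real codec pipeline applied to one inner chunk.
fn decode_inner_chunk(encoded: &[u8]) -> Vec<u8> {
    encoded.to_vec()
}

/// Decode every inner chunk of a single shard in parallel.
fn decode_shard(inner_chunks: &[Vec<u8>]) -> Vec<Vec<u8>> {
    inner_chunks
        .par_iter() // one rayon task per inner chunk keeps all cores busy
        .map(|chunk| decode_inner_chunk(chunk))
        .collect()
}
```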

Relevant documentation for zarrs:

  • https://docs.rs/zarrs/latest/zarrs/array/struct.Array.html#parallelism-and-concurrency
  • https://docs.rs/zarrs/latest/zarrs/config/struct.Config.html#codec-concurrent-target
  • https://docs.rs/zarrs/latest/zarrs/config/struct.Config.html#chunk-concurrent-minimum

If parallelism is external to zarrs (e.g. multiple concurrent Array::retrieve_ ops), it would be preferable to reduce the concurrent target to avoid potential thrashing. This can be done for individual retrieve/store operations with CodecOptions or by setting the global configuration.
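
A minimal sketch of the global-configuration route, assuming the setters named in the linked Config docs (codec concurrent target / chunk concurrent minimum); the exact function names, and the per-operation CodecOptions route, may differ between zarrs versions:

```rust
/// Reduce zarrs' internal concurrency when parallelism is applied externally,
/// e.g. when many Array::retrieve_* calls run concurrently.
fn limit_internal_concurrency() {
    // global_config_mut and these setters are taken from the linked docs;
    // treat them as an assumption rather than a guaranteed stable API.
    let mut config = zarrs::config::global_config_mut();
    config.set_codec_concurrent_target(4);
    config.set_chunk_concurrent_minimum(1);
}
```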

LDeakin avatar Oct 02 '24 21:10 LDeakin

Also looking at that benchmark again, do you have an idea of where the large allocation (8.7GB) is occurring in zarrs-python when concurrent chunks > 1?

LDeakin avatar Oct 02 '24 21:10 LDeakin

> it would be preferable to reduce the concurrent target to avoid potential thrashing

Quoting myself... but thrashing is not the right term here. That does not really happen with Rayon work stealing. It is more just a suboptimal work distribution. Defaults might be perfectly okay!

LDeakin avatar Oct 02 '24 21:10 LDeakin