
Differences in time to load sample based on `chunks` parameter in `xr.open_mfdataset()`

dfulu opened this issue 1 year ago

What is your issue?

A common use case for us working with xarray + zarr for ML is sampling chunks of data from multiple zarr files. I have noticed that the time taken to load a sample differs significantly based on whether I set chunks="auto" or chunks={} in xr.open_mfdataset().

I understand that this parameter changes the chunk size which dask is using to access the zarr file, and so a difference in speed might be expected. However, in my use case I am generally sampling chunks of data which are smaller than the chunk size on disk. So often only 1 chunk needs to be loaded from the disk to create the sample.

When using chunks="auto", dask increases the size of the chunks it uses to access the zarr file, since my data is highly chunked on disk. However, I find that this leads to a faster load time than when I set chunks={} and force dask to use the on-disk chunk size.

This seems counterintuitive to me. I would expect that with chunks={} it should be faster to load a piece of data which is smaller than the chunk size. I am assuming that, when using chunks="auto" and dask makes the chunk size larger, it will load more data from disk and then discard most of it to slice down to the requested piece.

I have made a minimal working example which demonstrates the issue. I've tested it on both a Linux server and a Mac laptop and found consistent results.
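The linked MWE isn't reproduced in this thread, but as a rough stand-in (not the author's actual code), the effect can be sketched with plain dask arrays, where tiny chunks mimic a highly chunked zarr store; all sizes here are illustrative:

```python
import numpy as np
import dask.array as da

# A 1000x1000 array held in tiny 10x10 chunks, mimicking a highly
# chunked file on disk (the chunks={} case in open_mfdataset).
native = da.ones((1000, 1000), chunks=(10, 10))

# Let dask consolidate chunks, as chunks="auto" would.
auto = native.rechunk("auto")

# Either way we only want a small sample, far smaller than one chunk.
sample_native = native[:5, :5].compute()
sample_auto = auto[:5, :5].compute()

# The tiny-chunk array carries a vastly larger task graph, and graph
# overhead, not bytes read, is where the time difference tends to come from.
print(native.npartitions, "vs", auto.npartitions)
```

Wrapping the two `.compute()` calls in a timer (e.g. `%timeit` or `time.perf_counter`) shows the per-task overhead that the issue is about.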

Could anyone advise on why chunks={} is slower and what the best practice might be in this case?

Thanks!

dfulu avatar Oct 11 '24 21:10 dfulu

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Oct 11 '24 21:10 welcome[bot]

Sorry you never got a response on this @dfulu. Your finding actually matches my expectation. When you use chunks="auto", dask uses its heuristics to create dask chunks that are some multiple of the on-disk chunks. Dask tends to want those chunks to be ~100MB. That size works well because it is big enough that the computation time per task tends to be substantially bigger than the overhead associated with the task. Basically, you don't want a million tiny tasks in dask.
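For reference, the chunk size dask's "auto" heuristic targets is configurable via the `array.chunk-size` option (the library default is 128MiB); a small sketch, with illustrative array sizes and the target pinned explicitly so the result is deterministic:

```python
import dask
import dask.array as da

# Pin the target so "auto" aims for ~100MB chunks, matching the
# rule of thumb above (dask's own default is 128MiB).
with dask.config.set({"array.chunk-size": "100MiB"}):
    # 50_000 x 50_000 float64 is ~20GB in total, so "auto" must
    # split it into many chunks, each at most ~100MiB.
    arr = da.ones((50_000, 50_000), chunks="auto")
    print(arr.chunksize, arr.npartitions)
```

No data is allocated here; the array is lazy, so this only inspects the chunking decision.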

jsignell avatar Dec 10 '25 19:12 jsignell

Thanks for the response @jsignell. I'll close this now

dfulu avatar Dec 11 '25 14:12 dfulu