xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Add initialize_zarr

Open dcherian opened this issue 1 year ago • 11 comments

  • [x] Closes #8343
  • [x] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] New functions/methods are listed in api.rst

The intended pattern is:


    after_init = initialize_zarr(store, ds, region_dims=("x",))
    for i in range(ds.sizes["x"]):
        after_init.isel(x=[i]).to_zarr(store, region={"x": slice(i, i + 1)})

cc @slevang

dcherian avatar Nov 16 '23 19:11 dcherian

Going that route would potentially require resolving the issue of needing to drop non-region containing variables that you address here though, or at least more documentation.

Yes, i was trying to make an easy recommended pattern for region writes. To do that, it seemed like to_zarr wasn't the right place.

dcherian avatar Nov 16 '23 22:11 dcherian

I just pushed a little experiment that could be a really nice path forward here: c0cf4ee.

Demo:

ds = xr.tutorial.open_dataset('rasm').chunk('5mb')

# baseline
mapper_a = fs.get_mapper("s3://foo/init-zarr-test-a")
%time xr.zeros_like(ds).to_zarr(mapper_a, mode='w', compute=False)
# CPU times: user 296 ms, sys: 7.16 ms, total: 303 ms
# Wall time: 2.75 s

# my most recent commit
mapper_b = fs.get_mapper("s3://foo/init-zarr-test-b")
%time store = init_zarr_v2(ds, store=mapper_b)
# CPU times: user 41.4 ms, sys: 8.19 ms, total: 49.6 ms
# Wall time: 144 ms

# it works!
a = xr.open_zarr(mapper_a)
b = xr.open_zarr(mapper_b)
xr.testing.assert_equal(a, b)

The key realization I made is that if we aren't going to write the arrays, we can just use the Zarr API to create the metadata objects and push them to store.

Some caveats about what I just pushed:

  • Missing a bunch of feature flags, tests, etc.
  • Currently makes an assumption that the target store is clean and ready to be initialized
  • May also be missing some specialized encoding around strings / object dtypes. This is all in the Zarr backend so could likely be abstracted away to share a common code path.

jhamman avatar Nov 17 '23 18:11 jhamman

Nice!!


(One small difference between that and the PR — we possibly should write any var that doesn't have a region dim, not only the indexes)

max-sixty avatar Nov 17 '23 20:11 max-sixty

Sorry @dcherian I haven't had a chance to look at this in detail yet, but overall it seems really nice.

One random thought on the API: is there a way to configure options such that ds.to_zarr(..., compute=False) could just call this method, since it does effectively the same thing but much faster? This way users get an immediate performance improvement without having to change existing code, plus you can still use the standalone xr.initialize_zarr to get back a template dataset for region writes.

slevang avatar Jan 18 '24 16:01 slevang

Hi @dcherian, great work! This has solved one of the performance bottlenecks for us.

We are finding Xarray fairly slow in writing small Zarrs to S3 with many variables/DataArrays.

It seems slow in:

  • Creating the metadata (due to latency?), which your PR solves. I tested this.
  • Writing many chunks to S3 (due to latency?)
    • It's significantly faster to write to disk/memory then a parallel upload to s3.

Are there any plans to progress this PR. It would be good to have this merged.

petewa avatar Apr 02 '24 02:04 petewa

I share the perspective that it would be nicer to implement this within the existing to_zarr API rather than add a new top-level function to Xarray. @dcherian do you see a path to doing that?

rabernat avatar Apr 02 '24 15:04 rabernat

do you see a path to doing that?

Yes i too agree now, but it'll take some work (=time). I'm thinking we add a initialize method to ZarrStore and use that in store (?).

dcherian avatar Apr 02 '24 15:04 dcherian

I'm thinking we add a initialize method to ZarrStore and use that in store (?).

This makes sense to me. There's no reason for this not to be the default for all Zarr dataset creation. It will massively speed things up in most cases.

Thinking bigger, it should be possible to upstream some of this to Zarr itself. But that can come later.

rabernat avatar Apr 02 '24 15:04 rabernat

To the extent that it's helpful and not annoying to add "this would be really useful": this would be really useful! It seems fairly close IIUC (though easy for me to say...)

One reflection from encouraging more Zarr usage with colleagues is that it is a) really spectacular for reading data, b) still much more complicated to write data for the average dev relative to something like parquet. I think this would close a lot of the gap...

max-sixty avatar Jun 19 '24 18:06 max-sixty

I'm fine with getting this out into the wild, but I am hesitant to commit to it as permanent API. Could we say it is an experimental feature, don't depend on it in production, etc?

rabernat avatar Jun 20 '24 15:06 rabernat

I'm fine with getting this out into the wild, but I am hesitant to commit to it as permanent API. Could we say it is an experimental feature, don't depend on it in production, etc?

Definitely +1 from me

max-sixty avatar Jun 20 '24 17:06 max-sixty