ome-zarr-py
write_multiscale() writes to local path on Linux instead of cloud bucket when using dask>=2023.1.0
Please see this issue: https://github.com/dask/dask/issues/10324, which is the source of this problem.
Here is an MRE:
```python
import numpy as np
import dask.array as da
import zarr
from ome_zarr.writer import write_multiscale

store = zarr.storage.FSStore(url="s3://my-bucket/zarr-test.zarr", dimension_separator='/')
root_group = zarr.group(store=store, overwrite=True)
group = root_group.create_group(name="tile0")

d = da.ones(shape=(1, 1, 1000, 1000, 1000), dtype=np.uint16, chunks=(1, 1, 100, 100, 100))
pyramid = [d]

write_multiscale(pyramid, group, chunks=(1, 1, 100, 100, 100), compute=True)
```
The issue is that when an `FSStore` is passed to `dask.array.to_zarr(url=group.store)` and `storage_options` is non-empty, this code path is used: https://github.com/dask/dask/blob/596d74dbc72db663efac606d42d5c93c5a917fb9/dask/array/core.py#L3710. Passing an `FSStore` instance to the `FSStore` constructor on Linux results in the store being created in the current working directory, e.g., `./my-bucket/zarr-test.zarr`, instead of the correct S3 path (this is also true for GCS paths).
This does not occur on Windows 10.
Reverting to dask 2022.12.1 fixes the issue.
From https://github.com/dask/dask/issues/10324#issuecomment-1570856656, my understanding is that if you've already created the store (as we have in ome-zarr-py), then you shouldn't need to specify any other storage options, because they should all be incorporated into the store itself (or, in the case of `chunks`, the chunking of the dask array itself will be used).
The addition of storage options to `write_multiscale` and `write_image` was in https://github.com/ome/ome-zarr-py/pull/161/, before support for dask was added. I don't know what the use-case is for including these options in `da.to_zarr()` - what options might be useful/needed at that point. The chunks are already used to rechunk the data, so they may not be necessary to include in the options.
I agree that specifying `chunks` in `storage_options` is not necessary, since that option is ignored by the `FSStore` constructor and the zarr chunks are based on the dask array chunks.
If the intended/future behaviour of dask at this point (see https://github.com/dask/dask/issues/10324#issuecomment-1570856656) is to "raise an informative error when url is an FSStore instance and storage_options are given", then we must not pass any storage options. That means we already need to apply the `storage_options` when we create the store.
The call stack for that is:

```
parse_url()
  ZarrLocation.__init__()
    format.init_store()
      store = FSStore()
```

So you'd need to pass `storage_options` to the creation of the store, either if you do it yourself (as in the example above) or in `parse_url()`.
This would mean possibly removing the `storage_options` argument from `write_multiscale()` and `write_image()` as added in #161.
This would mean NOT going ahead with the deprecation of 'chunks' at https://github.com/ome/ome-zarr-py/issues/167
Presumably we should also remove the line at https://github.com/ome/ome-zarr-py/blob/685eb89ab009560f5e41bd7cd0b7c067d91e0a2e/ome_zarr/writer.py#L251, which is the source of the error from the code above: that option is redundant after rechunking, and it means that `options` is not empty.
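For illustration only (hypothetical names, not the actual writer code): inserting `chunks` into the options dict is enough to make it non-empty, which is what flips dask onto the store-reconstruction path described above:

```python
# Hypothetical sketch, not ome-zarr-py's actual code: once "chunks" is
# added, the options dict is truthy, and dask would then rebuild the
# FSStore from scratch instead of using the one it was given.
def build_options(storage_options=None, chunks=None):
    options = dict(storage_options or {})
    if chunks is not None:
        options["chunks"] = chunks  # redundant after rechunking
    return options

assert build_options() == {}               # empty: store is used as-is
assert build_options(chunks=(1, 1)) != {}  # non-empty: bad path is taken
```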
https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.FSStore
cc @sbesson
Interestingly, I just reproduced the same issue both on Windows and on Linux with dask 2023.6.0 when trying to convert a microscopy time series to zarr (TCZYX ≈ (1000, 1, 150, 2000, 1000)). Just like @carshadi said, reverting to dask 2022.12.1 gets me up and running again.