
write_multiscale() writes to local path on Linux instead of cloud bucket when using dask>=2023.1.0

carshadi opened this issue 1 year ago • 4 comments

Please see https://github.com/dask/dask/issues/10324, which is the source of this problem.

Here is an MRE:

import numpy as np
import dask.array as da
import zarr
from ome_zarr.writer import write_multiscale

# Remote store pointing at an S3 bucket
store = zarr.storage.FSStore(url="s3://my-bucket/zarr-test.zarr", dimension_separator='/')
root_group = zarr.group(store=store, overwrite=True)
group = root_group.create_group(name="tile0")

# 5D (t, c, z, y, x) dask array
d = da.ones(shape=(1, 1, 1000, 1000, 1000), dtype=np.uint16, chunks=(1, 1, 100, 100, 100))
pyramid = [d]

# With dask>=2023.1.0 this writes to ./my-bucket/zarr-test.zarr instead of S3
write_multiscale(pyramid, group, chunks=(1, 1, 100, 100, 100), compute=True)

The issue is that when an FSStore is passed to dask.array.to_zarr(url=group.store) and storage_options is non-empty, this code path is taken: https://github.com/dask/dask/blob/596d74dbc72db663efac606d42d5c93c5a917fb9/dask/array/core.py#L3710

Passing an FSStore instance to the FSStore constructor on Linux results in the store being created in the current working directory, e.g., ./my-bucket/zarr-test.zarr, instead of at the correct S3 path (this is also true for GCS paths).

This does not occur on Windows 10.

Reverting to dask 2022.12.1 fixes the issue.
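To illustrate the failure mode, here is a stdlib-only stub that mimics the behaviour described above. FakeFSStore is a hypothetical stand-in for zarr.storage.FSStore, not the real class: the real constructor stringifies a non-string url before handing it to fsspec, which on Linux then treats it as a relative local path.

```python
# Stub mimicking the behaviour described above: re-wrapping an existing
# store in a new FSStore loses the s3:// scheme. FakeFSStore is a
# hypothetical stand-in for zarr.storage.FSStore, not the real class.
class FakeFSStore:
    def __init__(self, url, **storage_options):
        # A non-string url is stringified, so the scheme stripped by
        # __str__ below is never restored.
        self.path = str(url)

    def __str__(self):
        # The real store's string form is the mapper root, i.e. the
        # bucket path without the "s3://" scheme.
        return self.path.removeprefix("s3://")

remote = FakeFSStore("s3://my-bucket/zarr-test.zarr")
# What dask's to_zarr effectively does when storage_options is non-empty:
rewrapped = FakeFSStore(remote, chunks=(1, 1, 100, 100, 100))
print(rewrapped.path)  # "my-bucket/zarr-test.zarr": a relative local path
```

On Linux that relative path resolves under the current working directory, which matches the ./my-bucket/zarr-test.zarr result seen in the MRE.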

carshadi avatar Jun 05 '23 22:06 carshadi

From https://github.com/dask/dask/issues/10324#issuecomment-1570856656, my understanding is that if you've already created the store (as we have in ome-zarr-py), then you shouldn't need to specify any other storage options, because they should all be incorporated into the store itself (and in the case of chunks, the chunking of the dask array itself will be used).

The addition of storage options to write_multiscale and write_image was in https://github.com/ome/ome-zarr-py/pull/161/, before dask support was added. I don't know what the use case is for including these options in da.to_zarr(), i.e. which options might be useful or needed at that point. The chunks are already used to rechunk the data, so it may not be necessary to include them in the options.

will-moore avatar Jun 06 '23 10:06 will-moore

I agree that specifying chunks in storage_options is not necessary, since that option is ignored by the FSStore constructor and the zarr chunks are based on the dask array chunks.

carshadi avatar Jun 06 '23 16:06 carshadi

If the intended/future behaviour of dask at this point (see https://github.com/dask/dask/issues/10324#issuecomment-1570856656) is to "raise an informative error when url is an FSStore instance and storage_options are given", then we must not pass any storage options. That means we already need to use the storage_options when we create the store.
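A minimal sketch of what that intended check could look like (to_zarr_guard is an illustrative name, not a real dask function):

```python
# Hypothetical sketch of the check dask intends to add; to_zarr_guard is
# an illustrative stand-in, not part of the dask API.
def to_zarr_guard(url, storage_options=None):
    # storage_options only make sense when dask builds the store itself
    # from a URL string; an existing store already carries its options.
    if not isinstance(url, str) and storage_options:
        raise ValueError(
            "storage_options are only valid when url is a string, "
            "not an existing store instance"
        )
    return "ok"

to_zarr_guard("s3://my-bucket/zarr-test.zarr", {"anon": True})  # fine
```

Under that behaviour, ome-zarr-py would have to consume all storage options at store-creation time rather than forwarding them to da.to_zarr().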

The call stack for creating the store is:

parse_url()
  ZarrLocation.__init__()
    format.init_store()
      store = FSStore()

So you'd need to pass storage_options when creating the store, either by doing it yourself (as in the example above) or in parse_url().

This would mean possibly removing the storage_options from write_multiscale() and write_image() as added in #161.

This would mean NOT going ahead with the deprecation of 'chunks' at https://github.com/ome/ome-zarr-py/issues/167

Presumably we should also remove the line at https://github.com/ome/ome-zarr-py/blob/685eb89ab009560f5e41bd7cd0b7c067d91e0a2e/ome_zarr/writer.py#L251, which is the source of the error in the code above: the chunks option is redundant after rechunking, and it makes the options dict non-empty.
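As a rough sketch of that proposal (make_store and write_levels are hypothetical stand-ins for parse_url()/Format.init_store() and write_multiscale(), not real ome-zarr-py API), storage options would be bound to the store once, at creation, so the writer never has anything to forward to da.to_zarr():

```python
# Hypothetical sketch of the proposed flow; make_store and write_levels
# are illustrative stand-ins, not ome-zarr-py functions.
def make_store(url, **storage_options):
    # Stand-in for parse_url()/Format.init_store(): the options travel
    # with the store from here on.
    return {"url": url, "options": dict(storage_options)}

def write_levels(pyramid, store):
    # Stand-in for write_multiscale(): no storage_options parameter,
    # so nothing non-empty can leak into da.to_zarr().
    return [(store["url"], level) for level in pyramid]

store = make_store("s3://my-bucket/zarr-test.zarr", dimension_separator="/")
written = write_levels(["level0", "level1"], store)
```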

https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.FSStore

cc @sbesson

will-moore avatar Jun 07 '23 10:06 will-moore

Interestingly, I just reproduced the same issue on both Windows and Linux with dask 2023.6.0 when trying to convert a microscopy time series to zarr (TCZYX ≈ (1000, 1, 150, 2000, 1000)). Just like @carshadi said, reverting to dask 2022.12.1 gets me up and running again.

toloudis avatar Jun 12 '23 18:06 toloudis