
Rechunking zarr file on S3 becomes very slow as the number of initialized chunks increases

Open · mpu-creare opened this issue 4 years ago · 9 comments

Problem: I have a file stored on S3 and I wanted to store a second copy with a different chunk structure (optimized for time-series reads instead of spatial reads). Trying to directly rechunk the data on S3 turned out to be too slow.

Question: Is there a better way to rechunk zarr files on S3 than my code below? E.g. am I missing a flag or something that will speed this up?

Code Sample: Unfortunately this code is not runnable because I don't want to share my data (and I also wrote it from memory, since I've since released the cloud server where I did the work).

>>> import s3fs
>>> import zarr
>>> import time
>>> # Point to the zarr array on S3
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='us-west-1'))
>>> store = s3fs.S3Map(root=<my_zarr_file_on_s3>, s3=s3, check=False)
>>> # Open the group in 'append' mode so that I can create the new array and write to it
>>> root = zarr.open(store=store, mode="a")
>>> z = root["space_chunked"]
>>> z.shape
(1600, 3600, 1800)
>>> z.chunks
(128, 128, 16)
>>> zt = root.create_dataset(
...     "time_chunked",
...     shape=z.shape,
...     chunks=(16, 16, 128),
...     dtype=z.dtype,
...     fill_value=z.fill_value,
... )
>>> # Copy the data across one destination-sized chunk at a time
>>> for i in range(z.shape[0] // zt.chunks[0] + 1):
...     for j in range(z.shape[1] // zt.chunks[1] + 1):
...         for k in range(z.shape[2] // zt.chunks[2] + 1):
...             slc = tuple(slice(ijk * zt.chunks[ii], (ijk + 1) * zt.chunks[ii]) for ii, ijk in enumerate([i, j, k]))
...             start = time.time()
...             zt[slc] = z[slc]
...             print("Updated in {}".format(time.time() - start))

Results: Writing the initial chunks took on the order of ~0.3 seconds each. It then slowed down to the point where each chunk took ~9 seconds. At that point I stopped the operation, copied the data to a cloud server, rechunked it there, and then uploaded the result to S3 again.

Discussion: I'm assuming zarr is doing a list operation (or something similar) during writes, and that gets pretty slow on S3 as the number of keys grows...
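[Editor's note: one way to check that assumption is to wrap the store in a thin counting layer and see which operations zarr actually issues as chunks are written. This is a minimal sketch, not part of the original report; it relies only on the fact that zarr 2.x accepts any MutableMapping as a store.]

import collections
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Delegate to an inner zarr store and count the operations issued,
    to see whether listing/containment checks dominate as the store grows."""

    def __init__(self, inner):
        self.inner = inner
        self.counts = collections.Counter()

    def __getitem__(self, key):
        self.counts["get"] += 1
        return self.inner[key]

    def __setitem__(self, key, value):
        self.counts["set"] += 1
        self.inner[key] = value

    def __delitem__(self, key):
        self.counts["del"] += 1
        del self.inner[key]

    def __contains__(self, key):
        self.counts["contains"] += 1
        return key in self.inner

    def __iter__(self):
        self.counts["iter"] += 1
        return iter(self.inner)

    def __len__(self):
        self.counts["len"] += 1
        return len(self.inner)


# Hypothetical usage: wrap the S3Map from the snippet above, copy a few
# chunks, then inspect which operations were issued and how often.
# counting = CountingStore(store)
# root = zarr.open(store=counting, mode="a")
# ... copy a few chunks ...
# print(counting.counts)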

I described my simplest use case above, but I also run into this when appending to a dataset on S3 and when creating a large processed dataset.

I'm using serverless (Lambda functions) for most of these operations, and for some use cases they end up timing out once the number of stored chunks becomes too large. My hacked solution is to copy the chunks that will be updated to local temporary storage, update the chunks there, then copy them back to S3... MUCH faster. It would be nice if Zarr could do something similar under the hood given the correct flags (perhaps skipping some verification).
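[Editor's note: for reference, a minimal sketch of that copy-locally-then-push-back workaround using zarr's built-in stores. The bucket name and local path are hypothetical, and it assumes a zarr 2.x release that provides zarr.copy_store for copying raw keys between stores.]

import s3fs
import zarr

s3 = s3fs.S3FileSystem(anon=False)
remote = s3fs.S3Map(root="my-bucket/data.zarr", s3=s3, check=False)

# 1. Pull the group down to fast local storage (e.g. instance SSD or /tmp).
local = zarr.DirectoryStore("/tmp/data.zarr")
zarr.copy_store(remote, local)

# 2. Rechunk locally, where per-chunk reads and writes are cheap.
root = zarr.open(local, mode="a")
z = root["space_chunked"]
zt = root.create_dataset(
    "time_chunked",
    shape=z.shape,
    chunks=(16, 16, 128),
    dtype=z.dtype,
    fill_value=z.fill_value,
)
zt[:] = z[:]  # or copy block by block if the array does not fit in memory

# 3. Push only the newly written array back to S3.
zarr.copy_store(local, remote, source_path="time_chunked", dest_path="time_chunked")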

Version and installation information

Please provide the following:

  • Value of zarr.__version__: 2.3.2
  • Value of numcodecs.__version__: 0.6.4
  • Version of Python interpreter: 3.7
  • Version of s3fs: 0.4.2
  • Operating system (Linux/Windows/Mac): Linux
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): both

mpu-creare · Apr 27 '20

Very good issue. There is a lot to unpack here. But I just wanted to point you to https://github.com/dask/s3fs/issues/285 -- this could likely be the source of some slowness in s3fs. Can you report your s3fs version as well?

rabernat · Apr 27 '20

Updated the issue above with s3fs version 0.4.2.

Thanks for pointing me to dask/s3fs#285 ... I will try it next time I have a chance.

mpu-creare · Apr 27 '20

Is there any update on this? I'm facing similar issues even though https://github.com/fsspec/s3fs/issues/285 was merged.

kw90 · Jan 05 '22

@kw90: how close is what you are doing to the code above?

joshmoore · Jan 06 '22

@kw90: how close is what you are doing to the code above?

I'm not rechunking the zarr file on S3, but simply creating datasets using the default chunk structure (for now - I'm looking forward to using chunks optimized for time-series).

  • Writing files to the S3 key/value store using boto3 takes around 2s/file.
  • Creating a dataset on S3 for a file using Zarr and s3fs takes around 50s/file, while creating the same dataset locally takes around 0.2s/file.

It's probably a completely different issue than this rechunking though.

kw90 · Jan 06 '22

If, as @rabernat suggests, s3fs is doing expensive full directory listings, then using / as a dimension separator instead of the default . might give an immediate performance boost. I.e., store=FSStore(*args, **kwargs, dimension_separator='/'). @kw90 could you give this a try?

d-v-b · Jan 06 '22

@kw90 could you give this a try?

Thanks @d-v-b for the suggestion! Sure, I'll give this a try. However, I'm not sure how to use the FSStore. Should it replace the S3Map in the following snippet, or serve as an intermediate layer between s3fs and Zarr?

import os
import s3fs
import zarr

s3 = s3fs.S3FileSystem(
    anon=False,
    use_ssl=True,
    key=os.getenv("S3_ACCESS_KEY"),
    secret=os.getenv("S3_SECRET_KEY"),
    client_kwargs={"endpoint_url": os.getenv("S3_ENDPOINT_URL")},
)
store = s3fs.S3Map(root="default", s3=s3, check=False)
cache = zarr.LRUStoreCache(store=store, max_size=2 ** 28)

kw90 · Jan 06 '22

ping :grin:

kw90 · Feb 14 '22

I think you would do something like this: store = FSStore('s3://bucket', dimension_separator = '/', storage_options = {'anon' : False, 'use_ssl' : True, ...}), with all your other kwargs to S3FileSystem in the storage_options kwarg.

d-v-b · Feb 14 '22
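[Editor's note: for anyone landing here, a minimal sketch of that FSStore-based setup, assuming a zarr 2.x release where FSStore forwards extra keyword arguments to the underlying s3fs filesystem as storage options. The bucket name and environment variables follow the earlier snippet and are hypothetical.]

import os
import zarr
from zarr.storage import FSStore

store = FSStore(
    "s3://default",
    dimension_separator="/",  # nested chunk keys instead of flat '.'-separated ones
    # Everything below is forwarded to s3fs.S3FileSystem as storage options.
    anon=False,
    use_ssl=True,
    key=os.getenv("S3_ACCESS_KEY"),
    secret=os.getenv("S3_SECRET_KEY"),
    client_kwargs={"endpoint_url": os.getenv("S3_ENDPOINT_URL")},
)
cache = zarr.LRUStoreCache(store=store, max_size=2 ** 28)
root = zarr.open(cache, mode="a")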