
When I repeatedly append data to a zarr array in s3, appending takes longer and longer and longer

Open · christine-e-smit opened this issue 3 years ago · 30 comments

Minimal, reproducible code sample, a copy-pastable example if possible

#!/usr/bin/env python3

###
#  Please run me in AWS! The command runs as:
# 
#    python3 loop_zarr.py  $YOUR_S3_BUCKET_HERE $YOUR_PREFIX_HERE > loop_zarr.txt
#
# This will take a while to run. The timing issue only becomes really clear after the code runs for a few hours.
###

import sys
import s3fs
import argparse
import datetime as dt
from textwrap import dedent
import zarr
import numpy as np


def main(argv):
    description = dedent(
        """\
        This script replicates the issues we see with repeatedly appending to an array
        in s3.
        """
    )
    parser = argparse.ArgumentParser(
        description=description, formatter_class=argparse.RawTextHelpFormatter
    )
    parser.add_argument("s3_bucket", help="s3 bucket for data")
    parser.add_argument("s3_path", help="path prefix in the bucket for the store")

    args = parser.parse_args(argv)

    # number of 8-byte floats in a chunk
    chunk_size = 2500
    # number of chunks to write each iteration before writing the timing
    num_chunks = 500
    # number of iterations for the loop
    num_iterations = 500

    # initialize the zarr store
    time_str = dt.datetime.now().strftime("%Y-%m-%d__%H-%M-%S.%f__")
    path = f"{args.s3_bucket}/{args.s3_path}{time_str}"
    print(f"Writing zarr data under {path}", flush=True)
    var_name = "var"
    print(f"Chunk size (in floats): {chunk_size}", flush=True)
    print(f"Number of chunks written per iteration: {num_chunks}", flush=True)
    print(f"Number of iterations: {num_iterations}", flush=True)
    # note: 8-byte floats
    total_bytes = chunk_size * 8 * num_chunks * num_iterations
    print(f"Total number of gigabytes to be written: {total_bytes/1e9}", flush=True)

    create_dataset(path, var_name, chunk_size)
    print("initialized zarr", flush=True)

    # Create some data to write out. We don't actually care what it is.
    chunk = np.random.normal(size=chunk_size)

    # Iterate through
    for i in range(num_iterations):
        start_iteration = dt.datetime.now()
        for j in range(num_chunks):
            append_data(path, var_name, chunk)
        iteration_length = (dt.datetime.now() - start_iteration).total_seconds()
        print(f"Done iteration {i}, {iteration_length} seconds", flush=True)


def create_dataset(path, var_name, chunk_size):
    s3 = s3fs.S3FileSystem()
    store = s3fs.S3Map(root=path, s3=s3, check=False)
    root = zarr.open(store, mode="w")
    root.create_dataset(var_name, shape=(0,), chunks=(chunk_size,), dtype=np.float64)


def append_data(path, var_name, chunk):
    s3 = s3fs.S3FileSystem()
    store = s3fs.S3Map(root=path, s3=s3, check=False)
    root = zarr.open(store, mode="r+")
    root[var_name].append(chunk)


if __name__ == "__main__":
    main(sys.argv[1:])

Problem description

When we repeatedly append to an array in s3, each append takes longer and longer and longer. We would normally expect this if zarr had to rearrange what it had already written, but we don't think that's what zarr does. As far as we can tell, zarr only needs to update the .zarray object before writing the new chunk of data into a new object. That .zarray object is very small and does not seem to grow as the array grows. So we expected appending a new chunk to an existing array to be quick, with the 3rd chunk taking approximately as long to append as the 300th chunk. This is why we think we're seeing a bug.
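To sanity-check that assumption, here's a minimal sketch using a plain dict as the store (zarr v2 accepts any MutableMapping; with an S3Map each of these keys is one S3 object). It shows that an append only rewrites the small .zarray and adds one new chunk key:

import numpy as np
import zarr

# In-memory stand-in for the S3 store: each key corresponds to one object.
store = {}
root = zarr.open(store, mode="w")
arr = root.create_dataset("var", shape=(0,), chunks=(2500,), dtype=np.float64)

arr.append(np.random.normal(size=2500))
print(sorted(store))  # ['.zgroup', 'var/.zarray', 'var/0']

before = dict(store)
arr.append(np.random.normal(size=2500))
changed = sorted(k for k in store if store[k] != before.get(k))
print(changed)  # ['var/.zarray', 'var/1'] - only the tiny metadata is rewritten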

The obvious solution is to just initialize the array to its full size. We've tried this and it does indeed work. So we do have a work-around. The issue is that our datasets do naturally grow because we get more data as time passes. Pre-allocating the array means we have to initialize the array to be extremely large and then keep track of its actual size somewhere else, which I think you will agree is a bit of a kluge, especially if we'd like to make these zarr stores available to third parties.

You may wonder why I repeatedly re-open the zarr store. The actual code we ran when we discovered this issue appended data periodically as data came in, so we were re-opening the zarr store repeatedly to write new data. I also am not sure what kind of caching/buffering zarr and/or s3fs may be doing. By allowing the zarr store to go out of scope, everything is presumably getting cleaned up and any cache/buffer is getting flushed.

Version and installation information

Please provide the following:

  • Value of zarr.__version__: 2.10.3
  • Value of numcodecs.__version__: 0.9.1
  • Version of Python interpreter: Python 3.7.10
  • Operating system (Linux/Windows/Mac): Linux (Amazon Linux 2)
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): pip3

Also, if you think it might be relevant, please provide the output from pip freeze or conda env export depending on which was used to install Zarr.

asciitree==0.3.3
aws-cfn-bootstrap==2.0
boto3==1.10.40
botocore==1.13.50
docutils==0.14
fasteners==0.16.3
fsspec==2022.1.0
jmespath==0.10.0
lockfile==0.11.0
numcodecs==0.9.1
numpy==1.19.1
pystache==0.5.4
python-daemon==2.2.3
python-dateutil==2.8.2
s3fs==0.4.2
s3transfer==0.2.1
simplejson==3.2.0
six==1.16.0
urllib3==1.25.11
zarr==2.10.3

christine-e-smit avatar Jan 14 '22 14:01 christine-e-smit

I ran the above script last night. It appears to have stopped before doing 500 iterations, but you can at least see the trend.

loop_zarr.txt

christine-e-smit avatar Jan 14 '22 15:01 christine-e-smit

My hunch would be that this is related to s3fs directory listing and caching. @martindurant any thoughts?

rabernat avatar Jan 14 '22 15:01 rabernat

I recommend turning on debug for the logger "s3fs" to see what S3 processes, if any, are taking up time. I'm not sure if zarr would be calling any listing operations, but that would be an obvious thing that gets longer as more files accrue.
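For reference, one way to turn that on from Python (standard-library logging; "s3fs" is the logger name):

import logging

# Emit s3fs debug messages with timestamps so every S3 call
# (head_object, put_object, listings, ...) is visible with its timing.
logger = logging.getLogger("s3fs")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
logger.addHandler(handler)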

martindurant avatar Jan 14 '22 15:01 martindurant

On it. Let me see what s3fs has to say. Thank you.

christine-e-smit avatar Jan 14 '22 16:01 christine-e-smit

First, I don't think s3fs is doing directory listings on the variable path. It only does directory listings at the higher level, where the .zgroup is. So the problem does not seem to be that s3fs is having to deal with more and more objects under the variable path.

Instead, the issue seems to be around writing the .zarray file.

Every time my code appends some new data, there are 14 new log messages from s3fs. Here are the last 14 messages from the s3fs log of the run I did last weekend. If you look carefully at the time stamps, you can see where the jump in time is:

2022-01-15 16:58:17,681 - s3fs - DEBUG - Get directory listing page for 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__
2022-01-15 16:58:17,701 - s3fs - DEBUG - Fetch: 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/.zgroup, 0-24
2022-01-15 16:58:17,721 - s3fs - DEBUG - CALL: head_object - ({},) - {'Bucket': '796a3c10-4335-412d-9832-28b5ff3162f4', 'Key': 'loop_zarr2022-01-14__18-13-56.018829__/var/.zarray'}
2022-01-15 16:58:17,739 - s3fs - DEBUG - CALL: head_object - ({},) - {'Bucket': '796a3c10-4335-412d-9832-28b5ff3162f4', 'Key': 'loop_zarr2022-01-14__18-13-56.018829__/var/.zarray'}
2022-01-15 16:58:17,757 - s3fs - DEBUG - CALL: head_object - ({},) - {'Bucket': '796a3c10-4335-412d-9832-28b5ff3162f4', 'Key': 'loop_zarr2022-01-14__18-13-56.018829__/var/.zarray'}
2022-01-15 16:58:17,774 - s3fs - DEBUG - Fetch: 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/var/.zarray, 0-322
2022-01-15 16:58:17,797 - s3fs - DEBUG - Upload for <File-like object S3FileSystem, 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/var/.zarray>, final=True, loc=322, buffer loc=322
2022-01-15 16:58:17,797 - s3fs - DEBUG - Commit <File-like object S3FileSystem, 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/var/.zarray>
2022-01-15 16:58:17,797 - s3fs - DEBUG - One-shot upload of <File-like object S3FileSystem, 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/var/.zarray>
2022-01-15 16:58:17,797 - s3fs - DEBUG - CALL: put_object - ({'cache_options': None},) - {'Key': 'loop_zarr2022-01-14__18-13-56.018829__/var/.zarray', 'Bucket': '796a3c10-4335-412d-9832-28b5ff3162f4'}
2022-01-15 16:58:18,061 - s3fs - DEBUG - Upload for <File-like object S3FileSystem, 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/var/249999>, final=True, loc=19185, buffer loc=19185
2022-01-15 16:58:18,061 - s3fs - DEBUG - Commit <File-like object S3FileSystem, 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/var/249999>
2022-01-15 16:58:18,061 - s3fs - DEBUG - One-shot upload of <File-like object S3FileSystem, 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/var/249999>
2022-01-15 16:58:18,061 - s3fs - DEBUG - CALL: put_object - ({'cache_options': None},) - {'Key': 'loop_zarr2022-01-14__18-13-56.018829__/var/249999', 'Bucket': '796a3c10-4335-412d-9832-28b5ff3162f4'}

When I plot the time between log messages in the last few chunks of 14 messages vs. the first few chunks of 14 messages, the problem is clearly between the following two messages:

2022-01-15 16:58:17,797 - s3fs - DEBUG - CALL: put_object - ({'cache_options': None},) - {'Key': 'loop_zarr2022-01-14__18-13-56.018829__/var/.zarray', 'Bucket': '796a3c10-4335-412d-9832-28b5ff3162f4'}
2022-01-15 16:58:18,061 - s3fs - DEBUG - Upload for <File-like object S3FileSystem, 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/var/249999>, final=True, loc=19185, buffer loc=19185

Here are the plots I did, showing the timing between messages at the beginning and end of the log:

[Plots of the time between the first 140 messages and the time between the last 140 messages]

christine-e-smit avatar Jan 20 '22 15:01 christine-e-smit

I think that the next thing I'd be curious to try is to use s3fs directly to repeatedly overwrite an object in s3 and see what happens. I think I'll also try using the boto3 library directly to do the same thing and see what the timing is.

christine-e-smit avatar Jan 20 '22 17:01 christine-e-smit

An alternative to appending, if you know the final size of the array, would be to pre-allocate a large empty array and just write to the desired region.

You could also consider storing your metadata in a separate, faster store (e.g. redis, database, local file, etc), if this really turns out to be a bottleneck.

rabernat avatar Jan 20 '22 17:01 rabernat

Perhaps it would be useful for you to add more logging to the _call_s3 method, to see at what point things are caught. There are ways to profile what's happening inside the async event loop / the state of fsspec's thread, but it would involve some work.

I wonder, what is your total event rate here? AWS imposes quotas or rate limits for some operations.

You could also consider storing your metadata in a separate, faster store (e.g. redis, database, local file, etc), if this really turns out to be a bottleneck.

If this is in one process, it can even be just a dict, and you can upload it when done. Zarr allows you to decouple the data and metadata stores.
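A rough sketch of that split (untested against this exact workload; the bucket/prefix is a placeholder), using zarr's chunk_store argument so metadata stays in a local dict while chunk data still goes to S3:

import s3fs
import zarr

s3 = s3fs.S3FileSystem()
# Placeholder bucket/prefix: chunk data goes here.
chunk_store = s3fs.S3Map(root="my-bucket/my-prefix", s3=s3, check=False)

# Metadata (.zgroup/.zarray) lives in any fast MutableMapping, e.g. a dict.
meta_store = {}
root = zarr.open_group(store=meta_store, chunk_store=chunk_store, mode="w")
var = root.create_dataset("var", shape=(0,), chunks=(2500,), dtype="float64")
# ... append to var as usual; only chunk objects are written to S3 ...

# When writing is finished, copy the metadata keys up to S3 so the store
# is readable on its own.
for key, value in meta_store.items():
    chunk_store[key] = value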

martindurant avatar Jan 20 '22 17:01 martindurant

@rabernat - We unfortunately do not have a final size for the array. Data keeps coming in until the mission is over. We did try pre-allocating an enormous array that was much bigger than our real datasets could ever grow to. This did sort out the timing issue, but made us worry about where to store the real size of the array.

@martindurant / @rabernat - storing the metadata somewhere else where zarr can find it may be the solution. I did not know that was possible and I will look into it. Thank you.

@martindurant - I think AWS write limits are possible, but unlikely. I just checked the throttling limits and my script is nowhere near them. I created a bucket just for this testing, so nothing else is writing there. The other possibility is network throttling, since I'm using a kind of dinky EC2 instance for this test. But then I'd expect all the writes to be throttled, and they aren't. It's just the .zarray update.

christine-e-smit avatar Jan 20 '22 17:01 christine-e-smit

Grasping at straws a bit, is versioning turned on?

From https://docs.aws.amazon.com/AmazonS3/latest/userguide/versioning-workflows.html:

If you notice a significant increase in the number of HTTP 503-slow down responses received for Amazon S3 PUT or DELETE object requests to a bucket that has S3 Versioning enabled, you might have one or more objects in the bucket for which there are millions of versions. For more information, see Significant Increases in HTTP 503 Responses to Amazon S3 Requests to Buckets with Versioning Enabled in the Troubleshooting section.

@martindurant: would the 503's be (silently) retried?

joshmoore avatar Jan 21 '22 07:01 joshmoore

@joshmoore - that is an interesting idea, but my bucket did not have versioning. This is good to know, however, if we ever decide to turn versioning on.

christine-e-smit avatar Jan 21 '22 14:01 christine-e-smit

Can you please check: if you end your python session, and then restart appending in a new process, do you still see the slowness?

martindurant avatar Jan 21 '22 14:01 martindurant

@martindurant - Will do. Not sure if I have time to run it today, but I will get back to you.

christine-e-smit avatar Jan 21 '22 14:01 christine-e-smit

@martindurant - I appended to an existing array using a new process and it was just as slow as appending had been at the end of the first process. So I decided to wait a day and try appending again to see if there was some kind of temporary throttling happening in s3. Appending was still slow after it had been more than 12 hours since I'd last touched the zarr store. I even tried creating a brand new EC2 instance to append to the existing zarr store and that was also slow. (I have double and triple checked - we are definitely not using versioning in this bucket. So that is not the issue.)

It feels as though this must be something outside of zarr, outside of s3fs, and outside of python. I'm going to try that test where I just repeatedly write to the same object using s3fs and then boto3. If those also show this slow down, I'm going to try contacting AWS support. My team has access to paid technical support.

Do you mind keeping the ticket open even though it seems as though this is most probably not zarr? If it is an AWS thing, others may run into the same issue and may find this thread helpful.

christine-e-smit avatar Jan 28 '22 22:01 christine-e-smit

Certainly, please let us know here anything you find out.

martindurant avatar Jan 28 '22 22:01 martindurant

I am stumped. I haven't had a ton of time to spend on this issue over the last 6 weeks, but I've tried to replicate the timing issue by just writing to s3 using s3fs and I can't. My initial attempt repeatedly overwrote a junk object in s3. When that didn't show any issues, I wrote a script that reads an object that is basically a .zarray json object, updates a value in it, and writes it back out. This also doesn't show any slow down.

I ran this thing for 1,000,000 read/write iterations, which took just over 24 hours. Here's what the time between the first 200 s3fs log messages looks like:

[Plot of the time between the first 200 s3fs log messages]

And the last 200 s3fs log messages: [plot of the time between the last 200 messages]

My script's logs show the same thing. The iterations at the beginning:

2022-02-15 15:37:53,138 ~ INFO ~ Read iteration 1, 0.054799
2022-02-15 15:37:53,224 ~ INFO ~ Write iteration 1, 0.085725
2022-02-15 15:37:53,278 ~ INFO ~ Read iteration 2, 0.054058
2022-02-15 15:37:53,314 ~ INFO ~ Write iteration 2, 0.035522
2022-02-15 15:37:53,369 ~ INFO ~ Read iteration 3, 0.054958
2022-02-15 15:37:53,407 ~ INFO ~ Write iteration 3, 0.038470
2022-02-15 15:37:53,461 ~ INFO ~ Read iteration 4, 0.053090
2022-02-15 15:37:53,500 ~ INFO ~ Write iteration 4, 0.039143

and end:

2022-02-16 17:38:58,155 ~ INFO ~ Read iteration 999997, 0.080782
2022-02-16 17:38:58,210 ~ INFO ~ Write iteration 999997, 0.055429
2022-02-16 17:38:58,281 ~ INFO ~ Read iteration 999998, 0.069964
2022-02-16 17:38:58,321 ~ INFO ~ Write iteration 999998, 0.039948
2022-02-16 17:38:58,399 ~ INFO ~ Read iteration 999999, 0.078502
2022-02-16 17:38:58,441 ~ INFO ~ Write iteration 999999, 0.041670
2022-02-16 17:38:58,494 ~ INFO ~ Read iteration 1000000, 0.052139
2022-02-16 17:38:58,540 ~ INFO ~ Write iteration 1000000, 0.046405

My script, for reference:

#!/usr/bin/env python
import os

# get the s3fs logging
os.environ["S3FS_LOGGING_LEVEL"] = "DEBUG"

import sys
import uuid
import logging
import s3fs
import argparse
import datetime as dt
from textwrap import dedent
import json


def get_logger():
    logger = logging.getLogger("{}:{}".format(__name__, str(uuid.uuid4())))
    format_ = logging.Formatter("%(asctime)s ~ %(levelname)s ~ %(message)s")
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(format_)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    return logger


def main(argv):
    logger = get_logger()

    description = dedent(
        """\
        This script is attempting to replicate the issues with writing zarr by
        repeatedly updating an object in s3.
        """
    )
    parser = argparse.ArgumentParser(
        description=description,
    )

    parser.add_argument("s3_url", help="Location to repeatedly write the object to")
    parser.add_argument("num_iterations", help="Number of times to write", type=int)
    parser.add_argument(
        "--profile", help="optional AWS profile for credentials", default=None
    )
    args = parser.parse_args(argv)

    # create the s3 file system resource for writing
    if args.profile is not None:
        fs = s3fs.S3FileSystem(profile=args.profile)
    else:
        fs = s3fs.S3FileSystem()

    # Create the basic metadata we're going to save out as json.
    dummy_metadata = {
        "chunks": [1],
        "compressor": {
            "blocksize": 0,
            "clevel": 5,
            "cname": "lz4",
            "id": "blosc",
            "shuffle": 1,
        },
        "dtype": "<M8[s]",
        "fill_value": 0,
        "filters": None,
        "order": "C",
        "shape": [0],
        "zarr_format": 2,
    }

    logger.info("Writing to %s", args.s3_url)

    # do an initial write of the metadata so we can read it out as part of the loop
    encoded = str.encode(json.dumps(dummy_metadata))
    with fs.open(args.s3_url, "wb") as f:
        f.write(encoded)

    for i in range(args.num_iterations):
        do_read_write(fs, args.s3_url, i + 1, logger)


def do_read_write(fs, s3_url, iteration_num, logger):

    # read
    now = dt.datetime.now()
    with fs.open(s3_url, "r") as f:
        read_str = f.read()
    read_time = (dt.datetime.now() - now).total_seconds()
    logger.info("Read iteration %d, %f", iteration_num, read_time)

    # then write
    metadata = json.loads(read_str)
    metadata["shape"] = iteration_num
    metadata_str = json.dumps(metadata)
    now = dt.datetime.now()
    with fs.open(s3_url, "w") as f:
        f.write(metadata_str)
    write_time = (dt.datetime.now() - now).total_seconds()
    logger.info("Write iteration %d, %f", iteration_num, write_time)


if __name__ == "__main__":
    main(sys.argv[1:])

christine-e-smit avatar Mar 09 '22 17:03 christine-e-smit

So I don't know how to replicate this timing issue outside of zarr.

christine-e-smit avatar Mar 09 '22 17:03 christine-e-smit

I'm not sure if I already suggested it, but it would be worthwhile looking at s3fs's logs to see whether zarr is calling list at any point - that might get slower as more chunks arrive.

martindurant avatar Mar 09 '22 17:03 martindurant

Yeah. You actually did mention that before. As I said in https://github.com/zarr-developers/zarr-python/issues/938#issuecomment-1017655172, the logs do show that zarr lists the root directory repeatedly, which makes total sense since my script repeatedly re-opens that directory. But that directory has only one object - .zgroup. Zarr is not listing the /var/ directory, which has all the variable's objects.

christine-e-smit avatar Mar 09 '22 18:03 christine-e-smit

If I'm reading the zarr code correctly (big if), I think this is where zarr asks s3fs (in my case) to write out the .zarray file. So, in theory, this is where I think the slow down is happening:

https://github.com/zarr-developers/zarr-python/blob/594c441693220eb4302f60c21c6bb021426bfdaa/zarr/core.py#L253-L269

In my case self._store is presumably the S3Map I created in this block of code:

def append_data(path, var_name, chunk):
    s3 = s3fs.S3FileSystem()
    store = s3fs.S3Map(root=path, s3=s3, check=False)
    root = zarr.open(store, mode="r+")
    root[var_name].append(chunk)

I'm going to be honest. I do not understand what S3Map is. When I look at the documentation (https://s3fs.readthedocs.io/en/latest/api.html#s3fs.mapping.S3Map), I see:

[Screenshot of the S3Map docstring from the s3fs API documentation]

which does not mean anything to me because I don't know why they'd mirror a different class and I don't know what previous class they are mirroring.

So in the script in https://github.com/zarr-developers/zarr-python/issues/938#issuecomment-1063149278, I just use the S3FileSystem directly to read and write stuff:

fs = s3fs.S3FileSystem(profile=args.profile)

... <STUFF> ...

    with fs.open(s3_url, "r") as f:
        read_str = f.read()

... <STUFF> ...

    with fs.open(s3_url, "w") as f:
        f.write(metadata_str)

If you can help me understand what S3Map is, I could try rewriting my script to use that part of s3fs to read and write files.

christine-e-smit avatar Mar 09 '22 18:03 christine-e-smit

S3Map is an FSMap, providing a dict-like view over storage. I suppose the doc is out of date; s3.get_mapper(root) is the way to do this now. zarr interacts by getting and setting items (getitem, setitem), which should just call cat and pipe on the filesystem.
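Roughly, the mapper is just this (the path is a placeholder), so reading or writing a key is a single GET/PUT on the object with that name:

import s3fs

s3 = s3fs.S3FileSystem()
mapper = s3.get_mapper("my-bucket/my-prefix")  # same thing S3Map wraps

mapper["var/.zarray"] = b'{"zarr_format": 2}'  # PUT the object .../var/.zarray
print(mapper["var/.zarray"])                   # GET it back as bytes
print(list(mapper))                            # list keys under the prefix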

martindurant avatar Mar 09 '22 18:03 martindurant

Hmmmmm. I'm going cross-eyed comparing my just-call-s3fs logs and the zarr-calls-s3fs logs. I see the same pattern of 6 log messages when I read and then write the .zarray in the just-call-s3fs logs. So yeah, using S3Map produces the same log messages as using S3FileSystem directly. I don't think that running a test with S3Map is going to make a difference.

christine-e-smit avatar Mar 09 '22 19:03 christine-e-smit

I think I'm going to try a script that just asks zarr to reshape the array rather than appending. This will write the .zarray object presumably, but there won't be any chunks of data written, which is closer to what my s3fs-only script was doing.

The extra time in the zarr logs is between the following log messages:

2022-01-15 16:58:17,797 - s3fs - DEBUG - CALL: put_object - ({'cache_options': None},) - {'Key': 'loop_zarr2022-01-14__18-13-56.018829__/var/.zarray', 'Bucket': '796a3c10-4335-412d-9832-28b5ff3162f4'}
2022-01-15 16:58:18,061 - s3fs - DEBUG - Upload for <File-like object S3FileSystem, 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__/var/249999>, final=True, loc=19185, buffer loc=19185

This is on the boundary in the log between the last message about writing the .zarray object and the first message about writing the new chunk object. So I think it would be interesting to see what happens if there are no new chunks.
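Something like this is what I have in mind (a sketch; the bucket/prefix is a placeholder for a store created by loop_zarr.py): each resize rewrites var/.zarray but never adds a chunk object.

import s3fs
import zarr

path = "my-bucket/my-prefix"  # placeholder for an existing loop_zarr store
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root=path, s3=s3, check=False)
root = zarr.open(store, mode="r+")
arr = root["var"]
for i in range(500):
    arr.resize(arr.shape[0] + 2500)  # grow by one (unwritten) chunk per iteration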

christine-e-smit avatar Mar 09 '22 19:03 christine-e-smit

Thank you for your help and patience.

christine-e-smit avatar Mar 09 '22 19:03 christine-e-smit

I've run a few more tests and I'm unfortunately still not sure where the slowdown is happening.

  1. I ran a test where I resized the zarr array, but I didn't actually write to the array. So the number of objects in the bucket didn't change. --> There was no slow down! So, the slow down does appear to be related to the number of objects in the bucket or maybe in the directory where the .zarray object is.

  2. I ran a test without zarr where I used s3fs to write lots and lots and lots of objects while also updating a single object over and over again. This was supposed to replicate what zarr does when I append data. --> There was no slow down. So, s3 can apparently handle repeatedly updating one object in a bucket that has more and more objects.

  3. I ran a test with zarr where I appended data, but I did not keep re-opening the zarr object. This got rid of the s3fs log messages about listing directories (E.g. - 2022-01-15 16:58:17,681 - s3fs - DEBUG - Get directory listing page for 796a3c10-4335-412d-9832-28b5ff3162f4/loop_zarr2022-01-14__18-13-56.018829__). --> There was a slow down.

So, even though the log messages from s3fs don't say anything about listing directories, they still slow down. Here are the messages from the last append:

2022-04-02 14:08:54,933 - s3fs - DEBUG - CALL: head_object - ({},) - {'Bucket': '5293f6a0-c941-4e68-af31-3638ed274b4e', 'Key': 'open_store/data2022-04-01__18-35-04.419966__/var/.zarray'}
2022-04-02 14:08:54,956 - s3fs - DEBUG - CALL: head_object - ({},) - {'Bucket': '5293f6a0-c941-4e68-af31-3638ed274b4e', 'Key': 'open_store/data2022-04-01__18-35-04.419966__/var/.zarray'}
2022-04-02 14:08:54,974 - s3fs - DEBUG - CALL: head_object - ({},) - {'Bucket': '5293f6a0-c941-4e68-af31-3638ed274b4e', 'Key': 'open_store/data2022-04-01__18-35-04.419966__/var/.zarray'}
2022-04-02 14:08:54,991 - s3fs - DEBUG - Fetch: 5293f6a0-c941-4e68-af31-3638ed274b4e/open_store/data2022-04-01__18-35-04.419966__/var/.zarray, 0-322
2022-04-02 14:08:55,013 - s3fs - DEBUG - Upload for <File-like object S3FileSystem, 5293f6a0-c941-4e68-af31-3638ed274b4e/open_store/data2022-04-01__18-35-04.419966__/var/.zarray>, final=True, loc=322, buffer loc=322
2022-04-02 14:08:55,013 - s3fs - DEBUG - Commit <File-like object S3FileSystem, 5293f6a0-c941-4e68-af31-3638ed274b4e/open_store/data2022-04-01__18-35-04.419966__/var/.zarray>
2022-04-02 14:08:55,013 - s3fs - DEBUG - One-shot upload of <File-like object S3FileSystem, 5293f6a0-c941-4e68-af31-3638ed274b4e/open_store/data2022-04-01__18-35-04.419966__/var/.zarray>
2022-04-02 14:08:55,013 - s3fs - DEBUG - CALL: put_object - ({'cache_options': None},) - {'Key': 'open_store/data2022-04-01__18-35-04.419966__/var/.zarray', 'Bucket': '5293f6a0-c941-4e68-af31-3638ed274b4e'}
2022-04-02 14:08:55,242 - s3fs - DEBUG - Upload for <File-like object S3FileSystem, 5293f6a0-c941-4e68-af31-3638ed274b4e/open_store/data2022-04-01__18-35-04.419966__/var/249999>, final=True, loc=19126, buffer loc=19126
2022-04-02 14:08:55,242 - s3fs - DEBUG - Commit <File-like object S3FileSystem, 5293f6a0-c941-4e68-af31-3638ed274b4e/open_store/data2022-04-01__18-35-04.419966__/var/249999>
2022-04-02 14:08:55,242 - s3fs - DEBUG - One-shot upload of <File-like object S3FileSystem, 5293f6a0-c941-4e68-af31-3638ed274b4e/open_store/data2022-04-01__18-35-04.419966__/var/249999>
2022-04-02 14:08:55,242 - s3fs - DEBUG - CALL: put_object - ({'cache_options': None},) - {'Key': 'open_store/data2022-04-01__18-35-04.419966__/var/249999', 'Bucket': '5293f6a0-c941-4e68-af31-3638ed274b4e'}

Once again, the time jump happens between writing the .zarray and writing the data object:

2022-04-02 14:08:55,013 - s3fs - DEBUG - CALL: put_object - ({'cache_options': None},) - {'Key': 'open_store/data2022-04-01__18-35-04.419966__/var/.zarray', 'Bucket': '5293f6a0-c941-4e68-af31-3638ed274b4e'}
2022-04-02 14:08:55,242 - s3fs - DEBUG - Upload for <File-like object S3FileSystem, 5293f6a0-c941-4e68-af31-3638ed274b4e/open_store/data2022-04-01__18-35-04.419966__/var/249999>, final=True, loc=19126, buffer loc=19126

You can see the timing quite clearly here at the beginning vs. at the end. (Note that the pattern repeats every 12 messages rather than 14 because I'm not reopening the zarr store, so there are no directory listing calls.)

[Plots of the time between log messages at the beginning and at the end of the run]

christine-e-smit avatar Apr 05 '22 21:04 christine-e-smit

Obviously, I accidentally swapped the x and y axis labels. Sorry:

[Corrected plots with the axis labels fixed]

christine-e-smit avatar Apr 05 '22 21:04 christine-e-smit

Hello! My colleague Hailiang is braver than I am and actually dove into zarr's code base. We think he found where the issue we're seeing lies. If you look at this section of code:

https://github.com/zarr-developers/zarr-python/blob/e381bf76261a1cc9c10613465ab39b968a24737b/zarr/core.py#L2430-L2441

in _resize_nosync(). The code is basically trying to figure out whether any chunks need to be deleted as a result of the resize. The way it does this is by looping over all the chunks in the old shape to see if they are still within the new shape. So, as the number of chunks grows, this loop takes longer and longer and loooooonger.

Suppose that the old shape is 10x20 chunks and the new shape is 9x30 chunks. If I'm understanding what the current code does, it will loop over those 200 old chunks and discard 20 chunks that are out of bounds of the new shape. But it feels as though this isn't the most efficient way of deleting chunks.

What my colleague's code suggestion does is loop over each dimension and figure out which chunks are clearly out of bounds. In the 10x20 --> 9x30 case, it looks at the first dimension and notices 10 > 9, so it flags the 20 chunks in the 10th row for deletion. It then looks at the second dimension and sees 20 < 30, so it doesn't need to delete any additional chunks.

With the proposed algorithm, if all you're doing is appending, the code will very quickly figure out that it doesn't have to delete anything. And, indeed, when I ran my scripts against this new version of the code, appending was very quick even with a very large array.

Not everyone will, of course, be appending when they try to reshape an array. But we think that actually looping over dimensions is generally still going to be faster than looping over all chunks. The only time we think looping over dimensions will be a performance issue is if someone creates an array with a huge number of dimensions and not very many chunks. So, if the array has a total of one chunk, but thousands of dimensions, this code change will be slower. Practically speaking, though, is this likely?

Hailiang and I both work at the GES DISC, one of NASA's earth science distributed active archive centers (DAACs). The data we deal with has three or four dimensions and is more or less constantly growing. The proposed change is a clear win for that type of data. You would have more of an understanding of what kind of data is used more broadly with zarr. Maybe our data is the unusual case and extremely high dimensional data is the norm.
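To make the comparison concrete, here is a simplified sketch of the two strategies (this is the idea only, not the actual zarr code or the PR; chunk size 1 so that chunk counts equal the shape):

import itertools
import math


def grid(shape, chunks):
    # Number of chunks along each dimension.
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def to_delete_current(old_shape, new_shape, chunks):
    # Current behavior (simplified): visit every chunk index of the old grid
    # and keep the ones that fall outside the new grid.
    old_grid, new_grid = grid(old_shape, chunks), grid(new_shape, chunks)
    return {
        idx
        for idx in itertools.product(*(range(n) for n in old_grid))
        if any(i >= n for i, n in zip(idx, new_grid))
    }


def to_delete_proposed(old_shape, new_shape, chunks):
    # Proposed behavior (simplified): per dimension, enumerate only the chunk
    # indices that lie beyond the new extent along that dimension.
    old_grid, new_grid = grid(old_shape, chunks), grid(new_shape, chunks)
    doomed = set()
    for dim, (old_n, new_n) in enumerate(zip(old_grid, new_grid)):
        if old_n <= new_n:
            continue  # dimension only grew (e.g. a pure append): nothing to delete
        ranges = [range(n) for n in old_grid]
        ranges[dim] = range(new_n, old_n)  # only the out-of-bounds slab
        doomed.update(itertools.product(*ranges))
    return doomed


# 10x20 chunks resized to 9x30 chunks: both approaches flag the same 20 chunks,
# but the current one walks all 200 old chunk indices to find them.
old, new, chunks = (10, 20), (9, 30), (1, 1)
assert to_delete_current(old, new, chunks) == to_delete_proposed(old, new, chunks)
print(len(to_delete_proposed(old, new, chunks)))       # 20

# A pure 1-D append never enters the per-dimension loop body at all.
print(to_delete_proposed((250000,), (250001,), (1,)))  # set()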

christine-e-smit avatar Apr 26 '22 15:04 christine-e-smit

Thanks a lot for your very detailed update @christine-e-smit ! Christine raised a good point about the performance concern when the number of dimensions is very large (since my PR needs to iterate over dimensions). I talked to Christine just now, and we think this probably isn't a big concern for now, since the limit on the number of dimensions for numpy arrays appears to be 32, but please correct us if this assumption is not valid.

hailiangzhang avatar Apr 26 '22 16:04 hailiangzhang

Yeah, I think assuming a small number of dimensions is very reasonable. Relatively few datasets will have more than 4 or 5 dimensions. Usually 2-3 spatial with possible extra dimensions for channels, time, batch, etc.

grlee77 avatar Apr 27 '22 02:04 grlee77

We did try pre-allocating an enormous array that was much bigger than our real datasets could ever grow to. This did sort out the timing issue, but made us worry about where to store the real size of the array.

Just a note on this - zarr doesn't write any data it doesn't need to. If you pre-allocate an enormous array, it won't write chunks of zeros at every possible index; it just won't create the chunk files (or directories, for the NestedDirectoryStore). Then if it goes looking for a chunk later and can't find it, it'll return a chunk full of a particular fill value. If you later need to "realise" that sparse array, you can shrink it to the bounds of your actual data.
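Roughly, for example (a sketch with an in-memory dict store; the same applies to an S3-backed store):

import numpy as np
import zarr

store = {}
# Pre-allocate a huge 1-D array; no chunk objects are created yet.
arr = zarr.open(store, mode="w", shape=(10**9,), chunks=(2500,),
                dtype="float64", fill_value=np.nan)

arr[:2500] = np.random.normal(size=2500)     # only chunk 0 is materialised
print(arr.nchunks, arr.nchunks_initialized)  # 400000 logical chunks, 1 actually stored
print(arr[2500:2503])                        # unwritten region reads back as the fill value

arr.resize(2500)  # later, shrink to the real extent of the data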

clbarnes avatar Jun 30 '22 12:06 clbarnes