zarr-python [v3] support write_empty

Zarr version

3.0.0.beta

Numcodecs version

0.13

Python Version

3.11

Operating System

Mac

Installation

pip

Description

write_empty_chunks was a very useful write-time optimization that we should bring back to 3.0

https://zarr.readthedocs.io/en/stable/tutorial.html#empty-chunks

Steps to reproduce

In [23]: a = zarr.create(shape=(10, 10), chunks=(5, 5), write_empty_chunks
    ...: =True)
/Users/jhamman/miniforge3/envs/icechunk-demo/lib/python3.12/site-packages/zarr/core/sync.py:100: RuntimeWarning: write_empty_chunks is not yet implemented

Additional output

No response

Oct 18 '24 20:10 jhamman

I opened #2429 to address this issue, but I think we should maybe take some time here to discuss what kind of API we want before moving forward with that PR. I think write_empty_chunks is just one of many possible runtime properties that we might want to support, and we should probably settle on a coherent strategy for all of them.

problems with runtime stuff as simple `Array` attributes.

The zarr-python 2 solution to this problem was to create a runtime attribute of the Array class, namely Array.write_empty_chunks, that would influence the the chunk-writing behavior of a given array. I have a few problems with this approach:

You can only specify the write_empty_chunks parameter when you are creating a zarr array with a function call. If you are using something like Group.__getitem__ / __setitem__ to get an array handle, then there's no way to specify write_empty_chunks (or any other runtime properties of the array).
Once you created your array, you were locked in to a specific setting of write_empty_chunks. The attribute itself could of course be mutated after calling __init__ but this is underhanded stuff we want to avoid. There was no easy way to create a new array handle with a different write_empty_chunks attribute.

putting runtime stuff in a global config

Those are my main complaints with the simple approach we took in zarr-python v2. In #2429 I'm opting for a very different approach by making write_empty_chunks a global config option that can be locally overridden in the context of individual array write operations. Instead of making write_empty_chunks an attribute of an array, it's an attribute of an array writing context. I see advantages to this approach, but it might also be net bad -- it might not match the way users think about write_empty_chunks. In particular, if the mental model is that a zarr array is "write-empty-chunks-typed", then disconnecting write_empty_chunks from array instances will be a frustrating violation of expectations.

another idea: Array-local config

What if we do something in between -- we give arrays a config or runtime attribute that's scoped to an individual array instance. A few behaviors I'm imagining:

Array.config is an immutable dataclass that can be defined from a dict.
- The config attribute can evolve without cluttering the array creation constructor
If the array config is not specified, one is created from the global config values.
An array config can be defined for groups; unless overridden, arrays created from these groups will inherit that config. This solves the problem of controlling the write_empty_chunks attribute of arrays retrieved via group.__getitem__.
Once created, arrays ignore the global config values and always use the config defined on the array instance. This reduces surprising "action at a distance" for array operations.
Array handles with different config values are cheap to create, e.g. array.with_config({...}), which returns a new array object with a different config arrangement.

Does this seem reasonable? If so, I would steer #2429 in this direction. If not, I'm curious to hear other ideas.

Oct 23 '24 10:10 d-v-b

Here's the situation I was objecting to in #2429:

with config.set({'array.write_empty_chunks': True}):
    arr = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)

arr[:] = 0  # will this use write_empty_chunks or not? how would I know?

Ultimately, I think the user experience is best if they have a way to check how an array is configured. Array.write_empty_chunks (or Array.config.write_empty_chunks as proposed below) provides that ability. In general, I like config for defaults but we should try to make it possible for users to override the config context by directly passing things like write_empty_chunks to the Array constructor.

tldr; I like the idea of a Array.config dataclass. I suspect there will be more options like this as we grow applications and extensions.

Array handles with different config values are cheap to create, e.g. array.with_config({...}), which returns a new array object with a different config arrangement.

I think we could hold off on this part of the proposal for now if you want reduce scope. Not saying its

Oct 23 '24 13:10 jhamman

Here's the situation I was objecting to in #2429:

with config.set({'array.write_empty_chunks': True}):
    arr = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)

arr[:] = 0  # will this use write_empty_chunks or not? how would I know?

In a universe where we take the path of #2429, users with zarr-python 2 experience would have to change their expectations about how write_empty_chunks works; instead of expecting a write_empty_chunks array attribute, they would need to expect that "write empty chunksness" is a runtime context thing, and so the flow would look like this:

arr1 = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)
arr2 = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=1)

default =  config.get('array.write_empty_chunks')
print(f'the current default for write_empy_chunks is {default}')
arr1[:] = 0  # this will use the current default

# this will use the opposite of the default, for BOTH array writes
with config.set({'array.write_empty_chunks': not default}):
  arr1[:] = 0  
  arr2[:] = 1

I'm not advocating that we do this, just sketching out how it would work. There are definitely pros and cons, and perhaps the largest is requiring a new mental model for array runtime properties. Maybe the array.config idea gives us a nice balance.

Oct 23 '24 13:10 d-v-b

[v3] support write_empty_chunks

Zarr version

Numcodecs version

Python Version

Operating System

Installation

Description

Steps to reproduce

Additional output

problems with runtime stuff as simple Array attributes.

putting runtime stuff in a global config

another idea: Array-local config

problems with runtime stuff as simple `Array` attributes.