[v3] support write_empty_chunks
Zarr version
3.0.0.beta
Numcodecs version
0.13
Python Version
3.11
Operating System
Mac
Installation
pip
Description
write_empty_chunks was a very useful write-time optimization that we should bring back to 3.0
https://zarr.readthedocs.io/en/stable/tutorial.html#empty-chunks
Steps to reproduce
In [23]: a = zarr.create(shape=(10, 10), chunks=(5, 5), write_empty_chunks
...: =True)
/Users/jhamman/miniforge3/envs/icechunk-demo/lib/python3.12/site-packages/zarr/core/sync.py:100: RuntimeWarning: write_empty_chunks is not yet implemented
Additional output
No response
I opened #2429 to address this issue, but I think we should maybe take some time here to discuss what kind of API we want before moving forward with that PR. I think write_empty_chunks is just one of many possible runtime properties that we might want to support, and we should probably settle on a coherent strategy for all of them.
problems with runtime stuff as simple Array attributes.
The zarr-python 2 solution to this problem was to create a runtime attribute of the Array class, namely Array.write_empty_chunks, that would influence the the chunk-writing behavior of a given array. I have a few problems with this approach:
- You can only specify the
write_empty_chunksparameter when you are creating a zarr array with a function call. If you are using something likeGroup.__getitem__ / __setitem__to get an array handle, then there's no way to specifywrite_empty_chunks(or any other runtime properties of the array). - Once you created your array, you were locked in to a specific setting of
write_empty_chunks. The attribute itself could of course be mutated after calling__init__but this is underhanded stuff we want to avoid. There was no easy way to create a new array handle with a differentwrite_empty_chunksattribute.
putting runtime stuff in a global config
Those are my main complaints with the simple approach we took in zarr-python v2. In #2429 I'm opting for a very different approach by making write_empty_chunks a global config option that can be locally overridden in the context of individual array write operations. Instead of making write_empty_chunks an attribute of an array, it's an attribute of an array writing context. I see advantages to this approach, but it might also be net bad -- it might not match the way users think about write_empty_chunks. In particular, if the mental model is that a zarr array is "write-empty-chunks-typed", then disconnecting write_empty_chunks from array instances will be a frustrating violation of expectations.
another idea: Array-local config
What if we do something in between -- we give arrays a config or runtime attribute that's scoped to an individual array instance. A few behaviors I'm imagining:
Array.configis an immutable dataclass that can be defined from a dict.-
- The
configattribute can evolve without cluttering the array creation constructor
- The
- If the array config is not specified, one is created from the global config values.
- An array config can be defined for groups; unless overridden, arrays created from these groups will inherit that config. This solves the problem of controlling the
write_empty_chunksattribute of arrays retrieved viagroup.__getitem__. - Once created, arrays ignore the global config values and always use the config defined on the array instance. This reduces surprising "action at a distance" for array operations.
- Array handles with different config values are cheap to create, e.g.
array.with_config({...}), which returns a new array object with a different config arrangement.
Does this seem reasonable? If so, I would steer #2429 in this direction. If not, I'm curious to hear other ideas.
Here's the situation I was objecting to in #2429:
with config.set({'array.write_empty_chunks': True}):
arr = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)
arr[:] = 0 # will this use write_empty_chunks or not? how would I know?
Ultimately, I think the user experience is best if they have a way to check how an array is configured. Array.write_empty_chunks (or Array.config.write_empty_chunks as proposed below) provides that ability. In general, I like config for defaults but we should try to make it possible for users to override the config context by directly passing things like write_empty_chunks to the Array constructor.
tldr; I like the idea of a Array.config dataclass. I suspect there will be more options like this as we grow applications and extensions.
Array handles with different config values are cheap to create, e.g. array.with_config({...}), which returns a new array object with a different config arrangement.
I think we could hold off on this part of the proposal for now if you want reduce scope. Not saying its
Here's the situation I was objecting to in #2429:
with config.set({'array.write_empty_chunks': True}): arr = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0) arr[:] = 0 # will this use write_empty_chunks or not? how would I know?
In a universe where we take the path of #2429, users with zarr-python 2 experience would have to change their expectations about how write_empty_chunks works; instead of expecting a write_empty_chunks array attribute, they would need to expect that "write empty chunksness" is a runtime context thing, and so the flow would look like this:
arr1 = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)
arr2 = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=1)
default = config.get('array.write_empty_chunks')
print(f'the current default for write_empy_chunks is {default}')
arr1[:] = 0 # this will use the current default
# this will use the opposite of the default, for BOTH array writes
with config.set({'array.write_empty_chunks': not default}):
arr1[:] = 0
arr2[:] = 1
I'm not advocating that we do this, just sketching out how it would work. There are definitely pros and cons, and perhaps the largest is requiring a new mental model for array runtime properties. Maybe the array.config idea gives us a nice balance.