zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

[v3] support write_empty_chunks

Open jhamman opened this issue 1 year ago • 3 comments

Zarr version

3.0.0.beta

Numcodecs version

0.13

Python Version

3.11

Operating System

Mac

Installation

pip

Description

write_empty_chunks was a very useful write-time optimization that we should bring back to 3.0

https://zarr.readthedocs.io/en/stable/tutorial.html#empty-chunks

Steps to reproduce

In [23]: a = zarr.create(shape=(10, 10), chunks=(5, 5), write_empty_chunks
    ...: =True)
/Users/jhamman/miniforge3/envs/icechunk-demo/lib/python3.12/site-packages/zarr/core/sync.py:100: RuntimeWarning: write_empty_chunks is not yet implemented

Additional output

No response

jhamman avatar Oct 18 '24 20:10 jhamman

I opened #2429 to address this issue, but I think we should maybe take some time here to discuss what kind of API we want before moving forward with that PR. I think write_empty_chunks is just one of many possible runtime properties that we might want to support, and we should probably settle on a coherent strategy for all of them.

problems with runtime stuff as simple Array attributes.

The zarr-python 2 solution to this problem was to create a runtime attribute of the Array class, namely Array.write_empty_chunks, that would influence the the chunk-writing behavior of a given array. I have a few problems with this approach:

  • You can only specify the write_empty_chunks parameter when you are creating a zarr array with a function call. If you are using something like Group.__getitem__ / __setitem__ to get an array handle, then there's no way to specify write_empty_chunks (or any other runtime properties of the array).
  • Once you created your array, you were locked in to a specific setting of write_empty_chunks. The attribute itself could of course be mutated after calling __init__ but this is underhanded stuff we want to avoid. There was no easy way to create a new array handle with a different write_empty_chunks attribute.

putting runtime stuff in a global config

Those are my main complaints with the simple approach we took in zarr-python v2. In #2429 I'm opting for a very different approach by making write_empty_chunks a global config option that can be locally overridden in the context of individual array write operations. Instead of making write_empty_chunks an attribute of an array, it's an attribute of an array writing context. I see advantages to this approach, but it might also be net bad -- it might not match the way users think about write_empty_chunks. In particular, if the mental model is that a zarr array is "write-empty-chunks-typed", then disconnecting write_empty_chunks from array instances will be a frustrating violation of expectations.

another idea: Array-local config

What if we do something in between -- we give arrays a config or runtime attribute that's scoped to an individual array instance. A few behaviors I'm imagining:

  • Array.config is an immutable dataclass that can be defined from a dict.
    • The config attribute can evolve without cluttering the array creation constructor
  • If the array config is not specified, one is created from the global config values.
  • An array config can be defined for groups; unless overridden, arrays created from these groups will inherit that config. This solves the problem of controlling the write_empty_chunks attribute of arrays retrieved via group.__getitem__.
  • Once created, arrays ignore the global config values and always use the config defined on the array instance. This reduces surprising "action at a distance" for array operations.
  • Array handles with different config values are cheap to create, e.g. array.with_config({...}), which returns a new array object with a different config arrangement.

Does this seem reasonable? If so, I would steer #2429 in this direction. If not, I'm curious to hear other ideas.

d-v-b avatar Oct 23 '24 10:10 d-v-b

Here's the situation I was objecting to in #2429:

with config.set({'array.write_empty_chunks': True}):
    arr = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)

arr[:] = 0  # will this use write_empty_chunks or not? how would I know?

Ultimately, I think the user experience is best if they have a way to check how an array is configured. Array.write_empty_chunks (or Array.config.write_empty_chunks as proposed below) provides that ability. In general, I like config for defaults but we should try to make it possible for users to override the config context by directly passing things like write_empty_chunks to the Array constructor.


tldr; I like the idea of a Array.config dataclass. I suspect there will be more options like this as we grow applications and extensions.

Array handles with different config values are cheap to create, e.g. array.with_config({...}), which returns a new array object with a different config arrangement.

I think we could hold off on this part of the proposal for now if you want reduce scope. Not saying its

jhamman avatar Oct 23 '24 13:10 jhamman

Here's the situation I was objecting to in #2429:

with config.set({'array.write_empty_chunks': True}):
    arr = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)

arr[:] = 0  # will this use write_empty_chunks or not? how would I know?

In a universe where we take the path of #2429, users with zarr-python 2 experience would have to change their expectations about how write_empty_chunks works; instead of expecting a write_empty_chunks array attribute, they would need to expect that "write empty chunksness" is a runtime context thing, and so the flow would look like this:

arr1 = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)
arr2 = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=1)

default =  config.get('array.write_empty_chunks')
print(f'the current default for write_empy_chunks is {default}')
arr1[:] = 0  # this will use the current default

# this will use the opposite of the default, for BOTH array writes
with config.set({'array.write_empty_chunks': not default}):
  arr1[:] = 0  
  arr2[:] = 1

I'm not advocating that we do this, just sketching out how it would work. There are definitely pros and cons, and perhaps the largest is requiring a new mental model for array runtime properties. Maybe the array.config idea gives us a nice balance.

d-v-b avatar Oct 23 '24 13:10 d-v-b