write behavior for empty chunks

Open d-v-b opened this issue 7 months ago • 4 comments

In v2, you can choose at array access time whether empty chunks (defined as chunks that consist entirely of fill_value) are written to storage or skipped. This is an extremely useful feature for high-latency storage backends, and in any context where a large number of objects in storage is burdensome.
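To make the definition of an "empty chunk" concrete, here is a minimal sketch of the skip-or-write decision. It is a toy model, not zarr-python code: a dict stands in for the storage backend, plain lists stand in for chunk data, and the names `is_empty_chunk`, `write_chunk`, and `read_chunk` are hypothetical.

```python
# Toy model only: a dict plays the role of the store and lists play the role
# of chunk data. None of these function names come from zarr-python.

def is_empty_chunk(chunk, fill_value):
    """A chunk is 'empty' when every element equals fill_value."""
    return all(value == fill_value for value in chunk)


def write_chunk(store, key, chunk, fill_value, write_empty_chunks=True):
    """Store a chunk, optionally skipping chunks that are entirely fill_value.

    Returns True if an object was written, False if the write was skipped.
    """
    if not write_empty_chunks and is_empty_chunk(chunk, fill_value):
        # Remove any stale object so a later read falls back to fill_value.
        store.pop(key, None)
        return False
    store[key] = list(chunk)
    return True


def read_chunk(store, key, chunk_size, fill_value):
    """Chunks that are missing from the store read back as fill_value."""
    return store.get(key, [fill_value] * chunk_size)


store = {}
write_chunk(store, "c/0", [0, 0, 0, 0], fill_value=0, write_empty_chunks=False)
write_chunk(store, "c/1", [0, 5, 0, 0], fill_value=0, write_empty_chunks=False)
print(sorted(store))  # ['c/1']
```

Only one object ends up in the store, yet both chunks read back correctly. Note that the equality test in this sketch would not treat a NaN fill_value as empty, since NaN does not compare equal to itself; a real implementation would need to handle that case.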

We don't support this in v3 yet, but we should. How should we do it? I will throw out a few options in order of practicality:

  • emulate v2: provide a keyword argument like write_empty_chunks when accessing an array. All chunk writes from that array will be affected.
  • put the write_empty_chunks setting in a global config. All chunk writes from all arrays in a session will be affected by the config parameter.
  • design an API for array IO wherein IO is wrapped in a context that can be parametrized, e.g. with a context manager, and one of those parameters is the write_empty_chunks-ness of the write transaction. Highly speculative.

The first option seems pretty expedient, and I don't think we had a lot of problems with this approach in v2. The only drawback is that if people want the same array to write empty chunks for some writes but skip them for others, then they might need something like the second approach, which has its own drawbacks IMO (I'm not a big fan of mutable global state).
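For what it's worth, the friction with the first option might look like the sketch below: because the setting is fixed per handle, getting different behavior for different writes means holding two handles on the same data. `ToyArray` is a hypothetical stand-in, not the zarr-python `Array` class, and a dict stands in for the store.

```python
class ToyArray:
    """Hypothetical array handle; the setting is fixed when the handle is created."""

    def __init__(self, store, fill_value=0, write_empty_chunks=True):
        self.store = store
        self.fill_value = fill_value
        self.write_empty_chunks = write_empty_chunks

    def write_chunk(self, key, chunk):
        is_empty = all(value == self.fill_value for value in chunk)
        if is_empty and not self.write_empty_chunks:
            self.store.pop(key, None)   # skip the write, drop any stale object
        else:
            self.store[key] = list(chunk)


store = {}

# Two handles over the same store, differing only in the keyword argument.
skipping = ToyArray(store, write_empty_chunks=False)
writing = ToyArray(store, write_empty_chunks=True)

skipping.write_chunk("0", [0, 0])   # not stored
writing.write_chunk("1", [0, 0])    # stored even though it is all fill_value

print(sorted(store))  # ['1']
```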

I would propose that we emulate v2 for now (i.e., make write_empty_chunks a keyword argument to array access) and note any friction this causes, and consider ways to alleviate that in a subsequent design refresh if the friction is severe.

cc @constantinpape

d-v-b, Jul 05 '24 07:07