Write behavior for empty chunks
In v2, it is possible to set at array access time whether empty chunks (defined as chunks that are entirely `fill_value`) should be written to storage or skipped. This is an extremely useful feature for high-latency storage backends, or in any context where creating a large number of objects in storage is burdensome.
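For reference, this is exposed in v2 as a keyword argument at array access time. A minimal example, assuming the v2 `zarr.open_array` signature (the store path here is just an illustration):

```python
import zarr  # zarr-python v2.x

# Chunks that are entirely fill_value are skipped rather than persisted.
z = zarr.open_array(
    "example.zarr",            # illustrative path
    mode="w",
    shape=(10_000, 10_000),
    chunks=(1_000, 1_000),
    dtype="f8",
    fill_value=0,
    write_empty_chunks=False,
)
z[:1_000, :1_000] = 1.0  # only the single touched, non-empty chunk hits the store
```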
We don't support this in v3 yet, but we should. How should we do it? I will throw out a few options in order of practicality:
- emulate v2: provide a keyword argument like `write_empty_chunks` when accessing an array. All chunk writes from that array will be affected.
- put the `write_empty_chunks` setting in a global config. All chunk writes from all arrays in a session will be affected by the config parameter.
- design an API for array IO wherein IO is wrapped in a context that can be parametrized, e.g. with a context manager, and one of those parameters is the `write_empty_chunks`-ness of the write transaction. Highly speculative. (A toy sketch of all three shapes follows this list.)
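To make the shapes concrete, here is a small, self-contained toy sketch. None of this is zarr API; `write_chunk`, `GLOBAL_CONFIG`, and `write_context` are made-up names purely for illustration:

```python
from contextlib import contextmanager

import numpy as np

# Option 2: a mutable, session-wide setting (toy stand-in, not zarr's real config).
GLOBAL_CONFIG = {"write_empty_chunks": True}

# Option 3: a parametrized IO context that temporarily overrides the session setting.
@contextmanager
def write_context(**params):
    saved = dict(GLOBAL_CONFIG)
    GLOBAL_CONFIG.update(params)
    try:
        yield
    finally:
        GLOBAL_CONFIG.clear()
        GLOBAL_CONFIG.update(saved)

# Option 1: a per-call keyword argument that falls back to the session setting.
def write_chunk(store, key, chunk, fill_value=0, write_empty_chunks=None):
    if write_empty_chunks is None:
        write_empty_chunks = GLOBAL_CONFIG["write_empty_chunks"]
    if not write_empty_chunks and np.all(chunk == fill_value):
        return  # chunk is entirely fill_value: skip the store round-trip
    store[key] = chunk.tobytes()

store = {}
write_chunk(store, "c/0/0", np.zeros((4, 4)), write_empty_chunks=False)  # skipped (option 1)
with write_context(write_empty_chunks=False):
    write_chunk(store, "c/0/1", np.zeros((4, 4)))                        # skipped (option 3)
write_chunk(store, "c/1/0", np.ones((4, 4)))                             # written (default)
print(sorted(store))  # ['c/1/0']
```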
The first option seems pretty expedient, and I don't think we had a lot of problems with this approach in v2. The only drawback is that if people want the same array to exhibit conditional `write_empty_chunks` behavior, they might need something like the second approach, which has its own drawbacks IMO (I'm not a big fan of mutable global state).
I would propose that we emulate v2 for now (i.e., make `write_empty_chunks` a keyword argument to array access), note any friction this causes, and consider ways to alleviate it in a subsequent design refresh if the friction is severe.
cc @constantinpape
My 2 cents: I personally think skipping empty chunks should be the default behavior. We have use-cases where writing empty chunks is extremely bad (as in putting us over our I/O node quota immediately and killing everyone's jobs), and I am hesitant to hand students a library that writes empty chunks by default.
Besides this, both options 1 and 2 sound OK to me.
> My 2 cents: I personally think skipping empty chunks should be the default behavior.
I tend to agree. The one counter-point is that for dense arrays, i.e. those where every chunk contains useful data, inspecting each chunk adds some overhead to writing. But I think our design should err on the side of spending some extra CPU cycles rather than potentially generating massive numbers of useless files.
> inspecting each chunk adds some overhead to writing
Here is some data on that from my laptop (benchmark script below):
| dtype | size (bytes) | time (s) | throughput (MB/s) |
|---|---|---|---|
| i2 | 2000 | 1.40e-04 | 14.31 |
| i2 | 2000000 | 2.53e-04 | 7897.34 |
| i2 | 2000000000 | 6.26e-01 | 3195.49 |
| i4 | 4000 | 9.53e-05 | 41.98 |
| i4 | 4000000 | 2.01e-04 | 19863.44 |
| i4 | 4000000000 | 7.60e-01 | 5261.13 |
| i8 | 8000 | 3.64e-05 | 219.93 |
| i8 | 8000000 | 2.54e-04 | 31506.36 |
| i8 | 8000000000 | 1.24e+00 | 6441.56 |
| f4 | 4000 | 6.52e-05 | 61.38 |
| f4 | 4000000 | 2.60e-04 | 15382.13 |
| f4 | 4000000000 | 7.29e-01 | 5490.37 |
| f8 | 8000 | 4.01e-05 | 199.38 |
| f8 | 8000000 | 2.49e-04 | 32074.79 |
| f8 | 8000000000 | 1.19e+00 | 6706.42 |
Given that the throughput is many GB/s for larger chunk sizes, this check seems unlikely to be a rate-limiting step for most I/O-bound workloads against disk or cloud storage. So I think it's a fine default.
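For context, the per-chunk test that the benchmark approximates is just an equality check against the fill value. A minimal sketch, not zarr's actual implementation, that also handles a NaN fill value might look like this:

```python
import numpy as np

def chunk_is_empty(chunk: np.ndarray, fill_value) -> bool:
    """Return True if every element of `chunk` equals `fill_value` (illustrative only)."""
    if fill_value is None:
        return False  # no fill value defined, so no chunk counts as empty
    # NaN != NaN, so a NaN fill value needs a special case.
    if isinstance(fill_value, float) and np.isnan(fill_value):
        return bool(np.all(np.isnan(chunk)))
    return bool(np.all(chunk == fill_value))

print(chunk_is_empty(np.zeros((1_000, 1_000)), 0))        # True
print(chunk_is_empty(np.full((10, 10), np.nan), np.nan))  # True
```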
Does it make sense to think of this as an `Array -> Array` codec which may just abort the entire writing pipeline?
The benchmark script used to generate the table above:

```python
from time import perf_counter

import numpy as np

for dtype in ['i2', 'i4', 'i8', 'f4', 'f8']:
    for n in [1000, 1_000_000, 1_000_000_000]:
        data = np.zeros(n, dtype=dtype)
        nbytes = data.nbytes
        tic = perf_counter()
        np.any(data > 0)  # stand-in for the "is this chunk all fill_value?" check
        toc = perf_counter() - tic
        throughput = nbytes / toc / 1e6  # MB/s
        print(f"| {dtype} | {nbytes} | {toc:3.2e} | {throughput:4.2f} |")
```
xref: #2409
Was this fixed by https://github.com/zarr-developers/zarr-python/pull/2429?