Iterative Shard Writing ala TensorStore Transactions / TensorStore `chunk_layout.write_chunk`
Describe the new feature you'd like
Hi, I saw a comment comment on the ossci zulip by @mkitti where he raised the idea for creating a simplified version of TensorStore's transactions specifically with shards:
with shard.write_context() as shard_write:
for chunk in shard_write.inner_chunks():
shard_write[chunk] = calculation(chunk)
I've found TensorStore's transactions and iterating over their chunk_layout useful for memory bounded workloads. For example when downsampling datasets.
The following plot displays downsampling a fish of shape TCZYX (24, 2, 42, ~3.8K, ~13.1K) where we vary the "batch size" with ts.Transaction.
Here's some sample code I've used for this in TensorStore:
In this case, the batch would be the number of chunks we write at once in a single transaction.
downsampled = ts.downsample(
source_ts, downsample_factors=downsample_factors, method=method
)
step = target_ts.chunk_layout.write_chunk.shape[0]
for start in range(0, downsampled.shape[0], step):
with ts.Transaction() as txn:
target_with_txn = target_ts.with_transaction(txn)
downsampled_with_txn = downsampled.with_transaction(txn)
stop = min(start + step, downsampled.shape[0])
target_with_txn[start:stop].write(downsampled_with_txn[start:stop]).result()
As an aside I find this notation, and all of the result() calls, and indexing and ranging to be repulsive, but oh well. Maybe if there's a nice way to handle that under the hood within the context manager that'd be very cool.
I see there is this "regular grid" for chunk grids, but I have no idea how to use this or if it's even iterable like ts.chunk_layout.write_chunk or related to this concept.
Let me know your thoughts, maybe there's stuff already in the zarr-python codebase where I can do something like this manually, tyty.
✨Le Fishe✨
I think the basic idea is there. You do a bunch of operarions with the write context, but no I/O happens until the context closes. When the context closes there is an opportunity to coalesce I/O operations. For example, write the entire shard all at once rather than individual write operations for individual chunks.
Just a few asides:
-
If I'm iterating in tensorstore, I might use an
IndexDomain. It would be nice to have chunk domain iterator here as well. -
The alternative to using
.result()is to useawaitand write asynchronous functions. zarr-python also has this facility. That said I do not like how asynchronous functions are colored in Python in that they cannot be easily mixed with non-asynchronous functions.
i don't really understand why you need transactions here? aren't you just iterating over chunk or shard-sized regions? We do have some functions for this kind of iteration. They are currently private but I think there's interest in making these public:
https://github.com/zarr-developers/zarr-python/blob/94d543ccff70f6348029027887868e443ae4cfaa/src/zarr/core/array.py#L5340-L5494
i don't really understand why you need transactions here? aren't you just iterating over chunk or shard-sized regions?
Let's say you want to iterate over 32^3 shaped chunks but have 1024^3 shaped shards. You do want to write the shard until you have processed all 32,786 chunks.
What is the API to not write a shard until you finished processing all the chunks?
What is the API to not write a shard until you finished processing all the chunks?
I see the issue now! we have no such API for this. One idea that comes to mind is writing the chunks to an un-sharded in-memory zarr array, then re-encoding those chunks to a single shard. This would be nice insofar as a "transaction" is actually just creating an array, then re-encoding that array
Yes. That's what the proposed three line syntax should do. It would be even better if it accepted lazy tasks so you don't actually have to buffer the whole shard.
Yes. That's what the proposed three line syntax should do. It would be even better if it accepted lazy tasks so you don't actually have to buffer the whole shard.
@d-v-b
We do have some functions for this kind of iteration. They are currently private but I think there's interest in making these public
Oh that looks nice for some simple iteration functionality, I would love for that to be public.