zarr-python
Support discarding chunks
I'd like to have the option of discarding chunks from the store by telling the Array which parts I don't need anymore.
Use case: keeping a rolling window:
import time

import numpy as np
import zarr

z = zarr.open('/tmp/test.zarr', mode='w', shape=(100000, 1000, 1000), chunks=(100, 100, 100), dtype='i4')
n = 0
while True:
    measurements = np.arange(1000000).reshape(1000, 1000)  # simulate important data
    z[n] = measurements
    n += 1
    time.sleep(360)
(I know that writing incrementally like this is very slow because of the write amplification. But we only append every few minutes, and the chunking is optimized for our read patterns.)
Now I'd like to clear old measurements to free space on disk. I didn't find a way to unset data, so the only way to reduce the chunks is to set them to a constant value:
z[0:100] = 0
But that is slow and doesn't clear the chunks. It just makes them smaller. So what I'd like to be able to do here, is to write
z.discard(slice(0, 100))
and the store deletes all chunks that are completely in that slice.
Behaviour of proposed discard() method:
- Deleting is optional: The underlying store may delete chunks or do nothing. This allows keeping simple implementations around like memory stores where discarding is not important.
- Only full chunks are deleted. If a chunk is only partially in the slice, it will be kept. This means that the call does not cause heavy write-load by rewriting chunks.
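To make the proposal concrete, here is a minimal sketch of what such a discard() could do internally, assuming a v2-style store that is a plain MutableMapping with "i.j.k" chunk keys and a discard limited to a prefix of axis 0. The helper names are mine, not part of the zarr API:

```python
import itertools

def full_chunks_before(stop, shape, chunks):
    """Yield the keys of chunks whose axis-0 extent lies entirely
    within rows [0, stop); all other axes are covered completely.
    (Hypothetical helper, not part of the zarr API.)"""
    n_full = stop // chunks[0]  # only whole chunks along axis 0 qualify
    grids = [range(n_full)]
    grids += [range(-(-s // c)) for s, c in zip(shape[1:], chunks[1:])]
    for idx in itertools.product(*grids):
        yield ".".join(map(str, idx))

def discard(store, stop, shape, chunks):
    """Delete every chunk fully contained in rows [0, stop)."""
    for key in full_chunks_before(stop, shape, chunks):
        store.pop(key, None)  # a store may also choose to ignore deletes

# Example: a 300x200 array chunked 100x100, discarding rows [0, 150);
# only the first row of chunks ("0.0" and "0.1") is fully inside.
store = {f"{i}.{j}": b"chunk" for i in range(3) for j in range(2)}
discard(store, 150, (300, 200), (100, 100))
print(sorted(store))  # → ['1.0', '1.1', '2.0', '2.1']
```

Note how the partially covered row of chunks at rows 100–199 survives, matching the "only full chunks are deleted" rule above.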
Background:
We're using Zarr to store meteorological data such as satellite imagery. The arrays have the dimensions time, x, and y. For many datasets we have a rolling window of images we keep. Currently, we delete old chunks by a script that knows the NestedDirectoryStore on-disk format. But it would be nicer to have this option in the API. I completely understand if you think that such usage is not the primary focus of Zarr. The workaround works well enough for us.
Have you tried deleting chunks through z.chunk_store?
@jakirkham To get the filesystem path I use z.store.path, appending z.path to it. And then I delete from the filesystem. Did I miss something in the Storage API?
I would look at deleting chunks through z.chunk_store (where z is an Array). You can just do del z.chunk_store["0.0"], for example. It's in the docs under Array attributes, but it is difficult to link to directly (as Sphinx doesn't add anchors for attributes).
https://zarr.readthedocs.io/en/stable/api/core.html
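For anyone reading along, here is roughly what that deletion amounts to. A zarr v2 store is a MutableMapping, so a plain dict stands in for one below; this is a sketch of the mechanism, not the real zarr API in action:

```python
# A v2 store maps keys such as "0.0" (chunk coordinates joined by the
# dimension separator) to the compressed chunk bytes.
store = {
    ".zarray": b"{...}",      # array metadata (contents elided)
    "0.0": b"chunk bytes",
    "1.0": b"chunk bytes",
}

# del z.chunk_store["0.0"] boils down to exactly this; on the next
# read, zarr treats the missing chunk as filled with fill_value.
del store["0.0"]
print(sorted(store))  # → ['.zarray', '1.0']
```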
Hi. Is there a trick to compute the chunk IDs from source coordinates? Thanks.
Hi @christophenoel, you probably want to look at these methods in zarr/core.py:
- _get_selection
  - _chunk_getitem
    - _chunk_key
Thanks. This looks a little too tough to me. Cheers.
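The arithmetic itself is simpler than the internals suggest: the chunk index along each axis is just the coordinate floor-divided by the chunk size. A sketch for v2-style "."-separated keys (the helper name is mine, not from zarr):

```python
def chunk_key(coords, chunks, sep="."):
    """Key of the chunk containing element `coords`: the chunk index
    on each axis is coord // chunk_size. (Hypothetical helper,
    not part of the zarr API.)"""
    return sep.join(str(c // s) for c, s in zip(coords, chunks))

# Element (250, 37, 912) of an array chunked as (100, 100, 100)
# lives in chunk "2.0.9".
print(chunk_key((250, 37, 912), (100, 100, 100)))  # → 2.0.9
```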
I have created a TrimmableDirectoryStore which acts as a drop-in replacement for DirectoryStore and can adjust and store an internal chunk-coordinate offset in the metadata. I override __getitem__ and __setitem__ to apply this offset each time a chunk is requested. The new class also offers methods for deleting chunks below the offset.
def __setitem__(self, path: str, item):
    """Applies the offset first."""
    return super().__setitem__(self.__apply_offset(path), item)

def __getitem__(self, path: str):
    """Applies the offset first."""
    return super().__getitem__(self.__apply_offset(path))
The idea is borrowed from the UK Met Office Informatics Lab: https://github.com/informatics-lab/rolling_zarr_array_store/blob/master/rolling_zarr_array_store/init.py
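As a toy illustration of what that offset logic might look like, here is a sketch with a plain dict standing in for DirectoryStore. The class name, the `chunk_offset` attribute, and the one-axis behaviour are my assumptions, not taken from the store described above:

```python
class OffsetStore(dict):
    """Toy in-memory stand-in for a trimmable store. `chunk_offset`
    is a hypothetical attribute counting chunks already trimmed
    along axis 0."""

    def __init__(self, chunk_offset=0):
        super().__init__()
        self.chunk_offset = chunk_offset

    def __apply_offset(self, path: str) -> str:
        parts = path.split(".")
        if not parts[0].isdigit():
            return path  # metadata keys like ".zarray" pass through
        parts[0] = str(int(parts[0]) + self.chunk_offset)  # shift axis 0
        return ".".join(parts)

    def __setitem__(self, path: str, item):
        super().__setitem__(self.__apply_offset(path), item)

    def __getitem__(self, path: str):
        return super().__getitem__(self.__apply_offset(path))

s = OffsetStore(chunk_offset=2)
s["0.0"] = b"chunk"    # actually stored under the shifted key "2.0"
print(list(s.keys()))  # → ['2.0']
```

Callers keep using logical chunk indices starting at 0, while on disk the keys keep growing, which is what makes trimming old chunks safe.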
Since 2.15.0 you also need to implement the getitems method.
NOTE: this is experimental, but I've found it to work quite well so far. I could check with my company whether it's OK to release the source of this store.