
Possible feature request on saving array mechanism

Open AhmetCanSolak opened this issue 3 years ago • 1 comments

Hello all,

As far as I understand, the current zarr-python implementation overwrites all chunks by default when an array that has already been saved is modified and then saved again. (Correct me if I am wrong, but here is what I am looking at, for example: https://github.com/zarr-developers/zarr-python/blob/505810c44108328ec5732ad8460057f016994fd3/zarr/convenience.py#L170.) In some cases (such as image acquisition software that keeps writing chunks as data arrives over hours or days), it can be wasteful to overwrite all chunks, especially when only the new chunks differ.

A less wordy explanation of the concern:

  • Imagine you have an array on disk with 1000 chunks.
  • You want to append, let's say, 1000 more chunks of data to the array.
  • You want the zarr API to realize that the first 1000 chunks will be identical anyway, skip overwriting them, and write only the new chunks.

Here at the opensci2022 meeting, I have been discussing this with @jakirkham, and he suggested that one can resize the array first and fill only the new chunks with the newly available values/frames. I think it is a valid way to address the concern. I would like to discuss whether we can implement this internally and do it by default. It may or may not change the existing public API (happy to discuss here). A few implementation ideas:

  • There is a require_dataset endpoint: https://github.com/zarr-developers/zarr-python/blob/ce129a560d48854aee533bb2699a3f28b396bc22/zarr/hierarchy.py#L997. Maybe we can implement a similar function, require_chunks, that does the check internally, and call it from the save_array endpoint?
  • There is already an append API here: https://github.com/zarr-developers/zarr-python/blob/43266eec01561186b1b32e2fe3b12247130a0f0d/zarr/core.py#L2507, but I am not sure whether it would work as I explain above in all cases. As I understand it, it works along one axis at a time.

Any ideas/comments/discussions welcome!

AhmetCanSolak avatar Sep 21 '22 17:09 AhmetCanSolak

Thanks, @AhmetCanSolak. Cross-linking here as during the community meeting: https://github.com/zarr-developers/zarr-python/issues/1017

joshmoore avatar Sep 21 '22 18:09 joshmoore