Implement np.delete to surgically remove pieces of an array along an axis
Currently we only support simple resizing of zarr arrays; in the case of shrinking, data is removed from the end of each axis. However, users might want to remove data from the middle of an array, e.g. deleting a row or column. In that case, we could support an API like numpy.delete, which has the following signature:
numpy.delete(arr, obj, axis=None)
"""Return a new array with sub-arrays along an axis deleted. For a one-dimensional array, this returns those entries not returned by arr[obj]."""
To keep things simple, we could restrict this to cases where the delete operation aligns exactly with existing chunks. In that case, the operation would involve two steps (see the sketch after this list):
- Update the array shape
- Rename the affected chunks
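A minimal sketch of what those two steps might look like for a single axis, assuming zarr-v2-style dot-separated chunk keys stored directly in a MutableMapping; the function name, signature, and key layout here are assumptions for illustration, not an existing zarr API:

```python
from typing import MutableMapping, Sequence, Tuple


def delete_chunk_aligned(
    store: MutableMapping,
    shape: Sequence[int],
    chunks: Sequence[int],
    axis: int,
    start: int,
    stop: int,
) -> Tuple[int, ...]:
    """Remove the chunk-aligned slab [start, stop) along `axis`.

    Hypothetical sketch: assumes zarr-v2-style chunk keys such as "i.j.k"
    (dot-separated chunk indices) living directly in `store`; the caller is
    responsible for writing the returned shape back to the array metadata.
    """
    if start % chunks[axis] != 0 or (stop % chunks[axis] != 0 and stop != shape[axis]):
        raise ValueError("delete region must align with chunk boundaries")

    first = start // chunks[axis]                    # first chunk index removed
    n_removed = -(-(stop - start) // chunks[axis])   # ceil-divide: chunks removed along axis

    # Process chunks in ascending order along `axis`, so a rename never
    # clobbers a key that has not yet been deleted or moved.
    chunk_keys = sorted(
        (k for k in store if all(part.isdigit() for part in k.split("."))),
        key=lambda k: int(k.split(".")[axis]),
    )
    for key in chunk_keys:
        idx = [int(part) for part in key.split(".")]
        if first <= idx[axis] < first + n_removed:
            del store[key]                           # chunk lies inside the deleted slab
        elif idx[axis] >= first + n_removed:
            idx[axis] -= n_removed                   # shift later chunks toward the origin
            store[".".join(map(str, idx))] = store.pop(key)

    new_shape = list(shape)
    new_shape[axis] -= stop - start
    return tuple(new_shape)
```

On an object store, the "rename" in the second branch would become a copy followed by a delete, which is where the cost and consistency questions below come in.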
If I understand this proposal correctly, deleting from the beginning of a large array would entail renaming every chunk in the array. I'm curious how this can be made robust -- I don't think I've used CopyObject on S3, so I have no intuition for how much latency it introduces, but suppose it adds substantial latency. What happens if a user cancels during the renaming step?
> To keep things simple, we could restrict this to cases where the delete operation aligns exactly with existing chunks.
Would it make sense to consider building this on top of a user-facing chunk-centric API? Otherwise, users will have to manually check that their array indexing is chunk-aligned.
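Something like the following (a hypothetical wrapper around the delete_chunk_aligned sketch above, assuming a zarr-v2-style Array exposing .shape, .chunks, and .store) would take indices in units of chunks, so alignment is guaranteed by construction rather than checked after the fact:

```python
def delete_chunks(array, axis: int, chunk_start: int, chunk_stop: int):
    # Hypothetical chunk-centric entry point (names invented for illustration).
    # Convert chunk indices to element offsets; the deleted region is
    # chunk-aligned by construction.
    start = chunk_start * array.chunks[axis]
    stop = min(chunk_stop * array.chunks[axis], array.shape[axis])
    return delete_chunk_aligned(array.store, array.shape, array.chunks, axis, start, stop)
```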
> deleting from the beginning of a large array would entail renaming every chunk in the array.
Yup, it would. Some stores support this cheaply (e.g. a filesystem, Arraylake); on others (e.g. S3) it's very expensive. It could be a fragile operation, for sure.
The consistency issue you raised is not unique to this operation; interruptions during any modification of the store (e.g. a move) pose similar challenges.