
performance optimization in `delitems`

Open d-v-b opened this issue 2 years ago • 6 comments

In `FSStore.delitems`, only keys that exist in the store are deleted. This requires checking each key in a loop, which incurs significant overhead on remote filesystems like s3. I suspect that sequentially calling `FSStore.__contains__` here erodes much of the benefit of using a bulk delete.
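To illustrate the cost, here is a minimal simulation (not zarr's actual code; `CountingStore` and its methods are hypothetical stand-ins) where each membership check is one request but the bulk delete is a single request:

```python
# Simulated store that counts round trips, mimicking the request pattern
# of a remote filesystem like s3.
class CountingStore:
    def __init__(self, data):
        self.data = dict(data)
        self.round_trips = 0

    def __contains__(self, key):
        self.round_trips += 1          # each membership test = one request
        return key in self.data

    def bulk_delete(self, keys):
        self.round_trips += 1          # one request regardless of key count
        for k in keys:
            self.data.pop(k, None)     # s3-style: missing keys treated as deleted

def delitems_checked(store, keys):
    # the pattern described above: filter to existing keys, then bulk delete
    existing = [k for k in keys if k in store]
    if existing:
        store.bulk_delete(existing)
    return store.round_trips

store = CountingStore({f"c/{i}": b"" for i in range(100)})
print(delitems_checked(store, [f"c/{i}" for i in range(100)]))  # 101
```

Deleting 100 keys costs 101 round trips with the existence check, versus 1 for a blind bulk delete.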

From the zarr side, the solution would be to simply request deletion of keys regardless of whether they exist in the store; however, fsspec raises an exception if you try to `rm` a file that doesn't exist. So perhaps we need to add an exception-handling kwarg to `rm`, like the `on_error=` kwarg for `FSMap.getitems`. This would be more "s3ish" anyway, since the boto3 `delete_objects` method silently treats missing keys as deleted.
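A sketch of the proposed semantics (the `on_error=` parameter on `rm` is hypothetical here; it does not exist in fsspec today), modeling the store as a set of paths:

```python
def rm_sketch(existing_paths, paths, on_error="raise"):
    """Delete paths from a set-based store with getitems-style error handling.

    on_error="raise": error on the first missing path (current fsspec behavior)
    on_error="omit": silently skip missing paths (boto3 delete_objects style)
    on_error="return": return the list of missing paths instead of raising
    """
    missing = [p for p in paths if p not in existing_paths]
    if missing and on_error == "raise":
        raise FileNotFoundError(missing[0])
    for p in paths:
        existing_paths.discard(p)      # delete whatever is actually there
    return missing if on_error == "return" else None

store = {"a", "b"}
print(rm_sketch(store, ["a", "c"], on_error="omit"))  # None; "a" removed
print(sorted(store))  # ['b']
```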

Thoughts on this, @martindurant? I'm happy to open a PR on the fsspec side.

d-v-b avatar Jan 31 '23 21:01 d-v-b

Some methods in fsspec allow for `on_error=`, which can be `"ignore"`, `"return"`, or `"raise"`. This is generally better handled in async backends, since it makes more sense when we are operating on many paths concurrently. I would be happy to see this functionality applied more generally, but it belongs in fsspec rather than in `zarr.storage`. `FSStore`/`FSMap.delitems` would simply pass the list of paths to the filesystem.
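For reference, `FSMap.getitems` already exposes this pattern; a minimal demonstration using fsspec's in-memory filesystem (the mapper path is arbitrary):

```python
import fsspec

# FSMap.getitems accepts on_error=; with "omit", missing keys are
# silently dropped from the result rather than raising.
m = fsspec.get_mapper("memory://getitems-demo")
m["a"] = b"1"

out = m.getitems(["a", "missing"], on_error="omit")
print(out)  # {'a': b'1'}
```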

martindurant avatar Jan 31 '23 21:01 martindurant

Curious how often this comes up? Are there particular workflows that depend on regular deletion? Examples would be of interest.

jakirkham avatar Jan 31 '23 21:01 jakirkham

I noticed this issue when writing a large quantity of data to s3-backed arrays with `write_empty_chunks=False`. When `write_empty_chunks` is false, `delitems` is used to remove key: chunk pairs when a chunk is found to be empty.
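The write path described above can be sketched roughly as follows (a simplification, not zarr's actual implementation; `set_chunks` is a hypothetical helper):

```python
def set_chunks(store, chunks, fill_value=0):
    """Write chunks to a dict-like store, deleting chunks equal to fill_value."""
    to_delete = []
    for key, data in chunks.items():
        if all(v == fill_value for v in data):
            to_delete.append(key)      # empty chunk: delete instead of write
        else:
            store[key] = data
    for key in to_delete:              # this is where delitems comes in
        store.pop(key, None)

store = {"c/0": [1], "c/1": [2]}
set_chunks(store, {"c/0": [0], "c/1": [3]})
print(sorted(store))  # ['c/1']
```

Overwriting previously nonempty chunks with zeros thus triggers a delete per now-empty chunk, which is where the per-key existence checks become expensive.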

d-v-b avatar Jan 31 '23 23:01 d-v-b

So maybe we would want to go further and batch the deletion and the new write together.

jakirkham avatar Feb 01 '23 05:02 jakirkham

In this test, I subclass `FSStore` and override `delitems` to first try deleting keys without checking whether they exist. On the s3 backend this just works, but on local storage `FileNotFoundError` exceptions get raised; these are handled by a `try...except` block that falls back to the current implementation of `FSStore.delitems`. The base version of `delitems` is much slower than the patched version. Unless I made a big mistake here, we are leaving a lot of performance on the table.

import time

import zarr
from zarr.errors import ReadOnlyError
from zarr.storage import FSStore
from pydantic_zarr import ArraySpec
from rich import print

class FSStorePatched(FSStore):
    """
    Patch delitems to delete "blind", i.e. without checking if to-be-deleted keys exist.
    """
    def delitems(self, keys):
        if self.mode == "r":
            raise ReadOnlyError()
        try:  # blind delete; should be much faster
            nkeys = [self._normalize_key(key) for key in keys]
            # rm errors if you pass an empty collection
            if len(nkeys) > 0:
                self.map.delitems(nkeys)
        except FileNotFoundError:
            # fall back to the existence-checking behavior of the base class
            nkeys = [self._normalize_key(key) for key in keys if key in self]
            if len(nkeys) > 0:
                self.map.delitems(nkeys)

store_path = 's3://janelia-cosem-int/test.zarr'
array_spec = ArraySpec(shape=(128,), chunks=(1,), dtype='uint8', compressor=None, attrs={})
stores = {'main': FSStore(store_path), 'delitems_patch': FSStorePatched(store_path)}
for name, store in stores.items():
    array_spec.to_zarr(store, path=name, overwrite=True)

nreps = 8
results = {}

for name, store in stores.items():
    arr = zarr.open(store, path=name, write_empty_chunks=False, mode='a')
    results_local = []
    for n in range(nreps):
        start = time.time()
        arr[:] = 0
        results_local.append(time.time() - start)
    results[name] = results_local

print(results)

returns the following results:

{
    'main': [
        8.254249095916748,
        8.017521142959595,
        7.679267406463623,
        7.912853479385376,
        7.9481360912323,
        8.017132997512817,
        8.245590448379517,
        8.08942461013794
    ],
    'delitems_patch': [
        0.9219014644622803,
        1.8048920631408691,
        0.9862747192382812,
        1.1363592147827148,
        1.0299558639526367,
        1.054429054260254,
        0.8815197944641113,
        0.9305408000946045
    ]
}

I think the approach here via the `try...except` block is inelegant, but it works today without requiring changes to fsspec, in case anyone is interested in seeing this materialize as a PR. Longer term, we must decide whether attempting to delete a key that does not exist should be an error. Because the key space of zarr objects is statically defined, there is no chance that a store would attempt to delete an "invalid" key, so there is no utility in having deletion of an absent key represent an error state. For this to work, we might need to modify fsspec, in particular the local filesystem operations...

d-v-b avatar Sep 14 '23 21:09 d-v-b

(you will have to supply your own s3 bucket for this test to work)

d-v-b avatar Sep 14 '23 21:09 d-v-b