Performance optimization in `delitems`
In `FSStore.delitems`, only keys that exist in the store are deleted. This requires checking each key in a loop, which incurs a lot of overhead for filesystems like S3. I suspect that sequentially calling `FSStore.__contains__` here erodes much of the benefit of using a bulk delete.
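For reference, the current implementation is roughly the following (paraphrased from `zarr.storage.FSStore`; the `key in self` membership test is the per-key round trip in question):

```python
def delitems(self, keys):
    if self.mode == "r":
        raise ReadOnlyError()
    # only remove the keys that exist in the store;
    # on a remote filesystem, `key in self` is one request per key
    nkeys = [self._normalize_key(key) for key in keys if key in self]
    # rm errors if you pass an empty collection
    if len(nkeys) > 0:
        self.map.delitems(nkeys)
```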
From the zarr side, the solution would be to simply request deletion of keys regardless of whether they exist in the store or not; however, fsspec raises an exception if you try to `rm` a file that doesn't exist. So perhaps we need to add an exception-handling kwarg to `rm`, like the `on_error="omit"` option of `FSMap.getitems`. This would be more "s3ish" anyway, since the boto3 `delete_objects` method silently treats missing keys as deleted.
Thoughts on this, @martindurant? I'm happy to open a PR on the fsspec side.
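For reference, here is a sketch of the S3 semantics in question (the bucket name is a placeholder; running this requires your own bucket and credentials):

```python
import boto3

s3 = boto3.client("s3")
# delete_objects does not error on missing keys: they are simply
# reported back in the "Deleted" list, as if they had existed
resp = s3.delete_objects(
    Bucket="my-bucket",  # placeholder
    Delete={"Objects": [{"Key": "no/such/key"}]},
)
print(resp["Deleted"])  # [{'Key': 'no/such/key'}]
```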
Some methods in fsspec allow for `on_error=`, which can be "ignore", "return", or "raise". This is generally better handled in async backends, since it makes more sense when we are operating on many paths concurrently. I would be happy to see this functionality applied more generally - but it belongs more in fsspec than in zarr.storage. `FSStore`/`FSMap.delitems` would simply pass the list of paths to the filesystem.
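If that kwarg were plumbed through to deletion (hypothetical sketch below; neither `fs.rm` nor `FSMap.delitems` accepts `on_error=` today), `FSStore.delitems` could shrink to a straight pass-through:

```python
# Hypothetical: assumes FSMap.delitems grows an on_error= kwarg
# mirroring FSMap.getitems; it does not exist today.
def delitems(self, keys):
    if self.mode == "r":
        raise ReadOnlyError()
    nkeys = [self._normalize_key(key) for key in keys]
    if nkeys:
        # "ignore" would treat missing keys as already deleted,
        # matching S3's bulk-delete semantics
        self.map.delitems(nkeys, on_error="ignore")
```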
Curious how often this comes up? Are there particular workflows that depend on regular deletion? Examples would be of interest.
I noticed this issue when writing a large quantity of data to S3-backed arrays with `write_empty_chunks=False`. When `write_empty_chunks` is false, `delitems` is used to remove key: chunk pairs whenever a chunk is found to be empty.
So maybe we would want to go further and batch the deletion and the new write together.
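To make the trigger concrete, here is a minimal sketch of the delete-on-empty behavior (using an in-memory store for illustration; the overhead only hurts on remote filesystems like S3):

```python
import zarr
from zarr.storage import MemoryStore

store = MemoryStore()
arr = zarr.open(store, shape=(8,), chunks=(1,), dtype='uint8',
                write_empty_chunks=False, mode='w')
arr[:] = 1
print(len(store))  # chunk keys for all eight chunks are present
arr[:] = 0         # every chunk now equals the fill value (0)...
print(len(store))  # ...so zarr has deleted the chunk keys
```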
In this test, I subclass `FSStore` and override the behavior of `delitems` to first try deleting keys without checking whether they exist. On the S3 backend, this just works, but on local storage `FileNotFoundError` exceptions get raised; these are handled by a try...except block, which falls back to the current implementation of `FSStore.delitems`. It looks like the base version of `delitems` is way slower than the patched version. Unless I made a big mistake here, we are leaving a lot of performance on the table.
```python
import time

import zarr
from zarr.errors import ReadOnlyError
from zarr.storage import FSStore
from pydantic_zarr import ArraySpec
from rich import print


class FSStorePatched(FSStore):
    """
    Patch delitems to delete "blind", i.e. without checking whether
    the to-be-deleted keys exist.
    """

    def delitems(self, keys):
        if self.mode == "r":
            raise ReadOnlyError()
        try:  # should be much faster
            nkeys = [self._normalize_key(key) for key in keys]
            # rm errors if you pass an empty collection
            if len(nkeys) > 0:
                self.map.delitems(nkeys)
        except FileNotFoundError:
            # fall back to the current upstream behavior:
            # only delete the keys that exist in the store
            nkeys = [self._normalize_key(key) for key in keys if key in self]
            # rm errors if you pass an empty collection
            if len(nkeys) > 0:
                self.map.delitems(nkeys)


store_path = 's3://janelia-cosem-int/test.zarr'
array_spec = ArraySpec(shape=(128,), chunks=(1,), dtype='uint8', compressor=None, attrs={})

stores = {'main': FSStore(store_path), 'delitems_patch': FSStorePatched(store_path)}
[array_spec.to_zarr(store, path=name, overwrite=True) for name, store in stores.items()]

nreps = 8
results = {}

for name, store in stores.items():
    arr = zarr.open(store, path=name, write_empty_chunks=False, mode='a')
    results_local = []
    for n in range(nreps):
        start = time.time()
        arr[:] = 0
        results_local.append(time.time() - start)
    results[name] = results_local

print(results)
```
Running this returns the following results (times in seconds; you will have to supply your own S3 bucket for the test to work):
```
{
    'main': [
        8.254249095916748,
        8.017521142959595,
        7.679267406463623,
        7.912853479385376,
        7.9481360912323,
        8.017132997512817,
        8.245590448379517,
        8.08942461013794
    ],
    'delitems_patch': [
        0.9219014644622803,
        1.8048920631408691,
        0.9862747192382812,
        1.1363592147827148,
        1.0299558639526367,
        1.054429054260254,
        0.8815197944641113,
        0.9305408000946045
    ]
}
```
I think the approach here via the try...except block is very inelegant, but it totally works today without requiring changes to fsspec, in case anyone is interested in seeing this materialize as a PR.
Longer term, we must decide whether attempting to delete a key that does not exist should be an error. Because the key space of zarr objects is statically defined, there is no chance that a store would attempt to delete an "invalid" key, so there is no utility in having "delete key + absent key" represent an error state. For this to work, we might need to modify fsspec, in particular the local filesystem operations...
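The asymmetry between backends is easy to demonstrate; here is a minimal sketch of the local behavior (S3's bulk delete, by contrast, reports missing keys as deleted):

```python
import fsspec

fs = fsspec.filesystem("file")
try:
    # LocalFileSystem.rm raises for paths that don't exist,
    # so "blind" bulk deletion fails on local storage
    fs.rm(["/tmp/zarr-missing-key-0", "/tmp/zarr-missing-key-1"])
except FileNotFoundError as e:
    print("local rm raised:", e)
```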