
write_element and delete_element_from_disk are very slow when a SpatialData object contains a large number of elements

Open mjheid opened this issue 7 months ago • 1 comments

I work on a dataset with 1000 images and create and delete many labels for these images. Writing and deleting label elements via write_element / delete_element_from_disk can take up to 60 s per element when the SpatialData object contains more than 10k elements. Most of the slowdown happens in elements_paths_on_disk in spatialdata._core.spatialdata. The following change fixed the issue for me; with it, saving ~1000 labels takes 40 s:

def elements_paths_on_disk(self) -> list[str]:
    """
    Get the paths of the elements saved in the Zarr store.

    Returns
    -------
    A list of paths of the elements saved in the Zarr store.
    """
    if self.path is None:
        raise ValueError("The SpatialData object is not backed by a Zarr store.")
    store = parse_url(self.path, mode="r").store
    elements_in_zarr = []

    groups_stored = store.listdir()
    for group in groups_stored:
        if group in ["images", "labels", "points", "shapes"]:
            group_elems = [os.path.join(group, elem) for elem in store.listdir(group)]
            elements_in_zarr.extend(group_elems)
    return elements_in_zarr
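
For a store that lives in a local directory, the same idea (list only the top-level element groups and their direct children, instead of walking the whole tree) can be sketched with the standard library alone. This is a minimal sketch and not spatialdata code: the function name list_element_paths is hypothetical, and it assumes the on-disk layout is <root>/<element_type>/<element_name>, as in the method above.

```python
import os

# Element groups considered, mirroring the list in the method above.
ELEMENT_GROUPS = ("images", "labels", "points", "shapes")


def list_element_paths(root: str) -> list[str]:
    """Return '<group>/<element>' for every element directory under root.

    Hypothetical helper: only two directory levels are listed, so the cost
    grows with the number of elements, not with the number of chunks.
    """
    paths = []
    for group in ELEMENT_GROUPS:
        group_dir = os.path.join(root, group)
        if not os.path.isdir(group_dir):
            continue  # this element type was never written
        with os.scandir(group_dir) as entries:
            for entry in entries:
                # Metadata files stored next to the elements are plain
                # files, so keeping directories only skips them.
                if entry.is_dir():
                    paths.append(f"{group}/{entry.name}")
    return sorted(paths)
```

The paths are joined with "/" rather than os.path.join so that they look the same on every platform.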

In delete_element_from_disk, the call to write_consolidated_metadata also takes a long time (~1 min). Users who expect to delete many elements should call sdata.write() with consolidate_metadata=False. Alternatively, delete_element_from_disk could be rewritten so that, when it is given a list of elements to delete, write_consolidated_metadata is called only once, after the whole list has been processed.
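
The batching idea can be illustrated with a toy model. This is only a sketch of the pattern: ToyStore and its methods are hypothetical stand-ins, not spatialdata API, and the consolidation step is reduced to a counter so that the number of expensive calls is visible.

```python
class ToyStore:
    """Hypothetical stand-in for a backed store; not part of spatialdata."""

    def __init__(self, names):
        self.elements = set(names)
        self.consolidations = 0  # how often the expensive step ran

    def consolidate(self):
        # In a real store this step rescans the metadata of every element,
        # so its cost grows with the size of the store.
        self.consolidations += 1

    def delete_one(self, name, consolidate=True):
        self.elements.discard(name)
        if consolidate:
            self.consolidate()

    def delete_many(self, names):
        # Delete everything first, then consolidate a single time.
        for name in names:
            self.delete_one(name, consolidate=False)
        self.consolidate()


names = [f"test{i}" for i in range(1500)]

naive = ToyStore(names)
for name in names:
    naive.delete_one(name)  # consolidates after every deletion

batched = ToyStore(names)
batched.delete_many(names)  # consolidates once at the end

print(naive.consolidations, batched.consolidations)  # 1500 1
```

Both stores end up empty; the only difference is how many times the consolidation step runs.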

Code to reproduce problem:

from spatialdata.datasets import blobs
import numpy as np
import time
import spatialdata as sd

sdata = blobs()
sdata.write('test', consolidate_metadata=True)

test = np.empty((1,1), dtype=np.uint8)
for i in range(1500):
    sdata[f'test{i}'] = sd.models.Labels2DModel().parse(test, dims=('y','x'))
    start = time.time()
    sdata.write_element(f'test{i}')
    print(f'Wrote test{i} in ', time.time()-start)
for i in range(1500):
    start = time.time()
    sdata.delete_element_from_disk(f'test{i}')
    print(f'Deleted test{i} in ', time.time()-start)

mjheid · May 21 '25 12:05

Sorry for the late response, and thanks for the report! I will add an action point to cover this in the tutorial.

melonora · Jun 10 '25 09:06