Can I speed up spatialdata.polygon_query when dealing with really big datasets?
Hello,
thank you for providing the community with such a great tool!
I was wondering if you could help or advise on the following.
I have a very large Xenium dataset (~2 million cells). I have read it, saved it as a Zarr file, and read it back into my notebook. Separately, I have used QuPath to create annotations for my images, saved them as GeoJSON files, read them back in as GeoDataFrames, and subsequently converted them into polygons.
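For reference, this is roughly how I am loading the annotations; the file name is just a placeholder, and the QuPath export may contain extra properties that I am ignoring here:

```python
import geopandas as gpd

# Read the annotations exported from QuPath as GeoJSON (placeholder file name).
annotations = gpd.read_file("qupath_annotations.geojson")

# Keep only polygonal geometries and pull them out as shapely objects.
annotations = annotations[annotations.geom_type.isin(["Polygon", "MultiPolygon"])]
polygons = list(annotations.geometry)
polygon = polygons[0]  # one region of interest to pass to polygon_query below
```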
I would like to add this metadata to my sdata object so that I can annotate and filter my sdata.table based on it. I am currently trying to do this with the following piece of code, essentially creating subsets of my dataset based on the polygons, to which I can then assign the annotation.
```python
from spatialdata import polygon_query  # type: ignore

cropped_sdata2 = polygon_query(
    sdata,
    polygon=polygon,
    target_coordinate_system="global",
)
```
This piece of code is extremely slow because of the size of the dataset, so I was wondering if there is a way to further speed up this function (I am already running this analysis on a server), or if there is a simpler way to incorporate my annotations into the sdata object.
Thanks a lot for your help!
Cheers, Anastasia
Hi, thanks for reaching out and for the feedback. We are aware of the performance bottleneck in query operations, and in this issue https://github.com/scverse/spatialdata/issues/742 I discuss some of the options that can be considered to speed up querying data.
In short:
1. One could use a spatial index to speed up geopandas.sjoin (a rough sketch of 1. and 2. follows after this list).
2. For large collections of shapes, also using dask-geopandas + spatial partitioning would reduce the memory usage by limiting the number of shapes loaded from disk (with the current implementation all the shapes are loaded in-memory at once).
3. After 1), one could also try out using GPU acceleration.
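To make points 1. and 2. a bit more concrete, here is a rough, untested sketch of how the annotation could be done outside of polygon_query by spatially joining cell centroids against the QuPath polygons with geopandas.sjoin (which is backed by a spatial index). The element name "cell_circles", the GeoJSON file name, and the "name" column are placeholders that depend on how your data and annotations are organised, and the sketch assumes the polygons and the cell coordinates live in the same coordinate system:

```python
import geopandas as gpd

# Annotation polygons exported from QuPath; "name" is assumed to hold the
# annotation label (adjust to however your GeoJSON stores it).
regions = gpd.read_file("qupath_annotations.geojson")[["name", "geometry"]]

# Cell centroids from the Xenium shapes element ("cell_circles" is a placeholder).
cells = sdata.shapes["cell_circles"]
centroids = gpd.GeoDataFrame(geometry=cells.geometry.centroid, index=cells.index)

# Spatial-index-backed join: each cell is matched to the polygon containing it,
# without cropping the whole SpatialData object.
joined = gpd.sjoin(centroids, regions, how="left", predicate="within")
joined = joined[~joined.index.duplicated()]  # assumes the regions do not overlap

# Transfer the label to the table; this assumes the shapes index matches
# obs_names (otherwise match on the cell id column in sdata.table.obs).
sdata.table.obs["qupath_region"] = joined["name"].reindex(sdata.table.obs_names).values
```

If holding all ~2 million centroids in one frame is too heavy, the same join could be expressed with dask-geopandas and spatial partitioning, so that only one chunk of cells is processed at a time (also untested; note that dask-geopandas' sjoin only supports inner joins):

```python
import dask_geopandas

# Spatially partition the centroids so each partition covers a compact region.
dcentroids = dask_geopandas.from_geopandas(centroids, npartitions=32).spatial_shuffle()
dregions = dask_geopandas.from_geopandas(regions, npartitions=1)

# Inner spatial join computed partition by partition.
joined = dask_geopandas.sjoin(dcentroids, dregions, predicate="within").compute()
```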
Unfortunately, at the moment we are putting most of our effort into the file format side, so we won't have time to implement the above, but our plan is to write some exploratory code at one of our hackathons. This was planned for a scverse hackathon in November 2024, but we didn't manage to cover the topic. We will try to have a look at this at the next hackathon in April 2025 🤞.
Thank you so much for the very quick and detailed response! I will explore the options you have listed here, as well as what you have already discussed in issue #742, and see if I can speed up what I am currently running.
It is really great to know that you guys are going to inspect this further in April!
I hope the hackathon is fruitful : )
Best! Anastasia