Vector Dataset Backends
Summary
We currently use geopandas as our backend for reading and storing geometry data. While geopandas is great, it's known to be quite slow compared to DuckDB and, more recently, SedonaDB when performing spatial filtering operations on a single node once the number of geometries grows large.
SedonaDB performed a benchmark comparison using SpatialBench.
We should support more than just geopandas as a backend. I propose we implement a torchgeo.datasets.SedonaDBDataset for loading vector datasets into SedonaDB dataframes and performing spatial filtering on them.
cc: @jiayuasu @rbavery @paleolimbot
Looking forward to it!!
> We should support more than just geopandas as a backend. I propose we implement a torchgeo.datasets.SedonaDBDataset for loading vector datasets into SedonaDB dataframes and performing spatial filtering on them.
To clarify, we use geopandas for many things:
- VectorDataset: reading vector files from disk
- GeoDataset: reading some custom datasets (e.g., point datasets)
- GeoDataset: storing spatiotemporal metadata
- GeoSampler: determining where to query from
- IntersectionDataset: computing spatiotemporal intersection
- UnionDataset: computing spatiotemporal union
- Splitters: computing splits based on bboxes, roi, toi, etc.
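To illustrate the "storing spatiotemporal metadata" and "determining where to query from" roles above, here is a hedged sketch of a file index held as a GeoDataFrame and queried by bbox plus time window. The column names (`filepath`, `datetime`) are illustrative, not torchgeo's actual schema.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical index: one row per source file, geometry = file footprint.
index = gpd.GeoDataFrame(
    {
        "filepath": ["a.gpkg", "b.gpkg", "c.gpkg"],
        "datetime": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"]),
    },
    geometry=[box(0, 0, 1, 1), box(2, 2, 3, 3), box(0, 0, 1, 1)],
    crs="EPSG:4326",
)

# Spatiotemporal query: region of interest plus a time window.
roi = box(0, 0, 1.5, 1.5)
hits = index[
    index.intersects(roi)
    & (index["datetime"] >= "2020-01-01")
    & (index["datetime"] <= "2020-12-31")
]
print(hits["filepath"].tolist())  # ['a.gpkg']
```

Every bullet above ultimately reduces to operations like this on the index, which is why swapping the backend touches so much of the library.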
Depending on what you are proposing:
1. Add support for reading from SedonaDB and converting it to our geopandas representation 👍
2. Replace geopandas with SedonaDB 👎
3. Support multiple backends (R-tree, geopandas, SedonaDB, etc.) for all of the above 👎
4. A third kind of base class (GeoDataset, NonGeoDataset, SedonaDBDataset) 👎
I'm fine with 1, and it seems easy. See below for my thoughts on 2. Option 3 sounds like an absolute nightmare, as different backends support completely different features. Option 4 is also a no-go, since it wouldn't be compatible with any existing datasets. We're already trying to find ways to unify GeoDataset and NonGeoDataset (see the TileDataset discussion); I don't want to go in the opposite direction.
For reference, the switch from R-tree to geopandas took 6 months of research, 1 month of implementation, and 1 month of review. I hope to update backends roughly once every 5 years. SedonaDB was not included in the initial search because no one suggested it when I asked around, and it doesn't look as easy to install and configure, though I could be wrong. Ease of installation and ease of use are my top priorities; speed is somewhere further down the list. I'd be happy to include it in the search table if you want to do the research; we can discuss this at the next time series meeting.
If you take a look at my PR, it is not that much effort. I'm not proposing swapping out the entire backend for every single dataset. The main issue is that VectorDataset has no methods that abstract out the slow query parts so they can be overridden by another class.
I think you are incorrect to de-prioritize speed. geopandas is very slow, and our initial paper proposed fast geospatial dataloading. We should provide alternatives when the gains are a 5x+ speedup.
From a user's perspective, they aren't going to experience any change in ease of use or installation because it would all be abstracted away from them.
I see SedonaDBDataset as similar to XarrayDataset. Not sure why this is any different imo.
So your proposal is 3, but for limited parts of the library? If we do that, I think we'll be constantly switching back and forth between geopandas and SedonaDB, which sounds slower than staying in-memory. Do we create and instantiate a SedonaDB version of the dataset at every invocation, or cache it somehow?
Note that apache-sedona doesn't yet have Python 3.14 wheels. Not a deal breaker, but I was hoping to start testing 3.14 as soon as rasterio makes a new release.
SedonaDB is a Wherobots-maintained library, so I can just ask my colleagues to add Python 3.14 wheels; no big deal there.
I think we need to consider which parts of vector dataloading are incredibly slow and which are negligible, and support different backends for those parts only.
I think we can be smart about where to make spatial intersection/union operations modular so they can be elegantly overridden. For example, I noticed VectorDataset kind of just does everything inside the getitem method. We can abstract this into separate components that can be overridden: self.filter_index(), self.filter_geometries(), self.load_target().
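The refactor above could be sketched as follows. This is a minimal toy, not torchgeo's actual VectorDataset: the hook names follow the comment, and the in-memory `data` GeoDataFrame stands in for on-disk files so the example is self-contained.

```python
import geopandas as gpd
from shapely.geometry import Point, box


class VectorDataset:
    """Sketch: __getitem__ split into overridable steps (names hypothetical)."""

    def __init__(self, index: gpd.GeoDataFrame, data: gpd.GeoDataFrame) -> None:
        self.index = index  # one row per source file: geometry = file footprint
        self.data = data    # stand-in for geometries read from disk

    def __getitem__(self, query):
        rows = self.filter_index(query)              # which files intersect the query
        geoms = self.filter_geometries(rows, query)  # clip geometries to the query
        return self.load_target(geoms, query)        # convert to a training sample

    def filter_index(self, query):
        return self.index[self.index.intersects(query)]

    def filter_geometries(self, rows, query):
        # geopandas default. A SedonaDB-backed subclass could override just
        # this method and push the spatial predicate down to the database.
        return self.data[self.data.intersects(query)]

    def load_target(self, geoms, query):
        # Rasterization etc. would happen here; return geometries as-is.
        return {"geometry": geoms}


index = gpd.GeoDataFrame(geometry=[box(0, 0, 10, 10)], crs="EPSG:4326")
data = gpd.GeoDataFrame(geometry=[Point(1, 1), Point(9, 9)], crs="EPSG:4326")
ds = VectorDataset(index, data)
sample = ds[box(0, 0, 2, 2)]
print(len(sample["geometry"]))  # 1
```

With this shape, an alternative backend only overrides the slow hooks and inherits everything else.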
I was hoping to get an initial PR in for SedonaDBDataset, similar to XarrayDataset, that is purely opt-in for users who want to try faster dataloading. Then we can go from there. The scope of this issue is a bit larger than what my PR proposes, so maybe we should move the conversation there.
> Note that apache-sedona doesn't yet have Python 3.14 wheels.

The sedonadb package on PyPI has 3.14 wheels (apache-sedona is wrapper code that installs nice user-facing dependencies and is updated at a slower release cadence). You should probably use the lower-level PyPI package/API.
I'm too new to weigh in on the implementation details, but if there's an operation that can be made 8 times faster it seems like adding an experimental opt-in feature may be a reasonable way to evaluate some of these tradeoffs.