
SedonaDB Dataset


Summary

See the reasoning in #3160.

This PR implements the following:

  • abstracts the spatial filtering ops in VectorDataset into a VectorDataset.filter_index method that can be overridden by new backends (see the sketch after this list)
  • implements torchgeo.datasets.SedonaDBDataset
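
A minimal sketch of the filter_index abstraction, assuming the query is a shapely geometry covering the sampled window and the index is a GeoDataFrame (names and signatures here are illustrative, not the final API):

```python
import geopandas as gpd
import shapely


class VectorDataset:
    index: gpd.GeoDataFrame  # one row per file: filepath, datetime, bounds

    def filter_index(self, window: shapely.Geometry) -> gpd.GeoDataFrame:
        # Default backend: filter the in-memory GeoDataFrame index in Python.
        return self.index[self.index.intersects(window)]


class SedonaDBDataset(VectorDataset):
    def filter_index(self, window: shapely.Geometry) -> gpd.GeoDataFrame:
        # Override: push the same predicate down to SedonaDB as SQL so the
        # engine performs the spatial filtering instead of Python.
        sql = (
            "SELECT * FROM file_index "
            f"WHERE ST_Intersects(geometry, ST_GeomFromText('{window.wkt}'))"
        )
        return self._sedonadb_sql(sql)  # hypothetical helper wrapping the engine
```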

Benchmarking

I added a benchmarking script that iteratively queries the filter_index method of both datasets; the results are below. The data I used is the Washington state buildings from the Microsoft Open Buildings dataset (converted to Parquet):

VectorDataset:

  • Initialization time: 2.27 seconds
  • Filter time (50 slices): 105.54 seconds
  • Time per slice: 2.11 seconds
  • Total geometries found: 21,235
  • Geometries per second: 201.20

SedonaDBDataset:

  • Initialization time: 2.06 seconds
  • Filter time (50 slices): 12.89 seconds
  • Time per slice: 0.26 seconds
  • Total geometries found: 21,235
  • Geometries per second: 1,647.26

Speedup: 8.19x

SedonaDBDataset is about 8.19x faster than VectorDataset for filtering operations, processing ~1,647 geometries/second vs ~201 geometries/second. Both found the same 21,235 geometries, confirming correctness.
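
For reference, the core of the benchmark is just a timing loop of roughly this shape (a paraphrase rather than the actual script; the filter_index signature and window construction are assumptions):

```python
import time


def benchmark_filtering(dataset, windows):
    # Time repeated spatial filters against the dataset's index and report
    # the same metrics listed above.
    total = 0
    start = time.perf_counter()
    for window in windows:
        total += len(dataset.filter_index(window))
    elapsed = time.perf_counter() - start
    print(f"Filter time ({len(windows)} slices): {elapsed:.2f} seconds")
    print(f"Time per slice: {elapsed / len(windows):.2f} seconds")
    print(f"Total geometries found: {total:,}")
    print(f"Geometries per second: {total / elapsed:.2f}")
```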

TODO

Possibly need to implement separate SedonaDBVectorDataset and SedonaDBRasterDataset classes that work with IntersectionDataset and UnionDataset (could be in a follow-up PR though)

cc: @jiayuasu @rbavery @paleolimbot

isaaccorley · Dec 01 '25 22:12

@paleolimbot I appreciate the review on creating the index. I did some initial benchmarks and sedonadb ABSOLUTELY SLAPS at spatial intersection filtering (see the PR description above). cc: @calebrob6 @adamjstewart

isaaccorley · Dec 02 '25 04:12

This PR is still a WIP at the moment, but so far it's looking verrrry good. Most of the remaining work is modifying the constructor to use sedonadb to create the index.

Basically the following happens:

  • glob all the files and create an index containing the filepath, datetime, and bounds of each file (this still needs to be sedonadb-ified)
  • in __getitem__, query the index with the slice to find the files of interest, then read and filter the geometries within those files using the query (mostly done, but it could probably use some optimization); a rough sketch of both steps follows
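
Something like this, purely as a geopandas-flavored sketch of those two steps (the sedonadb version replaces the index and filtering pieces; the helpers and column names are illustrative):

```python
import glob
import os

import geopandas as gpd
import pandas as pd


def build_index(root: str) -> gpd.GeoDataFrame:
    # Step 1: glob all files and record filepath, datetime, and bounds.
    records = []
    for filepath in glob.glob(os.path.join(root, "**", "*.parquet"), recursive=True):
        records.append(
            {
                "filepath": filepath,
                "datetime": parse_datetime(filepath),    # hypothetical helper
                "geometry": read_file_bounds(filepath),  # hypothetical helper
            }
        )
    return gpd.GeoDataFrame(records, crs="EPSG:4326")


def query_files(index: gpd.GeoDataFrame, window) -> gpd.GeoDataFrame:
    # Step 2: find files whose bounds intersect the query window, then read
    # and filter only the geometries inside those files.
    hits = index[index.intersects(window)]
    parts = []
    for filepath in hits["filepath"]:
        gdf = gpd.read_parquet(filepath)
        parts.append(gdf[gdf.intersects(window)])
    return pd.concat(parts) if parts else gpd.GeoDataFrame()
```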

isaaccorley · Dec 02 '25 04:12

Would a "driver" flag in VectorDataset (that could be "geopandas" or "sedonadb") be possible? E.g. as a user if I want to use an existing vectordataset but want the sick gains of sedonadb, it'd be cool if I could just switch somehow.

Also, I like this benchmarking script and am curious how the old backend performs on it.

calebrob6 · Dec 03 '25 16:12

Would like to continue the discussion in #3160 once Isaac returns from paternity leave, but some minor comments on the proposed implementation:

I'm not opposed to supporting multiple backends, but note:

  • This logic is not specific to VectorDataset; the same approach could be applied to all GeoDatasets
  • As mentioned in #3160, this is only a minor fraction of the places we use geopandas; see #2747 for a full list of the locations required to truly make TorchGeo backend-agnostic
  • The benchmarking script here is cool, but note that we already have a benchmarking script for RasterDataset. We should either port this to IOBench or replace IOBench with this. It would be nice to integrate it with our torchgeo script so that it works post-installation
  • The benchmarking done here doesn't tell me much about SedonaDB's performance for insertion, intersection, union, caching, pickling, etc. It also doesn't tell me how SedonaDB scales compared to R-tree, Shapely, GeoPandas, or STAC, and I don't know what features SedonaDB supports. We should consider adding SedonaDB to our literature review for TorchGeo 1.0 (currently a private git repo, but I can give people access).

More generally, the reason we didn't consider database backends like PostGIS/DuckDB/SedonaDB is my fear that this would require users to know about these technologies, install non-Python dependencies, and manually set up their own databases. Also, no one suggested them when I asked around during our 6-month backend search. For reference, I tried many times to get the GeoTorchAI unit tests running on my laptop, but this didn't work because I didn't have SedonaDB installed. If this has changed and the setup process is now much easier, we can revisit these, as I expect them to be quite performant, with speed depending largely on I/O. However, note that not all users may have write access on the systems they run TorchGeo on, so we may not be able to switch to a file-based DB as the default.

adamjstewart · Dec 08 '25 11:12

@adamjstewart Hi Adam,

Thank you again for all your contributions to TorchGeo. Since Isaac is currently on leave, I wanted to clarify a few things here.

SedonaDB and Apache Sedona

SedonaDB is a new subproject under Apache Sedona, but it is not the same as SedonaSpark, SedonaFlink, or the other distributed Sedona engines. SedonaDB is a single-machine data processing tool that requires zero installation beyond a simple pip install apache-sedona[db]. It is designed for embedded and self-contained environments. The wheel files for SedonaDB are available on PyPI at
https://pypi.org/project/sedonadb/

SedonaDB and PostGIS

SedonaDB is designed specifically for embedded and self-contained use cases. Unlike PostGIS, it requires zero database setup and no data ingestion process: it works directly on your data files, with a user experience identical to GeoPandas.
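
As a rough illustration of the zero-setup workflow (the Python calls below are approximate; please refer to the SedonaDB docs for the exact API):

```python
# No server and no ingestion step: connect to the in-process engine
# and query files directly. Entry point names here are approximate.
import sedona.db as sd

con = sd.connect()
buildings = con.read_parquet("washington_buildings.parquet")
buildings.to_view("buildings")
con.sql(
    "SELECT COUNT(*) FROM buildings "
    "WHERE ST_Intersects(geometry, ST_GeomFromText("
    "'POLYGON((-122.4 47.5, -122.2 47.5, -122.2 47.7, -122.4 47.7, -122.4 47.5))'))"
).show()
```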

PostGIS is more of a transactional database. For analytical workloads such as filtering, joins, unions, and aggregations, SedonaDB is orders of magnitude faster. Because of how large the performance gap is, we did not even include PostGIS in our SpatialBench comparison since it would be an extremely unfair comparison.

SedonaDB and GeoTorchAI

GeoTorchAI was a research prototype that relied on Apache Sedona through SedonaSpark rather than SedonaDB. I agree with you that SedonaSpark can be complicated to operate in certain environments.

SedonaDB functionalities

SedonaDB was released in September 2025 and is positioned as a GeoPandas alternative. You can find the full list of supported functions here
https://sedona.apache.org/sedonadb/latest/reference/sql/

We also conducted a comprehensive benchmark comparing SedonaDB, GeoPandas, and DuckDB using SpatialBench
https://sedona.apache.org/spatialbench/single-node-benchmarks/

SpatialBench is designed to evaluate geospatial analytical performance across different systems. It examines performance from multiple angles, including:

  • individual spatial functions such as filtering, intersection, and union
  • complex and heavy spatial joins
  • automatic query optimization across combined operations

As for inserting new rows into an existing GeoPandas DataFrame, I do not believe GeoPandas currently supports this, nor do SedonaDB or DuckDB.

Hope this helps clarify things and addresses your concerns.

jiayuasu · Dec 08 '25 20:12

Thanks, this actually helps a lot. So we didn't consider SedonaDB during our search because it didn't exist at the time we did our literature review. Glad to know that it's easier to install than Apache Sedona and doesn't require any setup.

SpatialBench is actually of great interest to me. We have our own benchmarks comparing R-tree, Shapely, GeoPandas, and STAC. It would be interesting to add some of those to the comparison, although they all obviously have very different features.

I still need to think about the best way to do this. I really don't want a new SedonaDBDataset, as we would have to duplicate all 30+ existing GeoDataset subclasses to actually take advantage of this. Maybe a backend='sedonadb' parameter to GeoDataset and friends would help. Again, there are hundreds of places we use the geopandas representation, not just filtering. We don't have to replace all locations, but if you really want speedups, that may be necessary.

I'll add this to the agenda for our monthly Technical Steering Committee meeting. Not sure if we'll get to it in January or February but I'm more open to this idea now that I understand it better.

adamjstewart · Dec 09 '25 12:12

Part of TorchGeo's API design is to use inheritance by providing base classes. For this reason, I don't think it's necessary in this PR to completely redo the backend and add it to every single dataset that inherits from GeoDataset. I was hoping to scope this to an experimental SedonaDBDataset, behind an optional dependency, that users can start to experiment with. As mentioned, the gains in spatial intersection are quite large, and I don't think we should delay getting these kinds of speedups into the library.

It also helps our Wherobots developers start considering optimizations that would be particularly useful for improving geospatial sampling in ML training workflows.

isaaccorley · Dec 09 '25 14:12

Well, this won't make it into a release for quite some time regardless of whether we merge today or in a few months. I want to speed up the release cycle after 1.0, but at the moment we're busy breaking GeoDataset and GeoSampler to add time series support. These features need time to test and mature before making it into a release, especially because they are backwards-incompatible.

adamjstewart · Dec 09 '25 15:12

Well, this won't make it into a release for quite some time regardless of whether we merge today or in a few months. I want to speed up the release cycle after 1.0,

We should discuss this at our monthly meeting, since it seems you are prescribing a release schedule without input from the rest of the maintainers. My opinion is that we should really increase our release frequency, particularly for certain types of features, and I imagine others would agree. Adding new UNet model weights, for example, should not take 6 months to become available on PyPI. Whether we're at 1.0 or not isn't really relevant to this PR.

but at the moment we're busy breaking GeoDataset and GeoSampler to add time series support. These features need time to test and mature before making it into a release, especially because they are backwards-incompatible.

Large breaking changes like these, spread over multiple PRs, should be considered for a dev or time-series branch that can be fully merged at some point in the future, rather than being pushed directly to main. Those features don't really have anything to do with the SedonaDBDataset in this PR, considering what's currently on main, and shouldn't hold up other features from being merged.

My overall 2c is that features like this should get merged sooner rather than later instead of being stalled for reasons that aren't relevant to what's proposed in the PR.

isaaccorley · Dec 09 '25 16:12