duckdb icon indicating copy to clipboard operation
duckdb copied to clipboard

Add basic support for GeoSpatial type

Open handstuyennn opened this issue 4 years ago • 11 comments

Hi DuckDB team,

I would like to implement the GeoSpatial data type in DuckDB. This will require quite a bit of work, so I propose breaking it up into two parts. The first part will be implementing the following, which will not be a ‘complete feature’ but I think would be good to get some feedback on making sure I am along the right path with the implementation. Here is how I would break it down: Part 1 (partial implementation)

  • [ ] Support for the POINT geospatial object.
  • [ ] Support for one Constructor function, ST_MAKEPOINT.
  • [ ] Support for one Input function, ST_GEOGFROM.
  • [ ] Support for one Output function, ST_ASTEXT.
  • [ ] Support for one Accessor function, ST_X.
  • [ ] Support for one Calculation/main function, ST_DISTANCE.
  • [ ] Support for one Transformation function, ST_CENTROID. (Note: this will return itself for a point, so for the time being it will be trivial).

Once I complete this I will discuss what I’ll work on for Part 2, but it will be adding basic support for the main geospatial objects, such as LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection, and adding some of the most common functions for each of the six function types, and provide suggestions for future improvements if someone want to continue on the work (such as adding in indexes or more functions). I will be using PostGIS and MySQL for references on open source implementations, though I think PostGIS is the standard here and since it works so well with Postgres, I will probably use that primarily for reference and copying over some of the main components. @Mytherin -- could you please review the above and let me know your thoughts on this?

handstuyennn avatar Dec 23 '21 11:12 handstuyennn

Thanks for your interest in implementing this! I think that sounds good. We would want this implemented as an extension (similar to the ICU extension) rather than as part of the core library. You could create a geo folder in the extension project that contains the code required. Points and other data structures could be implemented as structs (e.g. POINT is really just ROW(x INT, y INT)). Functions can be added to operate on those structures similar to how other extensions currently register functions.

Mytherin avatar Dec 23 '21 15:12 Mytherin

Hi @Mytherin,

Thanks for the feedback and glad that it seems like it'd be useful for the library. I do have a few concerns about using Struct as the base type, especially when encoding more complex object, such as Polygon or GeometryCollection.

I think perhaps using the Binary type would give us the greatest flexibility, and also the way Postgis stores this (WKB is pretty common), and would help a tremendous amount in allowing us to more easily port over existing functions from there. What do you think?

handstuyennn avatar Dec 24 '21 05:12 handstuyennn

Binary will be less efficient than structs in the case of simple types. For example, if you store a POINT as a ROW(x INT, y INT) the system will essentially just be storing two integer columns. Integer columns have a lot of nice properties, such as they are fixed size, easy to compress, and cheap to propagate through the system. Storing the same type as a single BLOB column means every value now needs to be annotated with the length, compression schemes will generally be less effective as the data is now stored in row-wise order, and generally more heavy weight compression schemes will need to be used because nice integer compression schemes cannot be used on blob data.

In Postgres this matters less since (a) data is stored row-wise anyway, (b) Postgres does not have support for compression on regular columns, and (c) because of Postgres' storage layout there is not as much of a difference between storing two integers and storing a blob, e.g. there will always be per-row length fields stored which are not required in DuckDB.

Many of the benefits of DuckDB's columnar compressed storage over Postgres' storage are lost if you store types as Binary. I think most geometry can be stored using a combination of structs and lists, for example, a Polygon is really just a list of points.

I imagine that Binary types are easier to work with internally, especially if that allows for code re-use (although remember that PostGIS is licensed under GPL2, so please don't send PRs to this repo containing code from there). If you are going to be the one implementing this it is really up to you, but I imagine the best performance would be achieved using nested types.

Mytherin avatar Dec 24 '21 11:12 Mytherin

Hi! Ive thought about this as well, and even planned to maybe propose this as the subject of my masters thesis (eventually even reach out to CWI for supervision/review?). I agree that an implementation should try to leverage DuckDBs columnar format instead to improve upon the shortcomings of postgis performance wise.

There has been some talk about a geospatial type in arrow/avro/parquet, but it looks like it is somewhat tricky to make it flexible enough for all use-cases. Particularly, do we want to support/how do we support higher-dimensional geometry?

Personally, I think we should start by only supporting 2D geometry, potentially 3D as well with a separate class of types (Point3D, etc, I think thats how postgis does it?) as they seem to be the most commonly used by far. If a columnar geospatial format eventually ends up being stabilized we can adapt to it later, or maybe just implement it as a comaptible IO format but still store it internally using DuckDBs types)

We should probably include GEOS/GDAL and avoid writing our own geospatial algorithms to start, as long as we can convert to their internal geometry representation we should be able to use them to output WKB/GeoJSON (and a eventual arrow-based format if it ever gets stabilized, as I imagine one of the first implementations will be pushed to GEOS first), however It might not be zero-copy to begin with, but I think thats ok for an initial implementation.

Do you guys have any thoughts about implementing a future Raster type as well? Perhaps it would be better suited as binary, and implemented as another extension. If we succefully integrate the classic geospatial libraries I would expect we get a lot of raster processing functionality for free.

Maxxen avatar Dec 24 '21 12:12 Maxxen

Although, if we use Structs and Lists to store the actual data, how do we store metadata such as what projection the geometry is in, or precision? Do we include another column so that a point is ROW(x INT, y INT, srid ENUM) Or perhaps there are ways to enforce that on a columnar or table level? Im not sure what mechanisms there are to do that in DuckDB. Is it even desireable to store geometry of different projections in the same column?

Maxxen avatar Dec 24 '21 12:12 Maxxen

This seems to be a commonly requested feature, see: #2429, #1149

Maxxen avatar Dec 24 '21 15:12 Maxxen

Hi! Ive thought about this as well, and even planned to maybe propose this as the subject of my masters thesis (eventually even reach out to CWI for supervision/review?). I agree that an implementation should try to leverage DuckDBs columnar format instead to improve upon the shortcomings of postgis performance wise.

Happy to discuss further, email me at [email protected]

hannes avatar Dec 25 '21 06:12 hannes

Hi @Mytherin,

Thanks for the feedback. I have a question regarding general licensing information. It's not necessarily a question specific to this feature, but just for some clarification on importing libraries, functions, or parts from other codebases for this feature. Libraries that I may use for reference include the following with their licenses:

Of the above, which licenses and projects are we able to utilize? For example, to import a module? To use a function? etc. If you could give general guidance here that would be great, thanks for your time.

handstuyennn avatar Dec 27 '21 01:12 handstuyennn

MIT is fine as long as any copied code is labeled as such and includes the original license. MPL/LGPL/GPL is not if you want to include it in this repository.

Note that linking to those libraries as an external dependency is fine. Looking at behavior and checking that functions that you write have the same semantics/behavior as those other libraries is also fine. Including code directly from those repositories is not.

It is also possible to create the geom extension in a separate repository as an external DuckDB extension which would then be licensed under the GPL license. In that case code from any of those could be included, as long as the original license is included as well.

Mytherin avatar Dec 27 '21 19:12 Mytherin

Just happening by this issue out of curiosity; I'm not even a user of DuckDB (yet!), but thought I'd mention the following IMHO valuable geospatial libraries, purely because it might be worth thinking about them long before they're actually being tied into DuckDB.

  • Google's S2 (C++ implementation) is Apache 2.0 licensed; supports geospatial operations on a 2-sphere as well as a Hilbert curve based geospatial index. The spherical representation of Earth is, of course, only valid for low-accuracy applications.
  • Uber's H3 (C implementation) is likewise Apache 2.0 licensed and supports a hexagonal Hilbert curve based geoindex including movement encoding (go figure...).
  • PostGIS's liblwgeom specifically is valuable for great-arc distance calculations using non-projected data. As opposed to S2, it represents Earth as a spheroid, and hence is more precise.

A whole host of open source geospatial products are built around GEOS/GDAL/PROJ, dealing with planar geometries, as opposed to the libraries above. Hilbert curves are funny in that sense - they collapse spatial coordinates down to a single dimension, with all the B-tree advantages that entails.

mmyrte avatar Jul 30 '22 19:07 mmyrte

Hey @mmyrte,

Good list and great observation on planar-vs-spherical! I started to prototype an extension based on spherical geometry (geography type) and S2. It's very early stage and some more work is needed to migrate from a physical type to native DuckDB's complex type. But if you are curious, please have look.

https://github.com/dmitrykoval/duckdb/pull/1

dmitrykoval avatar Aug 03 '22 16:08 dmitrykoval

Hi @mytherin — I’m going to continue working on this issue now that custom types have beed added in from https://github.com/duckdb/duckdb/pull/4785. Could you please let me know if the above roadmap from my initial comment in this thread looks like a good way to continue working on this, or if you have any other feedback?

handstuyennn avatar Oct 04 '22 07:10 handstuyennn

Perhaps you could discuss this with @Maxxen?

Mytherin avatar Oct 04 '22 07:10 Mytherin

Let me know if I should add this elsewhere but I'd love to see, in addition to what is suggested above,:

  • ST_WITHIN
  • ST_INTERSECTS

lindsay-pettingill avatar Oct 14 '22 17:10 lindsay-pettingill

Hi. Just want to point here for reference the work that is being done on GeoParquet:

https://github.com/opengeospatial/geoparquet

And there has been some discussions on adding spatial index support in here:

https://github.com/opengeospatial/geoparquet/issues/13

And also on a Arrow efficient encoding format here:

https://github.com/geoarrow/geoarrow

Which is used by Geopandas.

So there is quite a bit to build on top of.

jatorre avatar Nov 08 '22 16:11 jatorre

Hi @handstuyennn,

Thank you for proposing this extremely useful feature. Is this still actively worked on?

mmerdes avatar Nov 11 '22 15:11 mmerdes

Hi @hannes and @Mytherin,

Are there plans to add spatial indices, such as an R-Tree? This would be a great complement to the spatial data types.

mmerdes avatar Nov 11 '22 15:11 mmerdes

Hi @handstuyennn,

Thank you for proposing this extremely useful feature. Is this still actively worked on?

@handstuyennn created a small Geography extension built with liblwgeom at https://github.com/handstuyennn/geo, but I'm also have a small prototype using duckdb native datatypes and functions, and have started working on a R-Tree index implementation. I think the r-tree will take more time but I hope to make the initial extension work and discussion public sometime in the the coming weeks.

Maxxen avatar Nov 11 '22 15:11 Maxxen

Hi @hannes and @Mytherin,

Are there plans to add spatial indices, such as an R-Tree? This would be a great complement to the spatial data types.

I think the plan is to eventually generalize the index mechanism and allow for indexes to be defined in extensions. But that will require more implementation and design work in core duckdb first. There is some related work here, and I think something similar to Postgres GIST/GIN interfaces could be cool to explore. Although the ART index is still undergoing lots of development so I imagine we probably want to let that cool off for a bit before doing any major refactors.

Maxxen avatar Nov 11 '22 15:11 Maxxen

Hi @handstuyennn, Thank you for proposing this extremely useful feature. Is this still actively worked on?

@handstuyennn created a small Geography extension built with liblwgeom at https://github.com/handstuyennn/geo, but I'm also have a small prototype using duckdb native datatypes and functions, and have started working on a R-Tree index implementation. I think the r-tree will take more time but I hope to make the initial extension work and discussion public sometime in the the coming weeks.

This is excellent news, @Maxxen! Thank you for the super-fast reply.

mmerdes avatar Nov 11 '22 15:11 mmerdes

Geography extension built with liblwgeom at handstuyennn/geo

As far as I can tell, the source for liblwgeom is here in postgis, which would appear to inherit postgis' GPL2 license. Given that duckdb is MIT-licensed, is that an issue? Would geo support based on liblwgeom be a GPL2-licensed "add-on"?

kylebarron avatar Nov 11 '22 16:11 kylebarron

I just want to highlight that GDAL can be used not just for computing, but also to open a wide variety of spatial resources, both local and remote, which would make it incredibly useful to use it with DuckDB:

  • https://gdal.org/drivers/vector/index.html
  • https://gdal.org/drivers/raster/index.html
  • https://gdal.org/user/virtual_file_systems.html

A possible workaround is to expose a vector dataset to Postgres using https://www.postgresql.org/docs/current/postgres-fdw.html and then connect DuckDB to it.

frafra avatar Feb 15 '23 14:02 frafra

I just want to highlight that GDAL can be used not just for computing, but also to open a wide variety of spatial resources, both local and remote, which would make it incredibly useful to use it with DuckDB:

  • https://gdal.org/drivers/vector/index.html
  • https://gdal.org/drivers/raster/index.html
  • https://gdal.org/user/virtual_file_systems.html

A possible workaround is to expose a vector dataset to Postgres using https://www.postgresql.org/docs/current/postgres-fdw.html and then connect DuckDB to it.

I've been working on exactly this. It works using GDALS new arrow interface, even got spatial extent pushdown and OGR SQL predicate pushdown to work, although no projection pushdown yet. But building and distributing GDAL statically in a portable way is kind of a pain due to the massive amount of transitive dependencies. The debug build is almost 500mb already, so we might want to split off the GDAL table function into a separate extension at some point. Im hoping to have something to share soon however.

Maxxen avatar Feb 15 '23 23:02 Maxxen

I just want to highlight that GDAL can be used not just for computing, but also to open a wide variety of spatial resources, both local and remote, which would make it incredibly useful to use it with DuckDB:

  • https://gdal.org/drivers/vector/index.html
  • https://gdal.org/drivers/raster/index.html
  • https://gdal.org/user/virtual_file_systems.html

A possible workaround is to expose a vector dataset to Postgres using https://www.postgresql.org/docs/current/postgres-fdw.html and then connect DuckDB to it.

I use parquet files converted files from gpkg/shp to parquet (with ogr) and it works. GDAL is quite hard to embed. But as extension for reading files, it could be nice, like the sqlite_scanner or. the postgres_scanner.

blackrez avatar Feb 16 '23 08:02 blackrez

Possibly dumb question: why is GDAL necessary? Doesn't GEOS provide the necessary spatial analysis functions, index structures etc., while being far smaller and easier to embed that GDAL?

ammojamo avatar Mar 13 '23 05:03 ammojamo

Yes, GEOS is for computations, but GDAL does IO and conversion between vector formats.

Maxxen avatar Mar 13 '23 07:03 Maxxen

Well, happy to say we've got something experimental to share on this end: https://github.com/duckdblabs/duckdb_spatial It's still very early in development so expect lots of functions missing, things to break and be bugs for a bit. That said I think we got a pretty good core to build upon further. I'm sharing this here because I assume everyone that's commented on this thread are particularly interested in DuckDB's geo story, but please don't go wild with this on twitter or whatever until it has gotten a chance to stabilise a bit and we can get a blog post out. Feel free to post issues, suggestions, discussions or contribute in any way though.

  • Only 2D planar geometries for now
  • GDAL based copy/scan table functions
  • statically links and bundles all dependencies, including an embedded PROJ database.
  • arena-allocated memory management
  • uses the reentrant geos and proj apis' (which I don't think lwgeom does by default?)
  • experimental specialized columnar geometry types
  • MIT license

Maxxen avatar Apr 05 '23 21:04 Maxxen

Oh wow, thats pretty sweet! congrats to all, cant wait to try it.

jatorre avatar Apr 05 '23 21:04 jatorre

Why not invest in the already existing extension geo? The work is quite advance and it is ready to use.

blackrez avatar Apr 06 '23 07:04 blackrez

Good question! I think there are multiple reasons that add up, but I think the two big ones are:

  • GPL is more complicated in the case of DuckDB compared to Postgres. I'm not a lawyer, but my interpretation is that since DuckDB runs "in process" it is always "part" of your program, so if you want to distribute your programming running DuckDB with a linked GPL extension you will have to license your program under GPL too. As GPL does not have a linking exception, (AFAIK) linking a GPL extension does not protect you from the GPL virality. This is in contrast to Postgres/gis where there is a MIT licensed client library and a wire protocol separating your code from the GPL. You could argue that this is not a problem in practice due to the "saas loophole", I.e that most people run DuckDB in their backend which they don't "distribute", or just use it locally for their own purposes, but this is not always the case, we're investing a lot in WASM and mobile (swift/java) - both platforms where you still distribute software to the end user.
  • While LWGEOM is a fantastic c library, I think that adapting it to get the best out of DuckDB fully is going to require a lot of changes, which again, the GPL requires you to properly track, clearly document and attribute. Particularly when it comes to memory management and thread safety. PostGIS does not have to deal with this in the same way as they don't use threads but instead spawn multiple worker processes. Last I checked LWGEOM doesn't use the reentrant versions of the proj and geos apis, but I'll admit I don't know how big the consequences of that is for DuckDB in practice.

That said, I do think its going to take a while until spatial catches up to geo feature wise (if it ever does, I'm not super keen on supporting more than the classic 7 geometry subtypes), and I am thinking about how you could use both extensions together. I think the only blockers is that they both define a GEOMETRY type which is going to cause a conflict in the catalog, but we might want to look into allowing aliased or namespaced extension imports in duckdb at some point.

Maxxen avatar Apr 06 '23 18:04 Maxxen