Feature: A columnar Geometry/Geography format
Summary A columnar Geometry/Geography format is proposed here to provide support for the storage and computation of geospatial features. https://github.com/forsaken628/databend/blob/9e132699cc2238b8347ca7414a9e17a099d37de4/src/common/geobuf/src/lib.rs
This column format has five columns, namely the point column, which consists of the x column, the y column, and the point_offset column, and the rest of the information is serialized in a binary column, which consists of the data column and the offset column, see Column
Format compatibility
EWKB supports up to 4 dimensions, here only 2 dimensions are supported, in line with snowflake, and new types can be added to support higher dimensions. https://docs.snowflake.com/en/sql-reference/data-types-geospatial#geometry-data-type
See geo.fbs for details on other compatibility designs.
References WKT WKB EWKT EWKB GeoJSON Spec. https://libgeos.org/specifications/wkb/#standard-wkb https://datatracker.ietf.org/doc/html/rfc7946
Why the columnar Geometry/Geography format?
- For smaller storage volumes, see the test below.
- The problem of compression coding of floating-point columns has a large number of readily available research results that can be introduced at low cost, which is conducive to compression of storage space and improvement of io efficiency.
- Floating-point columns have min max sparse indexes, very similar to the R tree indexes widely used in geospatial features, and are very cheap to implement and maintain.
- Facilitates the implementation of filtered push-down to the storage layer, and vectorized computation.
- On the downside, geospatial functions become significantly more expensive to implement and maintain, requiring a significant amount of work.
Why use flatbuffer?
- flatbuffer provides delayed deserialization , and partial deserialization capabilities , to facilitate the writing of high-performance implementation .
- In addition, code generation is easy to adapt to the design.
- However, flatbuffer has its drawbacks. Similar to protobuffer, flatbuffer is an extensible, compatible format, and in order to provide extensibility and compatibility, a vtable is deposited in the data, resulting in an inflated size.
- Once the design is stabilized, consider implementing it in other ways.
FYI: consider GeoParquet that Snowflake is also involved in.
GeoParquet 1.1 support Native encoding Performance test result with GeoParquet/GeoPackage/FlatgeoBuf/GeoJson/Shapefile/GeoArrow
Compare with geoparquet
In postgis, the geography/geometry type supports spatial type modifier, the spatial type modifier restricts the kind of shapes and dimensions allowed in the column. Should we provide counterpart support?
For generalized geography/geometry with no spatial type specified, geoparquet requires encoding as WKB, which is consistent with the current databend geometry implementation.
For geography/geometry with a specified spatial type, use geoarrow's memory layout
The following compares the memory layout of the geoarrow with that of the geobuf
Coordinate (interleaved): FixedSizeList<double>[n_dim]
Currently databend does not implement FixedSizeList, should we want to consider introducing it?
Coordinate (separated): Struct<x: double, y: double, [z: double, [m: double>]]
Consistent memory layout
Point: Coordinate
For generalization, use List<Coordinate> instead
MultiPoint: List<Coordinate>
Consistent memory layout
LineString: List<Coordinate>
Consistent memory layout
MultiLineString: List<List<Coordinate>>
Encoded point_offsets into buf column
Polygon: List<List<Coordinate>>
Encoded point_offsets into buf column
MultiPolygon: List<List<List<Coordinate>>>
Encoded point_offsets and ring_offsets into buf column
FYI: Parquet community is working together with GeoParquet and Iceberg community to propose a new geometry logical type: https://github.com/apache/parquet-format/pull/240
Compare with geoparquet
In postgis, the geography/geometry type supports spatial type modifier, the spatial type modifier restricts the kind of shapes and dimensions allowed in the column. Should we provide counterpart support?
For generalized geography/geometry with no spatial type specified, geoparquet requires encoding as WKB, which is consistent with the current databend geometry implementation.
For geography/geometry with a specified spatial type, use geoarrow's memory layout
The following compares the memory layout of the geoarrow with that of the geobuf
Coordinate (interleaved):
FixedSizeList<double>[n_dim]Currently databend does not implement FixedSizeList, should we want to consider introducing it?Coordinate (separated):
Struct<x: double, y: double, [z: double, [m: double>]]Consistent memory layoutPoint:
CoordinateFor generalization, useList<Coordinate>insteadMultiPoint:
List<Coordinate>Consistent memory layoutLineString:
List<Coordinate>Consistent memory layoutMultiLineString:
List<List<Coordinate>>Encoded point_offsets into buf columnPolygon:
List<List<Coordinate>>Encoded point_offsets into buf columnMultiPolygon:
List<List<List<Coordinate>>>Encoded point_offsets and ring_offsets into buf column
I think subtypes is important, but that's a little complicated and aren't supported by community
FYI: Parquet community is working together with GeoParquet and Iceberg community to propose a new geometry logical type: apache/parquet-format#240
I'd still prefer something like geoarrow. https://github.com/opengeospatial/geoparquet/issues/222#issuecomment-2128298217
I noticed that geoarrow suggests using union type to solve the problem of mixed-types, but personally I think that union type is too wasteful and needs to be filled with a lot of zeros.
We could also use List<List<List<Coordinate>>> would that be better? Also GeometryCollection is still not unexpanded.
If this feature is for in-memory processing, not a file format, and if databend is otherwise using Arrow, then I would strongly recommend you incorporate GeoArrow, and not create your own custom flatbuffer encoding within an Arrow binary column. I've been working on a GeoArrow implementation in Rust for a couple years. It's not yet fully stable but it conforms to the GeoArrow spec and integrates with geo and geos for processing (and contributions are welcome).
If we wants to insist on
- a columnar memory layout (providing possibilities for vectorized calculations)
- support for GeometryCollection (which is a recursive type)
- not using adding subtypes to shift the focus of work then the row-column hybrid memory layout should be the only option, otherwise we always have to drop one or the other.
How to put the extra offset is an implementation point that can be fine-tuned.
- Putting it in the buf is the current implementation.
- DenseUnion variant
Struct<Offset,Offset,List<Coordinate>>. - SparseUnion variant
List<List<List<Coordinate>>. But I don't think the benefits from implementing Union cover the complexity of implementing Union, unless Union has other uses.
Tracking:
- [ ] initial implement geobuf format. #16179
- [ ] initial support geography type. #16179
- [ ] add tests to cover the main application scenarios for geospatial features.
- [ ] determine the implementation details of geobuf format, including memory layout, sorting logic, comparison logic, serialization, etc.
- [ ] implement GeometryStatistics.