databend icon indicating copy to clipboard operation
databend copied to clipboard

Feature: A columnar Geometry/Geography format

Open forsaken628 opened this issue 1 year ago • 9 comments

Summary A columnar Geometry/Geography format is proposed here to provide support for the storage and computation of geospatial features. https://github.com/forsaken628/databend/blob/9e132699cc2238b8347ca7414a9e17a099d37de4/src/common/geobuf/src/lib.rs

This column format has five columns, namely the point column, which consists of the x column, the y column, and the point_offset column, and the rest of the information is serialized in a binary column, which consists of the data column and the offset column, see Column

Format compatibility

EWKB supports up to 4 dimensions, here only 2 dimensions are supported, in line with snowflake, and new types can be added to support higher dimensions. https://docs.snowflake.com/en/sql-reference/data-types-geospatial#geometry-data-type

See geo.fbs for details on other compatibility designs.

References WKT WKB EWKT EWKB GeoJSON Spec. https://libgeos.org/specifications/wkb/#standard-wkb https://datatracker.ietf.org/doc/html/rfc7946

Why the columnar Geometry/Geography format?

  • For smaller storage volumes, see the test below.
  • The problem of compression coding of floating-point columns has a large number of readily available research results that can be introduced at low cost, which is conducive to compression of storage space and improvement of io efficiency.
  • Floating-point columns have min max sparse indexes, very similar to the R tree indexes widely used in geospatial features, and are very cheap to implement and maintain.
  • Facilitates the implementation of filtered push-down to the storage layer, and vectorized computation.
  • On the downside, geospatial functions become significantly more expensive to implement and maintain, requiring a significant amount of work.

Why use flatbuffer?

  • flatbuffer provides delayed deserialization , and partial deserialization capabilities , to facilitate the writing of high-performance implementation .
  • In addition, code generation is easy to adapt to the design.
  • However, flatbuffer has its drawbacks. Similar to protobuffer, flatbuffer is an extensible, compatible format, and in order to provide extensibility and compatibility, a vtable is deposited in the data, resulting in an inflated size.
  • Once the design is stabilized, consider implementing it in other ways.

forsaken628 avatar Jul 28 '24 16:07 forsaken628

FYI: consider GeoParquet that Snowflake is also involved in.

ariesdevil avatar Jul 29 '24 10:07 ariesdevil

GeoParquet 1.1 support Native encoding Performance test result with GeoParquet/GeoPackage/FlatgeoBuf/GeoJson/Shapefile/GeoArrow

kkk25641463 avatar Jul 29 '24 23:07 kkk25641463

Compare with geoparquet

In postgis, the geography/geometry type supports spatial type modifier, the spatial type modifier restricts the kind of shapes and dimensions allowed in the column. Should we provide counterpart support?

For generalized geography/geometry with no spatial type specified, geoparquet requires encoding as WKB, which is consistent with the current databend geometry implementation.

For geography/geometry with a specified spatial type, use geoarrow's memory layout

The following compares the memory layout of the geoarrow with that of the geobuf

Coordinate (interleaved): FixedSizeList<double>[n_dim] Currently databend does not implement FixedSizeList, should we want to consider introducing it?

Coordinate (separated): Struct<x: double, y: double, [z: double, [m: double>]] Consistent memory layout

Point: Coordinate For generalization, use List<Coordinate> instead

MultiPoint: List<Coordinate> Consistent memory layout

LineString: List<Coordinate> Consistent memory layout

MultiLineString: List<List<Coordinate>> Encoded point_offsets into buf column

Polygon: List<List<Coordinate>> Encoded point_offsets into buf column

MultiPolygon: List<List<List<Coordinate>>> Encoded point_offsets and ring_offsets into buf column

forsaken628 avatar Jul 30 '24 03:07 forsaken628

FYI: Parquet community is working together with GeoParquet and Iceberg community to propose a new geometry logical type: https://github.com/apache/parquet-format/pull/240

wgtmac avatar Jul 30 '24 03:07 wgtmac

Compare with geoparquet

In postgis, the geography/geometry type supports spatial type modifier, the spatial type modifier restricts the kind of shapes and dimensions allowed in the column. Should we provide counterpart support?

For generalized geography/geometry with no spatial type specified, geoparquet requires encoding as WKB, which is consistent with the current databend geometry implementation.

For geography/geometry with a specified spatial type, use geoarrow's memory layout

The following compares the memory layout of the geoarrow with that of the geobuf

Coordinate (interleaved): FixedSizeList<double>[n_dim] Currently databend does not implement FixedSizeList, should we want to consider introducing it?

Coordinate (separated): Struct<x: double, y: double, [z: double, [m: double>]] Consistent memory layout

Point: Coordinate For generalization, use List<Coordinate> instead

MultiPoint: List<Coordinate> Consistent memory layout

LineString: List<Coordinate> Consistent memory layout

MultiLineString: List<List<Coordinate>> Encoded point_offsets into buf column

Polygon: List<List<Coordinate>> Encoded point_offsets into buf column

MultiPolygon: List<List<List<Coordinate>>> Encoded point_offsets and ring_offsets into buf column

I think subtypes is important, but that's a little complicated and aren't supported by community

kkk25641463 avatar Jul 30 '24 06:07 kkk25641463

FYI: Parquet community is working together with GeoParquet and Iceberg community to propose a new geometry logical type: apache/parquet-format#240

I'd still prefer something like geoarrow. https://github.com/opengeospatial/geoparquet/issues/222#issuecomment-2128298217

I noticed that geoarrow suggests using union type to solve the problem of mixed-types, but personally I think that union type is too wasteful and needs to be filled with a lot of zeros.

We could also use List<List<List<Coordinate>>> would that be better? Also GeometryCollection is still not unexpanded.

forsaken628 avatar Jul 30 '24 10:07 forsaken628

If this feature is for in-memory processing, not a file format, and if databend is otherwise using Arrow, then I would strongly recommend you incorporate GeoArrow, and not create your own custom flatbuffer encoding within an Arrow binary column. I've been working on a GeoArrow implementation in Rust for a couple years. It's not yet fully stable but it conforms to the GeoArrow spec and integrates with geo and geos for processing (and contributions are welcome).

kylebarron avatar Jul 30 '24 14:07 kylebarron

If we wants to insist on

  1. a columnar memory layout (providing possibilities for vectorized calculations)
  2. support for GeometryCollection (which is a recursive type)
  3. not using adding subtypes to shift the focus of work then the row-column hybrid memory layout should be the only option, otherwise we always have to drop one or the other.

How to put the extra offset is an implementation point that can be fine-tuned.

  • Putting it in the buf is the current implementation.
  • DenseUnion variant Struct<Offset,Offset,List<Coordinate>>.
  • SparseUnion variant List<List<List<Coordinate>>. But I don't think the benefits from implementing Union cover the complexity of implementing Union, unless Union has other uses.

forsaken628 avatar Jul 31 '24 03:07 forsaken628

Tracking:

  • [ ] initial implement geobuf format. #16179
  • [ ] initial support geography type. #16179
  • [ ] add tests to cover the main application scenarios for geospatial features.
  • [ ] determine the implementation details of geobuf format, including memory layout, sorting logic, comparison logic, serialization, etc.
  • [ ] implement GeometryStatistics.

forsaken628 avatar Aug 09 '24 07:08 forsaken628