
"GeoParquet" and "GeoArrow"

Open kylebarron opened this issue 4 years ago • 4 comments

This is early and preliminary but I think a discussion is warranted.

As I understand it, Parquet is the de facto on-disk file format used with Arrow. I've been reading about new work in the Python geospatial ecosystem on "GeoParquet" files: an extension of the Parquet format to encode geometries and attributes. Here's a description of their proposed metadata format: https://github.com/geopandas/geopandas/pull/1180#issuecomment-548784939. (As background: GeoPandas is the main Python package for working with tabular geospatial data. Each row is a collection of attributes and one geometry.)

Given Arrow's position as a language-independent framework, I think there are potential gains to be had by engaging with other implementations to make sure that geospatial data are similarly broadly interpretable across languages.

Regarding Arrow, I haven't found any related projects on GitHub, but it could be of interest to try to develop a specification for how to store geographic data in an Arrow Table. For example, in the above GeoParquet specification, geometries are encoded as Well Known Binary (WKB) within each row. If stored similarly in Arrow, where each row is a collection of attributes and a condensed geometry as a ByteArray, that might be more memory-efficient but less immediately useful, since the geometries would have to be decoded before they could be uploaded to a GPU.
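To make that trade-off concrete, here is a minimal sketch using only the Python standard library. It assumes 2D point geometries only, and the helper names (`encode_wkb_point`, `wkb_points_to_flat_coords`) are hypothetical, not part of any library. Each WKB value carries its own byte-order flag and geometry type, so every row has to be decoded before the coordinates can sit in one contiguous, GPU-friendly buffer.

```python
import struct
from array import array


def encode_wkb_point(x, y):
    """Encode a 2D point as WKB: byte-order flag, geometry type, x, y."""
    # 1 = little endian, geometry type 1 = Point, coordinates as float64.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_points_to_flat_coords(wkb_column):
    """Decode a column of WKB points into one interleaved [x0, y0, x1, y1, ...] buffer."""
    coords = array("d")
    for wkb in wkb_column:
        # The first byte of each value says how the rest of it is laid out.
        endian = "<" if wkb[0] == 1 else ">"
        geom_type, x, y = struct.unpack_from(endian + "Idd", wkb, 1)
        if geom_type != 1:
            raise ValueError(f"only WKB points are handled here, got type {geom_type}")
        coords.extend((x, y))
    return coords


# A stand-in for a geometry column: one WKB byte string per row.
wkb_column = [encode_wkb_point(-122.4, 37.8), encode_wkb_point(2.35, 48.86)]
flat = wkb_points_to_flat_coords(wkb_column)
```

The per-row loop is the cost in question: the WKB column is compact and self-describing, while the flat buffer is what a renderer could consume directly.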

Regarding Parquet, it could be worth exploring in the future whether it would be beneficial, and feasible, to support it as a data format used in the browser. At first glance, it seems promising because of its small file sizes and fast reads, which could decrease network transfer sizes while still being fast to parse. I believe the small file sizes are often due in part to compression codecs such as Snappy, which aren't necessarily easy to access in the browser. There's still potential to compile one of the C++/Go/Rust Snappy libraries to WASM and use it.

Overall, this is just to start discussion and think about how Arrow is used elsewhere, so that work in loaders.gl can leverage other parts of the ecosystem in the future.

kylebarron avatar Apr 04 '20 20:04 kylebarron

Edit: it looks like the format discussion was moved here: https://github.com/geopandas/geopandas/pull/1191

kylebarron avatar Apr 04 '20 20:04 kylebarron

New repository in the geopandas org as a place for discussion on formats: https://github.com/geopandas/geo-arrow-spec

kylebarron avatar Apr 07 '20 21:04 kylebarron

@kylebarron this is an old issue, but what is the current status of this in loaders.gl?

  • I see Parquet support was added in https://github.com/visgl/loaders.gl/pull/2103, but is that just loading data from Parquet, or does it also handle actual GeoParquet files (reading the metadata, recognizing the geometry column, parsing the WKB, etc., so that a visualization library can work with the data)?

  • In https://github.com/opengeospatial/geoparquet/pull/32#issuecomment-1062272088, you mentioned:

    One of the goals of loaders.gl is to present a unified output format so that libraries like deck.gl can consume data from any source seamlessly. (For now, this means often parsing data to GeoJSON to pass to deck.gl, but ideally in the future loaders.gl would add GeoArrow as a choice of output format).

    Is there another issue tracking this idea? Or is there progress on this?

jorisvandenbossche avatar Jun 03 '22 11:06 jorisvandenbossche

Hey @jorisvandenbossche. Yes, #2103 was an initial implementation of full-buffer Parquet decoding. At this point, it preserves metadata from Parquet to the Arrow table (I think; I should double check this). But it doesn't do any geospatial-specific handling beyond that. loaders.gl also has a JS WKB parsing implementation, so at this point, a user would need to parse the Parquet file into an Arrow table and then iterate over the geometry column in that table to convert geometries to GeoJSON.
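As a rough illustration of that manual step, here is a sketch in plain Python rather than the loaders.gl API. It assumes the geometry column has already been pulled out as a list of WKB byte strings, handles only 2D Points and LineStrings, and uses hypothetical function names.

```python
import struct


def wkb_to_geojson_geometry(wkb):
    """Convert one WKB value (2D Point or LineString) to a GeoJSON geometry dict."""
    endian = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type == 1:  # Point: x, y
        x, y = struct.unpack_from(endian + "dd", wkb, 5)
        return {"type": "Point", "coordinates": [x, y]}
    if geom_type == 2:  # LineString: point count, then x, y pairs
        (n,) = struct.unpack_from(endian + "I", wkb, 5)
        flat = struct.unpack_from(f"{endian}{2 * n}d", wkb, 9)
        pairs = [[flat[i], flat[i + 1]] for i in range(0, 2 * n, 2)]
        return {"type": "LineString", "coordinates": pairs}
    raise ValueError(f"unsupported WKB geometry type {geom_type}")


def column_to_feature_collection(geometry_column, attribute_rows):
    """Pair each decoded geometry with its row attributes as a GeoJSON Feature."""
    features = [
        {"type": "Feature", "geometry": wkb_to_geojson_geometry(wkb), "properties": props}
        for wkb, props in zip(geometry_column, attribute_rows)
    ]
    return {"type": "FeatureCollection", "features": features}


# Stand-ins for a geometry column and the remaining attribute columns.
geometry_column = [
    struct.pack("<BIdd", 1, 1, 1.0, 2.0),
    struct.pack("<BII4d", 1, 2, 2, 0.0, 0.0, 3.0, 4.0),
]
attribute_rows = [{"name": "a"}, {"name": "b"}]
collection = column_to_feature_collection(geometry_column, attribute_rows)
```

A real implementation would also need polygons, multi-geometries, and Z/M coordinates, which this sketch leaves out.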

loaders.gl has the concept of "output formats", so that for specific loaders a user can choose what kind of output they desire. A next step for the GeoParquet loader would be to add a GeoJSON output format option, which would automatically convert to GeoJSON.

I don't think there's an issue tracking GeoArrow and deck.gl specifically. But it might work pretty much out of the box with deck.gl's existing binary support and with the native list-of-lists encoding.
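For readers unfamiliar with the list-of-lists idea, here is a minimal sketch of it in plain Python. It is an assumption about the general layout, not the GeoArrow specification or the deck.gl API, and the function names are hypothetical. All coordinates live in one flat buffer, and an offsets array records where each geometry starts, which is roughly the "flat values plus start indices" shape that binary rendering attributes tend to want.

```python
from array import array


def encode_linestrings(linestrings):
    """Flatten a list of linestrings into (coords, offsets).

    coords is interleaved [x0, y0, x1, y1, ...]; offsets[i] is the index of the
    first vertex of linestring i, and offsets[-1] is the total vertex count.
    """
    coords = array("d")
    offsets = array("i", [0])
    for line in linestrings:
        for x, y in line:
            coords.extend((x, y))
        offsets.append(offsets[-1] + len(line))
    return coords, offsets


def get_linestring(coords, offsets, i):
    """Recover linestring i without touching any other geometry."""
    start, end = offsets[i], offsets[i + 1]
    return [(coords[2 * j], coords[2 * j + 1]) for j in range(start, end)]


lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(5.0, 5.0), (6.0, 5.0), (6.0, 6.0)],
]
coords, offsets = encode_linestrings(lines)
second = get_linestring(coords, offsets, 1)
```

Unlike WKB, nothing here needs per-row parsing: the coordinate buffer can be handed over as-is, with the offsets marking the geometry boundaries.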

kylebarron avatar Jun 09 '22 17:06 kylebarron

Largely implemented.

ibgreen avatar Apr 02 '24 20:04 ibgreen