loaders.gl
"GeoParquet" and "GeoArrow"
This is early and preliminary but I think a discussion is warranted.
As I understand it, Parquet is the de-facto on-disk file format used with Arrow. I've been reading about new work in the Python geospatial ecosystem on "GeoParquet" files: an extension of the Parquet format to encode geometries and attributes. Here's a description of their proposed metadata format: https://github.com/geopandas/geopandas/pull/1180#issuecomment-548784939. (As background: GeoPandas is the main package in Python working with tabular geospatial data. Each row is a collection of attributes and one geometry.)
Given Arrow's position as a language-independent framework, I think there are potential gains to be had by engaging with other implementations to make sure that geospatial data are similarly broadly interpretable across languages.
Regarding Arrow, I haven't found any related projects on GitHub, but it could be of interest to try to develop a specification for how to store geographic data in an Arrow Table. For example, in the above GeoParquet specification, geometries are encoded as Well Known Binary within each row. If geometries were stored the same way in Arrow, with each row holding its attributes plus a compact geometry as a ByteArray, that might be more memory-efficient, but it would be less immediately useful because the column couldn't be uploaded to a GPU without first being decoded.
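To make the trade-off concrete, here is a minimal sketch of what a consumer has to do with a WKB-per-row column before it has a contiguous coordinate buffer. It only handles 2D points, and the helper names (`point_to_wkb`, `wkb_points_to_flat`) are hypothetical, not part of Arrow, GeoPandas, or loaders.gl.

```python
import struct
from array import array


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB: byte order, geometry type 1, x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_points_to_flat(column: list) -> array:
    """Decode a column of WKB points into one interleaved [x0, y0, x1, y1, ...] buffer."""
    flat = array("d")
    for blob in column:
        # The first byte of every WKB geometry declares its byte order.
        byte_order = "<" if blob[0] == 1 else ">"
        geom_type, x, y = struct.unpack_from(byte_order + "Idd", blob, 1)
        if geom_type != 1:
            raise ValueError(f"expected a WKB Point, got geometry type {geom_type}")
        flat.extend((x, y))
    return flat


# One opaque byte string per row, as in the WKB-per-row layout described above.
wkb_column = [point_to_wkb(-122.4, 37.8), point_to_wkb(2.35, 48.86)]

# Only after this per-row pass do we have a single contiguous buffer.
flat = wkb_points_to_flat(wkb_column)
```

Each WKB point here carries a 5-byte header on top of its 16 bytes of coordinates, and every row has to be visited before anything can be handed to a renderer. A layout that stores the coordinates contiguously in the first place would skip that pass.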
Regarding Parquet, it could be worth exploring whether it is feasible and beneficial to use as a data format in the browser. At first glance, it seems promising because of its small file sizes and fast reads, which could reduce network transfer sizes while still being fast to parse. I believe much of the size reduction comes from compression codecs such as Snappy, which aren't necessarily easy to access in the browser. One option would be to compile one of the C++/Go/Rust Snappy libraries to WASM and use it.
Overall, this is just to start discussion and think about how Arrow is used elsewhere, so that work in loaders.gl can leverage other parts of the ecosystem in the future.
Edit: it looks like the format discussion was moved here: https://github.com/geopandas/geopandas/pull/1191
New repository in the geopandas org as a place for discussion on formats: https://github.com/geopandas/geo-arrow-spec
@kylebarron this is an old issue, but what is the current status of this in loaders.gl?
- I see Parquet support was added in https://github.com/visgl/loaders.gl/pull/2103, but does that just load data from Parquet, or does it also handle actual GeoParquet files (reading the metadata, recognizing the geometry column, parsing the WKB, etc., so that a visualization library can work with the data)?
- In https://github.com/opengeospatial/geoparquet/pull/32#issuecomment-1062272088, you mentioned:
One of the goals of loaders.gl is to present a unified output format so that libraries like deck.gl can consume data from any source seamlessly. (For now, this means often parsing data to GeoJSON to pass to deck.gl, but ideally in the future loaders.gl would add GeoArrow as a choice of output format).
Is there another issue tracking this idea? Or is there progress on this?
Hey @jorisvandenbossche. Yes, #2103 was an initial implementation of full-buffer Parquet decoding. At this point, it preserves metadata from Parquet to the Arrow table (I think; I should double check this). But it doesn't do any geospatial-specific handling beyond that. loaders.gl also has a JS WKB parsing implementation, so at this point, a user would need to parse the Parquet file into an Arrow table and iterate over the column in the Arrow table to convert geometries to GeoJSON.
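As a rough illustration of the manual steps described above (read the metadata, find the geometry column, decode each WKB value to GeoJSON), here is a small stdlib-only Python sketch. It is not the loaders.gl implementation. It assumes the file-level metadata is a bytes-to-bytes mapping with a `geo` key holding JSON, as the GeoParquet spec describes, and it only handles 2D Point, LineString, and Polygon. The function names are hypothetical.

```python
import json
import struct


def geometry_columns(file_metadata: dict) -> list:
    """Return the names of WKB-encoded geometry columns listed under the 'geo' key."""
    geo = json.loads(file_metadata[b"geo"])
    return [name for name, meta in geo["columns"].items() if meta.get("encoding") == "WKB"]


def wkb_to_geojson(blob: bytes) -> dict:
    """Convert one 2D WKB Point, LineString, or Polygon to a GeoJSON geometry dict."""
    order = "<" if blob[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(order + "I", blob, 1)
    offset = 5  # 1 byte-order byte + 4 geometry-type bytes

    def read_coords(start):
        # A coordinate sequence is a uint32 count followed by that many (x, y) doubles.
        (n,) = struct.unpack_from(order + "I", blob, start)
        values = struct.unpack_from(f"{order}{2 * n}d", blob, start + 4)
        coords = [[values[i], values[i + 1]] for i in range(0, 2 * n, 2)]
        return coords, start + 4 + 16 * n

    if geom_type == 1:
        x, y = struct.unpack_from(order + "dd", blob, offset)
        return {"type": "Point", "coordinates": [x, y]}
    if geom_type == 2:
        coords, _ = read_coords(offset)
        return {"type": "LineString", "coordinates": coords}
    if geom_type == 3:
        (n_rings,) = struct.unpack_from(order + "I", blob, offset)
        offset += 4
        rings = []
        for _ in range(n_rings):
            ring, offset = read_coords(offset)
            rings.append(ring)
        return {"type": "Polygon", "coordinates": rings}
    raise ValueError(f"unsupported WKB geometry type {geom_type}")


# Made-up metadata and a made-up two-vertex line, standing in for a real file.
metadata = {
    b"geo": json.dumps(
        {"version": "1.0.0", "primary_column": "geometry",
         "columns": {"geometry": {"encoding": "WKB", "geometry_types": ["LineString"]}}}
    ).encode()
}
line_wkb = struct.pack("<BII4d", 1, 2, 2, 0.0, 0.0, 1.0, 2.0)

feature = {"type": "Feature", "properties": {"name": "a"}, "geometry": wkb_to_geojson(line_wkb)}
```

A real loader would also need to handle multi-part geometries, Z/M coordinates, and the CRS information in the metadata, all of which this sketch leaves out.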
loaders.gl has the concept of "output formats", so that for specific loaders a user can choose what kind of output they desire. A next step for the geoparquet loader would be to add a geojson output format option, which would automatically convert to GeoJSON.
I don't think there's an issue tracking GeoArrow and deck.gl specifically. But it might work pretty much out of the box with deck.gl's existing binary support and with the native list-of-lists encoding.
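To illustrate why the list-of-lists encoding mentioned above maps well onto binary attributes, here is a minimal Python sketch of the layout: one flat coordinate buffer plus Arrow-style list offsets, where geometry `i` spans vertices `offsets[i]` to `offsets[i + 1]`. The `flatten_linestrings` helper is hypothetical, and the key names in `binary_data` are an assumption modeled on deck.gl's binary data convention that should be checked against its documentation.

```python
from array import array


def flatten_linestrings(lines):
    """Flatten nested linestrings into (float32 coords, per-line vertex offsets)."""
    coords = array("f")
    offsets = array("I", [0])  # Arrow-style offsets: one more entry than there are lines
    for line in lines:
        for x, y in line:
            coords.extend((x, y))
        offsets.append(len(coords) // 2)  # running vertex count marks where the next line starts
    return coords, offsets


lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.5), (4.0, 4.0)],
]
coords, offsets = flatten_linestrings(lines)

# Assumed shape of a binary data object; the key names are not verified here.
binary_data = {
    "length": len(lines),
    "startIndices": offsets,
    "attributes": {"getPath": {"value": coords, "size": 2}},
}
```

Because the coordinates are already contiguous, no per-row decoding step is needed before handing the buffer to a renderer, in contrast to a WKB column.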
largely implemented.