cuspatial
[FEA] GeoDataFrame I/O
Is your feature request related to a problem? Please describe.
In order to support pythonic use-cases, we should provide rows, points, rpos, fpos = from_geodataframe(gpdf) and to_geodataframe(rows, points, rpos, fpos).
Describe the solution you'd like
Data-parallel cuspatial algorithms should allow easy conversion to and from geopandas GeoDataFrames, such as:
import cudf
import cuspatial
import geopandas
import numpy as np
url = "http://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_land.geojson"
df = geopandas.read_file(url)
# from_geodataframe is the API proposed in this issue; it does not exist yet
poly_rows, poly_points, poly_rpos, poly_fpos = cuspatial.from_geodataframe(df.iloc[0:32,:])
points = cudf.DataFrame({'x': np.random.random(1000), 'y': np.random.random(1000)})
pip_result = cuspatial.point_in_polygon_bitmap(points, poly_points, poly_rpos, poly_fpos)
Preferably all of our notebooks and demos should be built using data that can be loaded in one line via Geopandas, then converted into cuspatial format.
Describe alternatives you've considered
Requests at the Fiona/Shapely level for geometry conversions have also been made. Supporting geopandas GeoDataFrames will implicitly support Fiona/Shapely, as geopandas depends on them. Also, supporting GeoDataFrame I/O will handle larger batch jobs than working at the Fiona/Shapely level, which typically deals with one geometry at a time.
It is also important to consider C++ based I/O support for GeoDataFrames. It isn't clear how to do this yet, but GeoDataFrames are also backed by GDAL/OGR low-level objects, so we should eventually be able to write low-level C++/CUDA routines for better I/O performance.
Additional context
This relates to #91 and #110.
This would be a great feature. I was just trying to use point-in-polygon with cuSpatial. The data is already read in with Geopandas, as this is more or less the standard way in Python now. But I can't find a way to call cuspatial.point_in_polygon_bitmap with a Point GeoDataFrame and a Polygon GeoDataFrame.
So, right now there's a workaround for getting the polygons into cuspatial format: geodataframe.to_file('tempfile.shp'); cuspatial_polygons = cuspatial.read_polygon_shapefile('tempfile.shp').
You'll need to do a similar workaround for the points, but I'm not immediately coming up with the right function to do it. Let me think about it.
@thomcom I second this. I tried following cuspatial and still want to use it because of the speedups I'll get in groupby ops and spatial joins. But I could never figure out how to get this to work. I temporarily went to pygeos for the spatial join but would like to run my full analytic in cuspatial.
And yes, I would like to help; but I don't know cpp. Moreover, I'd end up having more questions than answers. For example, I can't even figure out how to pass in these arguments from point_in_polygon:
polygon_end_indices: the (n+1)th vertex of the final coordinate of each
polygon in the next parameters
polygons_x: x closed coordinates of all polygon points
polygons_y: y closed coordinates of all polygon points
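For what it's worth, here is one way to read those three parameters, as a minimal plain-Python sketch rather than anything taken from cuSpatial itself. The helper name flatten_polygons is made up for illustration, and it assumes that "closed" means each ring repeats its first vertex at the end and that each end index points one past the last vertex of its polygon. Check the docstring of your installed version before relying on either assumption.

```python
def flatten_polygons(polygons):
    """Flatten single-ring polygons into closed x/y lists plus end indices.

    `polygons` is a list of rings, each a list of (x, y) tuples.
    Hypothetical helper for illustration; not part of cuSpatial.
    """
    xs, ys, end_indices = [], [], []
    for ring in polygons:
        ring = list(ring)
        # Close the ring if the caller left it open.
        if ring[0] != ring[-1]:
            ring.append(ring[0])
        for x, y in ring:
            xs.append(x)
            ys.append(y)
        # Assumed convention: index one past this polygon's last vertex.
        end_indices.append(len(xs))
    return end_indices, xs, ys


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]      # open ring
triangle = [(2.0, 0.0), (3.0, 0.0), (2.5, 1.0), (2.0, 0.0)]    # already closed
end_indices, xs, ys = flatten_polygons([square, triangle])
```

With these inputs the square gains a fifth closing vertex, so the two polygons occupy positions 0-4 and 5-8 of the flattened lists.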
I may move this post into a different or multiple separate issues. This documents my ideas involving accelerated I/O, leveraging geoPandas for loading many existing file formats and moving their data to the GPU. I'll be editing this post frequently.
cuSpatial I/O
GeoPandas : Proper DataFrame in-memory storage format for geometry and meta. Easily convert from host to device memory. Can read most formats, but needs to go into host memory before it can be copied to device.
Data storage and I/O format are crucial to software interoperability.
Copy from host to device in parallel, copy from storage to host slowly.
Links to GeoPandas discussion https://github.com/geopandas/geo-arrow-spec/issues
cuDF list type
cuDF is adding the List type in 0.15 and/or 0.16. This will not necessarily include Python bindings. If it were fully supported, it might be a shorter path to GeoPandas support than writing our own format. At present, I think we should press on with a custom format.
cuSpatial I/O Basics
cuSpatial depends on data. We are positioned to enable the fastest GIS algorithms in the world, but first the data has to go where it is wanted. Apache Arrow has become one of the most popular columnar memory formats. This document explains how cuSpatial represents Arrow columns as geospatial data objects.
- Start with using existing libraries to get objects into memory, then write parallel kernels to transfer the native host objects to GPU in cuSpatial format.
- As necessary, write native parallel I/O to read objects from source formats directly to GPU.
cuSpatial supports Point, Line, Polygon, MultiPoint, MultiLine, MultiPolygon, and GeometryCollection geometry objects. All "singular" objects are Multi objects with a single member. The same data structure is used for singular or multi objects.
Interleaved coordinates
A geometry object's dimensionality is specified in the dimensions Arrow file attribute. 2D and 3D coordinates, specified with "dimensions": <2 or 3>, are the most common case. Instead of using a separate structure for each dimension, coordinates are interleaved in a single array: [x0, y0, x1, y1, ... xn, yn]. This storage format will usually improve GPU performance.
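As a small host-side illustration of the interleaved layout, here is a sketch using only the Python standard library. The function names interleave and deinterleave are hypothetical, and nothing here touches the GPU; it only shows how separate per-dimension columns map to and from the [x0, y0, x1, y1, ...] layout.

```python
from array import array


def interleave(xs, ys):
    """Pack separate x and y sequences into one [x0, y0, x1, y1, ...] array."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    out = array("d", [0.0]) * (2 * len(xs))
    out[0::2] = array("d", xs)   # even slots hold x
    out[1::2] = array("d", ys)   # odd slots hold y
    return out


def deinterleave(values, dimensions=2):
    """Split an interleaved array back into one array per dimension."""
    return [values[d::dimensions] for d in range(dimensions)]


packed = interleave([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
xs, ys = deinterleave(packed)
```

The same strided slicing works for 3D data by passing dimensions=3 to deinterleave.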
I/O
- Write a cuSpatial wrapper for geopandas.read_file
- Find size of target file
- Batch if size is large
- Use geopandas.read_file to read batch
- Compute sum size of all objects of each geometry type in geopandas DataFrame
- Allocate cuDF columns for geometry objects
- Use numpy_array geometry object pointers to call CUDA kernel to move data from host to device in parallel
- Fill object metadata about sizes
- Create cuGeoDataFrame with cuDF columns for metadata, host object cuGeometry columns for geometries
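The steps above could be prototyped on the host roughly as follows. This is only a sketch under several assumptions: features are GeoJSON-like dicts rather than real GeoDataFrame rows, only Point and LineString are handled, standard-library arrays stand in for cuDF columns, and the names batches, count_coordinates and load_batch are invented for illustration. The host-to-device copy itself is not shown.

```python
from array import array
from itertools import islice


def batches(features, batch_size):
    """Yield successive lists of at most batch_size features."""
    it = iter(features)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def count_coordinates(features):
    """Sum the number of coordinate values needed per geometry type."""
    totals = {}
    for feature in features:
        geom = feature["geometry"]
        n_points = 1 if geom["type"] == "Point" else len(geom["coordinates"])
        totals[geom["type"]] = totals.get(geom["type"], 0) + 2 * n_points
    return totals


def load_batch(features):
    """Preallocate one interleaved buffer per geometry type, then fill it."""
    totals = count_coordinates(features)
    buffers = {t: array("d", [0.0]) * n for t, n in totals.items()}
    offsets = {t: [0] for t in totals}   # object metadata about sizes
    cursor = {t: 0 for t in totals}
    for feature in features:
        geom = feature["geometry"]
        t = geom["type"]
        pts = [geom["coordinates"]] if t == "Point" else geom["coordinates"]
        for x, y in pts:
            i = cursor[t]
            buffers[t][i] = x
            buffers[t][i + 1] = y
            cursor[t] = i + 2
        offsets[t].append(cursor[t])
    return buffers, offsets


features = [
    {"geometry": {"type": "Point", "coordinates": (1.0, 2.0)}},
    {"geometry": {"type": "LineString",
                  "coordinates": [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]}},
    {"geometry": {"type": "Point", "coordinates": (3.0, 4.0)}},
]
buffers, offsets = load_batch(features)
```

Counting first and filling second means every buffer is allocated exactly once, which is the property that matters when the buffers become device allocations.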
Files can contain any ordering of geometry objects. They are loaded into sequential buffers in GPU memory. Any number of Point and MultiPoint objects can be read from a file. These Points will all be stored in contiguous memory in a cuDF column.
All n-D point coordinates, regardless of their geometry object source, are stored in the same column. Geometry objects in cuSpatial provide offsets to the coordinates, but do not store the coordinates themselves.
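To make the offsets idea concrete, here is a minimal sketch of reading one geometry back out of a shared 2D interleaved coordinate column. It assumes an offsets list with a leading 0, so that geometry i spans coords[offsets[i]:offsets[i + 1]]; the examples later in this post list only end offsets, so treat this convention and the name geometry_points as assumptions.

```python
# Shared interleaved column holding a 3-point line followed by a single point.
coords = [0.0, 0.0, 1.0, 1.0, 2.0, 0.0, 5.0, 5.0]
# Assumed convention: leading 0, then one end offset per geometry.
offsets = [0, 6, 8]


def geometry_points(coords, offsets, i):
    """Return geometry i as a list of (x, y) tuples without copying the column."""
    start, stop = offsets[i], offsets[i + 1]
    flat = coords[start:stop]
    return list(zip(flat[0::2], flat[1::2]))


line = geometry_points(coords, offsets, 0)
point = geometry_points(coords, offsets, 1)
```

The geometry object itself only needs the pair (start, stop); the coordinates stay in the shared column.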
Metadata
cuDF columns are used to store coordinates and geometry offsets. cuDF is also used to store geometry object metadata. There are two types of metadata we are concerned with - Arrow format metadata specifying coordinate fundamentals inside of cuSpatial columns, and DataFrame metadata that is included with geometry objects but is not geometric coordinates.
Arrow Spatial Format
Introduction: Arrow, columnar format, interleaved data for performance
Objects grouped by type
API provides introspection into columns
Points = {
"size": 3,
"dtype": int32,
"offsets": [0, 1, 2, 3],
"values_x": [x, x, x],
"values_y": [y, y, y],
"values_z": [z, z, z],
}
Points = {
"size": 3,
"dtype": int64,
"stride": 2,
"sizes": [2, 2, 2],
"offsets": [2, 4, 6],
"values": [x,y,x,y,x,y],
}
Lines = {
"size": 2,
"stride": n,
"sizes": [n, m],
"offsets": [n, n+m],
"values": [x0,y0,x1,y1,...,xn,yn,x0,y0,x1,y1,...,xm,ym],
}
Polygons = {
"feature_size": 2,
"ring_size": [1, 1],
"ring_offsets": [n, n+m],
"rings": [0, 1, ..., n, n+1, n+2, ... m],
"values": [x0,y0,x1,y1,...,xn,yn,x0,y0,x1,y1,...,xm,ym]
}
GeometryCollection = {
// this is not an arrow type, but a cuSpatial type. A list of other types
}
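Here is a rough sketch of how a reader might walk the Polygons layout, going from a feature to its rings to its coordinates. The feature_offsets list is my own addition (the draft above only gives per-feature ring counts), and both offset lists assume a leading 0, so take this as one possible interpretation rather than the final format.

```python
def polygon_rings(values, ring_offsets, feature_offsets, i):
    """Return the rings of polygon i as lists of (x, y) tuples.

    values          : interleaved coordinates of every ring, back to back
    ring_offsets    : ring r spans values[ring_offsets[r]:ring_offsets[r + 1]]
    feature_offsets : polygon i owns rings feature_offsets[i]..feature_offsets[i + 1] - 1
                      (hypothetical field, not in the draft format above)
    """
    rings = []
    for r in range(feature_offsets[i], feature_offsets[i + 1]):
        flat = values[ring_offsets[r]:ring_offsets[r + 1]]
        rings.append(list(zip(flat[0::2], flat[1::2])))
    return rings


values = [
    0, 0, 4, 0, 4, 4, 0, 4, 0, 0,   # polygon 0, exterior square (5 vertices)
    1, 1, 2, 1, 1, 2, 1, 1,         # polygon 0, triangular hole (4 vertices)
    5, 0, 6, 0, 5, 1, 5, 0,         # polygon 1, exterior triangle (4 vertices)
]
ring_offsets = [0, 10, 18, 26]
feature_offsets = [0, 2, 3]

first = polygon_rings(values, ring_offsets, feature_offsets, 0)
second = polygon_rings(values, ring_offsets, feature_offsets, 1)
```

Two levels of offsets are enough for polygons with holes; a MultiPolygon would need a third level on top.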
Is there a way to add Spatial References to this information?
Hey @achapkowski spatial references are very high in the next priority list. I'm inclined to include a coordinate reference system object with every GeoSeries, or two.
you probably can just add another dictionary value:
Points = {
"size": 3,
"dtype": int64,
"stride": 2,
"sizes": [2, 2, 2],
"offsets": [2, 4, 6],
"values": [x,y,x,y,x,y],
"spatialReference" : {'wkid' : <integer>:4326}
}
or if you allow WKTs
Points = {
"size": 3,
"dtype": int64,
"stride": 2,
"sizes": [2, 2, 2],
"offsets": [2, 4, 6],
"values": [x,y,x,y,x,y],
"spatialReference" : {'wkid' : <string><WKT string>}
}
This would be of very high value--particularly if/when we can do on-GPU coordinate conversions between at least the most common CRSs.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
Responding to the github-actions bot: this is still relevant and is the missing link preventing a customer from making an all-GPU geospatial workflow in RAPIDS.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.