[FEA] GeoDataFrame I/O

Open thomcom opened this issue 4 years ago • 12 comments

Is your feature request related to a problem? Please describe.

To support Pythonic use cases, we should provide rows, points, rpos, fpos = from_geodataframe(gpdf) and to_geodataframe(rows, points, rpos, fpos).

Describe the solution you'd like

Data-parallel cuSpatial algorithms should allow easy conversion to and from GeoPandas GeoDataFrames, such as:

import cudf
import geopandas
import numpy as np
import cuspatial
url = "http://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_land.geojson"
df = geopandas.read_file(url)
# Proposed API: convert the first 32 polygons into cuSpatial's columnar layout.
poly_rows, poly_points, poly_rpos, poly_fpos = cuspatial.from_geodataframe(df.iloc[0:32, :])
points = cudf.DataFrame({'x': np.random.random(1000), 'y': np.random.random(1000)})
pip_result = cuspatial.point_in_polygon_bitmap(points, poly_points, poly_rpos, poly_fpos)

Preferably, all of our notebooks and demos should be built using data that can be loaded in one line via GeoPandas and then converted into cuSpatial format.

Describe alternatives you've considered

Requests at the Fiona/Shapely level for geometry conversions have also been made. Supporting GeoPandas GeoDataFrames will implicitly support Fiona/Shapely, as GeoPandas depends on them. Also, supporting GeoDataFrame I/O will handle larger batch jobs than working at the Fiona/Shapely level, which typically deals with one geometry at a time.

It is also important to consider C++-based I/O support for GeoDataFrames. It isn't clear how to do this yet, but GeoDataFrames are also backed by GDAL/OGR low-level objects, so we should be able to write low-level C++/CUDA routines for better I/O performance. Eventually.

Additional context

This relates to #91 and #110.

thomcom • Apr 01 '20 16:04

This would be a great feature. I was just trying to use point-in-polygon with cuSpatial. The data is already read in GeoPandas, as this is more or less the standard way in Python now. But I can't find a way to use cuspatial.point_in_polygon_bitmap with a Point GeoDataFrame and a Polygon GeoDataFrame.

shakurgds • May 02 '20 17:05

So, right now there's a workaround for getting the polygons into cuSpatial format: geodataframe.to_file('tempfile.shp'); cuspatial_polygons = cuspatial.read_polygon_shapefile('tempfile.shp').

You'll need to do a similar workaround for the points, but I'm not immediately coming up with the right function to do it. Let me think about it.

thomcom • May 02 '20 21:05

> This would be a great feature. I was just trying to use point-in-polygon with cuSpatial. The data is already read in GeoPandas, as this is more or less the standard way in Python now. But I can't find a way to use cuspatial.point_in_polygon_bitmap with a Point GeoDataFrame and a Polygon GeoDataFrame.

@thomcom I second this. I tried following cuSpatial and still want to use it because of the speedups I'll get in groupby ops and spatial joins. But I could never figure out how to get this to work. I temporarily went to pygeos for the spatial join but would like to run my full analytic in cuSpatial.

And yes, I would like to help; but I don't know cpp. Moreover, I'd end up having more questions than answers. For example, I can't even figure out how to pass in these arguments from point_in_polygon:

polygon_end_indices: the (n+1)th vertex of the final coordinate of each
                         polygon in the next parameters
    polygons_x: x closed coordinates of all polygon points
    polygons_y: y closed coordinates of all polygon points
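
For readers with the same question, here is a minimal sketch of one way to build those three arguments from plain Python rings. It assumes that each entry of polygon_end_indices is the index one past the last vertex of that polygon in the flattened coordinate arrays, and that "closed" means the last vertex repeats the first. Both are readings of the docstring above, not confirmed behaviour, so check them against your cuSpatial version. flatten_polygons is a hypothetical helper, not part of cuSpatial.

```python
# Hypothetical helper (not a cuSpatial function): flatten rings into the three
# arguments named in the docstring. Assumes end indices are cumulative,
# one-past-the-last-vertex positions in the flattened coordinate lists.
def flatten_polygons(polygons):
    """`polygons` is a list of rings; each ring is a list of (x, y) tuples."""
    end_indices, xs, ys = [], [], []
    for ring in polygons:
        if ring[0] != ring[-1]:
            ring = ring + [ring[0]]  # close the ring by repeating the first vertex
        for x, y in ring:
            xs.append(x)
            ys.append(y)
        end_indices.append(len(xs))  # one past the last vertex of this polygon
    return end_indices, xs, ys


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
triangle = [(2.0, 2.0), (3.0, 2.0), (2.5, 3.0)]  # left open on purpose
polygon_end_indices, polygons_x, polygons_y = flatten_polygons([square, triangle])
```

With a GeoDataFrame, the rings could come from each geometry's exterior coordinates; the resulting lists would then be wrapped in whatever column type the cuSpatial call expects.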

linwoodc3 • May 03 '20 01:05

I may move this post into a different issue or split it across several. It documents my ideas on accelerated I/O: leveraging GeoPandas to load the many existing file formats and then moving their data to the GPU. I'll be editing this post frequently.

cuSpatial I/O

GeoPandas: a proper DataFrame in-memory storage format for geometry and metadata. Easy to convert from host to device memory. It can read most formats, but the data must go into host memory before it can be copied to the device.

Data storage and I/O format are crucial to software interoperability.

Copy from host to device in parallel, copy from storage to host slowly.

Links to GeoPandas discussion https://github.com/geopandas/geo-arrow-spec/issues

cuDF list type

cuDF is adding the List type in 0.15 and/or 0.16. This will not necessarily include Python bindings. If it were fully supported, it might be a shorter path to GeoPandas support than writing our own format. At present, I think we should press on with a custom format.

cuSpatial I/O Basics

cuSpatial depends on data. We are positioned to enable the fastest GIS algorithms in the world, but first the data has to go where it is wanted. Apache Arrow has become one of the most popular columnar memory formats. This document explains how cuSpatial represents Arrow columns as geospatial data objects.

  1. Start with using existing libraries to get objects into memory, then write parallel kernels to transfer the native host objects to GPU in cuSpatial format.
  2. As necessary, write native parallel I/O to read objects from source formats directly to GPU.

cuSpatial supports Point, Line, Polygon, MultiPoint, MultiLine, MultiPolygon, and GeometryCollection geometry objects. All "singular" objects are Multi objects with a single member. The same data structure is used for singular or multi objects.

Interleaved coordinates

A geometry object's dimensionality is specified in the dimensions Arrow file attribute. 2D and 3D coordinates, specified with "dimensions": <2 or 3>, are the most common case. Instead of using one structure to store each dimension, coordinates are interleaved in a single array: [x0, y0, x1, y1, ..., xn, yn]. This storage format will usually improve GPU performance.
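
The interleaved layout described above can be sketched in a few lines of plain Python. The helper names interleave and deinterleave are made up for illustration, and lists stand in for device arrays; this only shows the indexing, not a GPU implementation.

```python
# Illustrative only: plain lists stand in for cuDF/device columns.
def interleave(xs, ys):
    """Pack separate x and y lists into [x0, y0, x1, y1, ...]."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    values = [0.0] * (2 * len(xs))
    values[0::2] = xs  # even slots hold x
    values[1::2] = ys  # odd slots hold y
    return values


def deinterleave(values, dimensions=2):
    """Split an interleaved list back into one list per dimension."""
    return [values[d::dimensions] for d in range(dimensions)]


values = interleave([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
xs, ys = deinterleave(values, dimensions=2)
```

Coordinate i of a 2D column then lives at values[2 * i] and values[2 * i + 1], so a thread working on one point reads adjacent memory.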

I/O

  1. Write a cuSpatial wrapper for geopandas.read_file
  2. Find size of target file
  3. Batch if size is large
  4. Use geopandas.read_file to read batch
  5. Compute sum size of all objects of each geometry type in geopandas DataFrame
  6. Allocate cuDF columns for geometry objects
  7. Use numpy_array geometry object pointers to call CUDA kernel to move data from host to device in parallel
  8. Fill object metadata about sizes
  9. Create cuGeoDataFrame with cuDF columns for metadata, host object cuGeometry columns for geometries
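
Steps 5 through 8 above can be sketched on the host as follows. This is a rough outline under stated assumptions, not the planned implementation: features are represented as (geometry_type, coordinate_list) pairs, plain Python lists stand in for cuDF columns, and the list slice assignment stands in for the parallel host-to-device copy. plan_buffers and pack are hypothetical names.

```python
from collections import defaultdict


def plan_buffers(features):
    """Step 5: total number of coordinates per geometry type."""
    totals = defaultdict(int)
    for geom_type, coords in features:
        totals[geom_type] += len(coords)
    return dict(totals)


def pack(features):
    """Steps 6-8: allocate one buffer per type, copy coordinates, record sizes."""
    totals = plan_buffers(features)
    buffers = {t: [None] * n for t, n in totals.items()}  # step 6: allocate
    cursor = {t: 0 for t in totals}
    sizes = {t: [] for t in totals}
    for geom_type, coords in features:
        start = cursor[geom_type]
        buffers[geom_type][start:start + len(coords)] = coords  # step 7: copy
        cursor[geom_type] = start + len(coords)
        sizes[geom_type].append(len(coords))  # step 8: size metadata
    return buffers, sizes


features = [
    ("Point", [(0, 0)]),
    ("LineString", [(0, 0), (1, 1), (2, 0)]),
    ("Point", [(5, 5)]),
    ("LineString", [(3, 3), (4, 4)]),
]
buffers, sizes = pack(features)
```

Computing the totals first means each buffer is allocated once at its final size, which is the reason for separating step 5 from step 6.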

Files can contain any ordering of geometry objects. They are loaded into sequential buffers in GPU memory. Any number of Point and MultiPoint objects can be read from a file. These Points will all be stored in contiguous memory in a cuDF column.

All n-D point coordinates, regardless of their geometry object source, are stored in the same column. Geometry objects in cuSpatial provide offsets to the coordinates, but do not store the coordinates themselves.

Metadata

cuDF columns are used to store coordinates and geometry offsets. cuDF is also used to store geometry object metadata. There are two types of metadata we are concerned with: Arrow format metadata specifying coordinate fundamentals inside of cuSpatial columns, and DataFrame metadata that is included with geometry objects but is not geometric coordinates.

Arrow Spatial Format

Introduction: Arrow, columnar format, interleaved data for performance. Objects grouped by type. API provides introspection into columns.

Points = {
  "size": 3,
  "dtype": int32,
  "offsets": [0, 1, 2, 3],
  "values_x": [x, x, x],
  "values_y": [y, y, y],
  "values_z": [z, z, z],
}

Points = {
  "size": 3,
  "dtype": int64,
  "stride": 2,
  "sizes": [2, 2, 2],
  "offsets": [2, 4, 6],
  "values": [x,y,x,y,x,y],
}

Lines = {
  "size": 2,
  "stride": n,
  "sizes": [n, m],
  "offsets": [n, n+m],
  "values": [x0,y0,x1,y1,...,xn,yn,x0,y0,x1,y1,...,xm,ym],
}
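
As a sanity check on the sizes/offsets/values layout, here is a small sketch that reads one geometry back out of such a column. It assumes that "sizes" counts scalar values (two per 2D point) and that "offsets" holds the cumulative end position of each geometry in "values", which is how the second Points example reads; the draft does not state this explicitly. geometry_at is a hypothetical helper.

```python
# Assumption: offsets[i] is the end position of geometry i in "values" and
# sizes[i] is its length in scalar values, so the start is offsets[i] - sizes[i].
def geometry_at(column, i):
    """Return geometry i of a 2D column as a list of (x, y) tuples."""
    end = column["offsets"][i]
    start = end - column["sizes"][i]
    flat = column["values"][start:end]
    return list(zip(flat[0::2], flat[1::2]))


lines = {
    "size": 2,
    "sizes": [6, 4],
    "offsets": [6, 10],
    "values": [0.0, 0.0, 1.0, 1.0, 2.0, 0.0, 5.0, 5.0, 6.0, 6.0],
}
first = geometry_at(lines, 0)
second = geometry_at(lines, 1)
```

Because each geometry is just a (start, end) window into one shared values buffer, geometries of different lengths can sit back to back without padding.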

Polygons = {
  "feature_size": 2,
  "ring_size": [1, 1],
  "ring_offsets": [n, n+m],
  "rings": [0, 1, ..., n, n+1, n+2, ..., m],
  "values": [x0,y0,x1,y1,...,xn,yn,x0,y0,x1,y1,...,xm,ym]
}

GeometryCollection = {
  // this is not an arrow type, but a cuSpatial type. A list of other types
}

thomcom • Jul 02 '20 21:07

Is there a way to add Spatial References to this information?

achapkowski • Oct 22 '20 17:10

Hey @achapkowski spatial references are very high in the next priority list. I'm inclined to include a coordinate reference system object with every GeoSeries, or two.

thomcom • Jan 15 '21 16:01

you probably can just add another dictionary value:

Points = {
  "size": 3,
  "dtype": int64,
  "stride": 2,
  "sizes": [2, 2, 2],
  "offsets": [2, 4, 6],
  "values": [x,y,x,y,x,y],
  "spatialReference": {"wkid": 4326}
}

or if you allow WKTs

Points = {
  "size": 3,
  "dtype": int64,
  "stride": 2,
  "sizes": [2, 2, 2],
  "offsets": [2, 4, 6],
  "values": [x,y,x,y,x,y],
  "spatialReference": {"wkt": "<WKT string>"}
}

achapkowski • Jan 15 '21 18:01

> Hey @achapkowski spatial references are very high in the next priority list. I'm inclined to include a coordinate reference system object with every GeoSeries, or two.

This would be of very high value--particularly if/when we can do on-GPU coordinate conversions between at least the most common CRSs.

ChrisMLroy • Jan 15 '21 19:01

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions[bot] • Feb 16 '21 20:02

Responding to the github-actions bot: this is still relevant and is the missing link preventing a customer from building an all-GPU geospatial workflow in RAPIDS.

ChrisMLroy • Feb 16 '21 20:02

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] • Mar 18 '21 21:03

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] • Nov 23 '21 20:11