cuspatial icon indicating copy to clipboard operation
cuspatial copied to clipboard

[BUG]: Conversion to geopandas (and thus repr) fails if there are missing values

Open jorisvandenbossche opened this issue 1 year ago • 10 comments

Version

23.06.00a+39.g86e8912a

On which installation method(s) does this occur?

Conda

Describe the issue

When you have a GeoSeries of points with coordinates that have missing values (null, not nan), the repr / conversion to geopandas fails.

Minimum reproducible example

Creating the following small GeoSeries of points:

gs = cuspatial.GeoSeries.from_points_xy(cudf.Series([0.1, 0.2, None, None]))

Actual coordinates stored under the hood:

>>> gs.points._col
<cudf.core.column.lists.ListColumn object at 0x7fc20bbfd040>
[
  [
    0.1,
    0.2
  ],
  [
    null,
    null
  ]
]
dtype: list

Trying to convert such a GeoSeries to geopandas fails:

>>> gs.to_geopandas()
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-51-32fb058c059d> in <cell line: 1>()
----> 1 gs.to_geopandas()

3 frames

/usr/local/lib/python3.9/site-packages/shapely/geometry/point.py in <listcomp>(.0)
     75             )
     76         if not np.issubdtype(coords.dtype, np.number):
---> 77             coords = [float(c) for c in coords]
     78         geom = shapely.points(coords)
     79         if not isinstance(geom, Point):

TypeError: float() argument must be a string or a number, not 'NoneType'

And since the repr calls that under the hood, also just displaying the GeoSeries fails.

Environment details

Installed from rapids-nightly conda channel on Colab

jorisvandenbossche avatar Apr 27 '23 08:04 jorisvandenbossche

I suppose "null" coordinate values are probably kind of disallowed? Although various algorithms (eg point in polygon related ones) seem to work fine with point input containing nulls.

Is it recommended to convert them to NaN instead? (or does that only impact the repr, since then the conversion to shapely/geopandas works by considering them as empty points)

>>> gs = cuspatial.GeoSeries.from_points_xy(cudf.Series([0.1, 0.2, None, None]).fillna(np.nan))
>>> gs
0    POINT (0.10000 0.20000)
1                POINT EMPTY
dtype: geometry

jorisvandenbossche avatar Apr 27 '23 08:04 jorisvandenbossche

Hey @jorisvandenbossche ! I was unaware that .to_geopandas failed with Series with None rows, in fact I'm pretty close to 100% sure it is a regression since I would have developed the feature using __repr__ in exactly this context. I'll try to repro tomorrow morning.

thomcom avatar Apr 27 '23 22:04 thomcom

However note that at the C++ level our algorithms do not currently support nulls.

harrism avatar Apr 27 '23 22:04 harrism

I think this is pretty straightforward. A cudf.Series is nullable and there's no guardrails against constructing coordinates with a nullable series. Can think of 2 ways:

  1. from_points_xy detects if input is nullable and throws
  2. Implicitly calls fillna and turns nulls into nans.

Leaning towards 1. Because only when both coordinates are nans can it be interpreted as a empty point, aka POINT(NONE, NONE) means POINT(nan nan), which is POINT EMPTY. In the case of POINT(NONE, 3.14), filling it to be nan could lead to unwanted result from the user's perspective.

isVoid avatar Apr 28 '23 01:04 isVoid

In a sense this issue is that shapely doesn't support null, coordinates must be a number as the error says. cuSpatial allows None to be specified at the row-level, or None or np.nan to be specified at the data-level:

a = cuspatial.GeoSeries([Point(0, 0), None])
b = cuspatial.GeoSeries.from_points_xy(cudf.Series([0.0, 1, None, None]))
c = cuspatial.GeoSeries.from_points_xy(cudf.Series([0.0, 1, np.nan, np.nan]))

Whether or not these values are properly supported by certain algorithms is an implementation detail. In this case the issue is only for value b which cannot be printed because Shapely does not permit a Point to be created containing null as per the error above.

While mixed results like POINT(NONE, 3.14) are likely to produce bad behavior downstream, support for null and nan are almost always requested.

thomcom avatar Apr 28 '23 13:04 thomcom

My inclination is to use fillna before serializing to Shapely. It is just a viewer-level I/O problem, and printing Point(nan, 3.14) is preferable from a user-perspective than throwing an I/O error.

cudf treats null and nan the same in the Series constructor, but will insert nan via .fillna:

cudf.Series([0.0, np.nan])
0     0.0
1    <NA>
dtype: float64

cudf.Series([0.0, None])
0     0.0
1    <NA>
dtype: float64

cudf.Series([0.0, np.nan]).fillna(np.nan)
0    0.0
1    NaN
dtype: float64

thomcom avatar Apr 28 '23 13:04 thomcom

cuSpatial allows None to be specified at the row-level, or None or np.nan to be specified at the data-level.

Thanks for the clarification, I didn't know this is by design. I feel like we hit a classic nan_as_null issue in cudf. If you construct the series with nan_as_null=False, you would get nans in the series instead.

Moreover, according to geoarrow spec: https://github.com/geoarrow/geoarrow/blob/main/format.md#missing-values-nulls

For this specification, only the outer level is allowed to have nulls, and all other levels (including the inner level with the actual coordinate values) should not contain any nulls. Those fields can be marked explicitly as non-nullable, but this is not required.

It seems like geoarrow prefer using nans to nulls to mark inner levels with invalid/missing values.

isVoid avatar Apr 28 '23 16:04 isVoid

Do we have a plan for this?

harrism avatar May 16 '23 20:05 harrism

Yes, we just need a fillna that works with ListColumn. I have a PR at cudf as linked above that might get added there, but we could just add it directly to cuspatial if you want to move it along. We'll have to wait until 23.06 of cudf before we can rely on any PRs to be merged there, so it is probably better to write our own fix simultaneously and update it to use the cudf method when that is merged.

thomcom avatar May 17 '23 17:05 thomcom

I think we should do it right (e.g. in cuDF -- unless it's needed in the cuSpatial C++ APIs). 23.06 burndown starts tomorrow for cuDF.

harrism avatar May 17 '23 23:05 harrism