cuspatial
cuspatial copied to clipboard
[BUG]: Conversion to geopandas (and thus repr) fails if there are missing values
Version
23.06.00a+39.g86e8912a
On which installation method(s) does this occur?
Conda
Describe the issue
When you have a GeoSeries of points with coordinates that have missing values (null, not nan), the repr / conversion to geopandas fails.
Minimum reproducible example
Creating the following small GeoSeries of points:
gs = cuspatial.GeoSeries.from_points_xy(cudf.Series([0.1, 0.2, None, None]))
Actual coordinates stored under the hood:
>>> gs.points._col
<cudf.core.column.lists.ListColumn object at 0x7fc20bbfd040>
[
[
0.1,
0.2
],
[
null,
null
]
]
dtype: list
Trying to convert such a GeoSeries to geopandas fails:
>>> gs.to_geopandas()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-51-32fb058c059d> in <cell line: 1>()
----> 1 gs.to_geopandas()
3 frames
/usr/local/lib/python3.9/site-packages/shapely/geometry/point.py in <listcomp>(.0)
75 )
76 if not np.issubdtype(coords.dtype, np.number):
---> 77 coords = [float(c) for c in coords]
78 geom = shapely.points(coords)
79 if not isinstance(geom, Point):
TypeError: float() argument must be a string or a number, not 'NoneType'
And since the repr calls that under the hood, also just displaying the GeoSeries fails.
Environment details
Installed from rapids-nightly conda channel on Colab
I suppose "null" coordinate values are probably kind of disallowed? Although various algorithms (eg point in polygon related ones) seem to work fine with point input containing nulls.
Is it recommended to convert them to NaN instead? (or does that only impact the repr, since then the conversion to shapely/geopandas works by considering them as empty points)
>>> gs = cuspatial.GeoSeries.from_points_xy(cudf.Series([0.1, 0.2, None, None]).fillna(np.nan))
>>> gs
0 POINT (0.10000 0.20000)
1 POINT EMPTY
dtype: geometry
Hey @jorisvandenbossche ! I was unaware that .to_geopandas
failed with Series with None
rows, in fact I'm pretty close to 100% sure it is a regression since I would have developed the feature using __repr__
in exactly this context. I'll try to repro tomorrow morning.
However note that at the C++ level our algorithms do not currently support nulls.
I think this is pretty straightforward. A cudf.Series
is nullable and there's no guardrails against constructing coordinates with a nullable series. Can think of 2 ways:
- from_points_xy detects if input is nullable and throws
- Implicitly calls
fillna
and turns nulls into nans.
Leaning towards 1. Because only when both coordinates are nans can it be interpreted as a empty point, aka POINT(NONE, NONE)
means POINT(nan nan)
, which is POINT EMPTY
. In the case of POINT(NONE, 3.14)
, filling it to be nan
could lead to unwanted result from the user's perspective.
In a sense this issue is that shapely doesn't support null
, coordinates must be a number as the error says. cuSpatial allows None
to be specified at the row-level, or None or np.nan to be specified at the data-level:
a = cuspatial.GeoSeries([Point(0, 0), None])
b = cuspatial.GeoSeries.from_points_xy(cudf.Series([0.0, 1, None, None]))
c = cuspatial.GeoSeries.from_points_xy(cudf.Series([0.0, 1, np.nan, np.nan]))
Whether or not these values are properly supported by certain algorithms is an implementation detail. In this case the issue is only for value b
which cannot be printed because Shapely does not permit a Point
to be created containing null as per the error above.
While mixed results like POINT(NONE, 3.14)
are likely to produce bad behavior downstream, support for null
and nan
are almost always requested.
My inclination is to use fillna
before serializing to Shapely. It is just a viewer-level I/O problem, and printing Point(nan, 3.14)
is preferable from a user-perspective than throwing an I/O error.
cudf
treats null
and nan
the same in the Series constructor, but will insert nan
via .fillna
:
cudf.Series([0.0, np.nan])
0 0.0
1 <NA>
dtype: float64
cudf.Series([0.0, None])
0 0.0
1 <NA>
dtype: float64
cudf.Series([0.0, np.nan]).fillna(np.nan)
0 0.0
1 NaN
dtype: float64
cuSpatial allows None to be specified at the row-level, or None or np.nan to be specified at the data-level.
Thanks for the clarification, I didn't know this is by design. I feel like we hit a classic nan_as_null
issue in cudf. If you construct the series with nan_as_null=False
, you would get nans in the series instead.
Moreover, according to geoarrow spec: https://github.com/geoarrow/geoarrow/blob/main/format.md#missing-values-nulls
For this specification, only the outer level is allowed to have nulls, and all other levels (including the inner level with the actual coordinate values) should not contain any nulls. Those fields can be marked explicitly as non-nullable, but this is not required.
It seems like geoarrow prefer using nans to nulls to mark inner levels with invalid/missing values.
Do we have a plan for this?
Yes, we just need a fillna that works with ListColumn
. I have a PR at cudf
as linked above that might get added there, but we could just add it directly to cuspatial if you want to move it along. We'll have to wait until 23.06 of cudf before we can rely on any PRs to be merged there, so it is probably better to write our own fix simultaneously and update it to use the cudf method when that is merged.
I think we should do it right (e.g. in cuDF -- unless it's needed in the cuSpatial C++ APIs). 23.06 burndown starts tomorrow for cuDF.