
Missing geometries

Open etiennebr opened this issue 4 years ago • 10 comments

I think handling missing data across engines might be a challenge. I see four levels of missingness:

  • Geometry is missing. The location is unknown, but there could be other observations attached.
  • Coordinates are missing. For example, while tracking a vehicle, the GPS was blocked for a few seconds.
  • One or more coordinate is missing. For example, while tracking a vehicle the GPS was blocked, but the zm coordinates are from another sensor and are valid (e.g. altimeter and clock).
  • Empty geometry. The intersection of two disjoint polygons, for example.

To my knowledge, only missing geometries and empty geometries are generally supported (maybe MovingPandas supports missing coordinates?).

From geopandas:

Empty geometries are actual geometry objects but that have no coordinates (and thus also no area, for example). They can, for example, originate from taking the intersection of two polygons that have no overlap. The scalar object (when accessing a single element of a GeoSeries) is still a Shapely geometry object. Missing geometries are unknown values in a GeoSeries. They will typically be propagated in operations (for example in calculations of the area or of the intersection), or ignored in reductions such as unary_union. The scalar object (when accessing a single element of a GeoSeries) is the Python None object.

In sf we don't have such a clear definition, I believe mostly because the OGC standard isn't clear (that could be one reason to depart from WKB). Most of the time sf uses EMPTY, except for points, which support NA.

etiennebr avatar Apr 18 '20 12:04 etiennebr

From the point of view of the storage (Arrow memory layout, Parquet file format), missingness in general is certainly supported, I think, and several missingness levels can also be possible in principle.

The first level "geometry is missing" is clear I think: this is represented in Arrow/Parquet as a null value in the column/array. So for the spec, IMO that should be the way to have missing geometries (it's still up to implementations to support reading/writing such missing values, of course. Would that be a problem for sf?).

The other levels of missingness are, however, dependent on the exact binary encoding of the geometry, I think (xref https://github.com/geopandas/geo-arrow-spec/issues/3).

With the initial spec using WKB, you have either a missing geometry or a valid WKB value. So then it is up to WKB to determine what is supported. From what I know: missing coordinates might be represented by NaN coordinates? Empty geometries are supported, except for empty points (which e.g. PostGIS works around by using Point(nan, nan) as the empty point).
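To make the NaN workaround concrete, here is a minimal sketch using only Python's standard library. It assumes plain 2D little-endian or big-endian WKB with no SRID or Z/M flags; the function names (`point_to_wkb`, `empty_point_wkb`, `is_empty_point`) are made up for illustration and are not part of any library mentioned in this thread.

```python
import math
import struct

WKB_POINT = 1  # geometry type code for a 2D Point in WKB


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB: byte order, type, x, y."""
    return struct.pack("<BIdd", 1, WKB_POINT, float(x), float(y))


def empty_point_wkb():
    """Encode POINT EMPTY using the Point(NaN, NaN) convention."""
    return point_to_wkb(math.nan, math.nan)


def is_empty_point(wkb):
    """Return True if the WKB is a 2D point whose coordinates are all NaN."""
    # The first byte is the byte-order flag: 1 = little-endian, 0 = big-endian.
    fmt = "<Idd" if wkb[0] == 1 else ">Idd"
    geom_type, x, y = struct.unpack_from(fmt, wkb, 1)
    return geom_type == WKB_POINT and math.isnan(x) and math.isnan(y)
```

Note that a reader using this convention cannot tell an empty point apart from a point whose coordinates are both genuinely unknown, which is the ambiguity discussed in this thread.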

Other binary encodings, like the interleaved x,y coordinates or nested list-array, should in principle (based on the Arrow ListArray layout) also support both missing coordinates and empty geometries.

jorisvandenbossche avatar Apr 18 '20 13:04 jorisvandenbossche

I can confirm that, at this point, null geometries would be a problem to round-trip because they're converted to EMPTY in sf. That's the current behaviour when we read from PostGIS.

st_read(con, query = "select null::geometry")  # postgis returns null
#' [...]
#'                   geometry
#' 1 GEOMETRYCOLLECTION EMPTY
st_read(con, query = "select st_point('NaN', 'NaN');")  # postgis returns POINT(NaN, NaN)
#' [...]
#'      st_point
#' 1 POINT EMPTY

Same thing for missing coordinates:

st_point(c(NA_real_, NA_real_))
#' POINT EMPTY
st_point(c(NA_real_, NA_real_, NA_real_))
#' POINT Z EMPTY
st_point(c(NA_real_, NA_real_, NA_real_, NA_real_))
#' POINT ZM EMPTY
st_point(c(1, 2, NA_real_, NA_real_))
#' POINT ZM EMPTY

If I recall correctly, most of these constraints come from GEOS and GDAL/OGR. How does geopandas handle them?

Paging @edzer since he might have additional comments.

etiennebr avatar Apr 18 '20 19:04 etiennebr

I agree that semantically there is a difference between

  • an empty geometry, which is an empty point set (there is no point for which...)
  • a missing geometry, there is a geometry but we don't know anything about it (R's NA - not available)

and that R package sf, following the simple feature access specification, only has the first. For the second type, you would want that measures like st_area or st_length return NA, and not 0 as they rightly do for empty geometries.
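A small sketch of the semantics described above, in plain Python: `None` stands in for a missing geometry (R's NA) and an empty list for an empty polygon. The `area` and `ring_area` helpers are hypothetical and only model a polygon as a single closed exterior ring.

```python
def ring_area(ring):
    """Shoelace area of a closed ring [(x, y), ...]; 0.0 for an empty ring."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def area(polygon):
    """Propagate missingness, but treat an empty geometry as a real value."""
    if polygon is None:
        # Missing geometry: the area is unknown, not zero.
        return None
    # Empty geometry: an empty point set, so the area is exactly 0.
    return ring_area(polygon)
```

With this convention, `area(None)` stays unknown while `area([])` is a legitimate zero, which is the distinction sf cannot currently express.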

I recall another discussion about NULL geometries (I forgot where): I'm not sure whether NULL geometries refer to the case where a geometry exists but is missing, or whether they are language artefacts (pointers in C++ can always be NULL, as can fields in a DBMS). Do length and area of NULLs return NAs in PostGIS?

Since they implement simple features, I believe GDAL or GEOS don't support missing geometries.

edzer avatar Apr 19 '20 08:04 edzer

The POINT EMPTY geometry has, afaict, no WKB representation in the standard (nobody is perfect). Different environments have different solutions: sf, like PostGIS, chose a point with NA (or NaN) values; GEOS does something different (leading to this kludge).

edzer avatar Apr 19 '20 08:04 edzer

Revisiting this because I ran into these issues while prototyping the Arrow nested list representation of geometry. Most of this has already been put into words in #12, but we do have to formalize this pretty soon and this thread already has good discussion.

From the original query:

Geometry is missing. The location is unknown, but there could be other observations attached.

The proposed memory layout allows for this: the outermost Arrow Array can be nullable and can encode missing values as NULLs.

Coordinates are missing. For example, while tracking a vehicle, the GPS was blocked for a few seconds.

This is interesting because the proposed memory layout can in theory encode nullability at every level of the nesting which isn't true for something like WKB. It results in some code complexity because checking the validity buffer for a NULL at every child element has a cost (although in practice it has almost no cost for arrays that have no NULLs because the validity buffer is set to NULL or the null_count is set to 0). I think the wording in #12 is that child arrays should not be nullable, but it's fairly cheap to assert that all elements of an array are non-null and readers should be checking that anyway (or they may get uninitialized values in buffers).
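As a rough illustration of the check described above, the sketch below models an Arrow-style validity bitmap in plain Python (one bit per element, least-significant bit first, 1 meaning valid). It is not pyarrow code; `pack_validity`, `null_count`, and `assert_no_nulls` are hypothetical helper names.

```python
def pack_validity(values):
    """Pack one is-valid bit per element into bytes, least-significant bit first."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def null_count(bitmap, length):
    """Count the elements whose validity bit is 0."""
    return sum(1 for i in range(length) if ((bitmap[i // 8] >> (i % 8)) & 1) == 0)


def assert_no_nulls(bitmap, length):
    """Raise if a child array contains nulls; an absent bitmap means all valid."""
    if bitmap is None:
        # No validity buffer at all: nothing to scan, so the check is free.
        return
    if null_count(bitmap, length) > 0:
        raise ValueError("child arrays must not contain null elements")
```

A reader could run `assert_no_nulls` once per child array on import and then skip per-element validity checks while iterating over coordinates.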

One or more coordinate is missing. For example, while tracking a vehicle the GPS was blocked, but the zm coordinates are from another sensor and are valid (e.g. altimeter and clock).

Theoretically the innermost child array (a big flat buffer of doubles) containing coordinate values can also be nullable and have null elements, but I think here NaN would be unambiguous (except for points, where filling every dimension with NaN might be interpreted as an empty point).

Empty geometry. The intersection of two disjoint polygons, for example.

For all geometry types except a POINT, we're representing them as a variable-length list, so we can represent an empty geometry as a list with zero child elements. For POINT, we're limited to a fixed size of memory and so we have the same limitations as WKB. Filling every dimension with NaN seems like the least bad solution and is more in line with existing WKB in the wild. Other solutions could be:

  • Add struct(bool, fixed_size_list_of(double)) to the supported list of storage types for geoarrow.point (where the bool indicates emptiness). This is more or less what GeoPackage does (adds a small header to the WKB where the emptiness of points can be asserted). This could be added at a later date if the NaN thing becomes problematic.
  • Promote any Array where an empty point is encountered to a MULTIPOINT, which can handle EMPTY. This is undesirable because it becomes impossible to predict the output type of an operation without actually doing the operation (makes it harder but not impossible to preallocate data structures).
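The variable-length list encoding of EMPTY mentioned above can be sketched in plain Python as an offsets list plus a flat list of coordinates, mirroring the shape of an Arrow list layout without using Arrow itself. The names `build_linestrings` and `is_empty` are illustrative only.

```python
def build_linestrings(geoms):
    """Flatten a list of linestrings into (offsets, coords).

    Geometry i owns coords[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    coords = []
    for geom in geoms:
        coords.extend(geom)
        offsets.append(len(coords))
    return offsets, coords


def is_empty(offsets, i):
    """A geometry is EMPTY when its slice of the coordinate list has length 0."""
    return offsets[i] == offsets[i + 1]


offsets, coords = build_linestrings([
    [(0.0, 0.0), (1.0, 1.0)],
    [],  # LINESTRING EMPTY
    [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)],
])
```

A point has no offsets to collapse because it always occupies a fixed number of doubles, which is why the empty point needs a separate convention such as all-NaN coordinates.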

paleolimbot avatar Feb 19 '22 18:02 paleolimbot

Theoretically the innermost child array (a big flat buffer of doubles) containing coordinate values can also be nullable and have null elements, but I think here NaN would be unambiguous (except for points, where filling every dimension with NaN might be interpreted as an empty point).

Yes, I would personally strongly argue to not allow missing / NULL coordinates (so "missing" in Arrow/Parquet's meaning, i.e. a NULL value, which in Arrow is tracked in the validity bitmap). If you really want to use missing coordinates in a certain application, you can indeed use NaN for this if needed (and then it is your responsibility as the user of that data to handle those NaNs). As @paleolimbot mentions, the only corner case is that Point(NaN, NaN) will probably also be used to represent an empty point. Personally I think that's an acceptable trade-off to keep the spec simpler.

I would also prefer not allowing parts/rings to be nullable (so any of the intermediate levels, apart from the top level), which is what I currently wrote in https://github.com/geopandas/geo-arrow-spec/pull/12. Are there use cases for missing/null parts or rings (that cannot be represented using the NaN coordinates mentioned above)?

jorisvandenbossche avatar Feb 22 '22 17:02 jorisvandenbossche

I think it's reasonable that NULL values should not be allowed except at the outermost level (i.e., the whole geometry can be null but no rings or coordinates). When I was prototyping the implementation, I found that setting the "nullable" flag for child arrays introduced some complexity and so perhaps a good wording is that there should be no null values (as opposed to the schema itself being nullable).

If such a use-case does come up, a user can use the same memory layout as an intermediary data structure (but must solve those problems before exporting or storing an array as a "geoarrow.something" extension array).

paleolimbot avatar Feb 22 '22 18:02 paleolimbot

Another fun and related problem I ran across while implementing a point builder: a user tries to convert something like c("POINT (0 1)", "POINT Z (0 1 2)") to an Arrow geometry. This doesn't happen all that often, but when it does, it's really annoying for the user to work around if we error.

For argument's sake, I'm probably going to fill the missing dimensions with NaN (i.e., as if the user has provided c("POINT Z (0 1 NaN)", "POINT Z (0 1 2)"))...a more complicated solution would be to use a Union as the points array. Technically the implementation I'm working on can handle that but I think it would wreak all sorts of havoc for almost no benefit.
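The NaN-filling idea can be sketched as follows. This is a plain-Python illustration of the approach under consideration, not the code in the geoarrow package; `pad_to_xyz` is a hypothetical name, and points are modelled as tuples of 2 or 3 floats.

```python
import math


def pad_to_xyz(points):
    """Promote a mix of XY and XYZ points to XYZ, filling a missing Z with NaN."""
    padded = []
    for point in points:
        if len(point) not in (2, 3):
            raise ValueError("expected XY or XYZ points")
        # (0, 1) becomes (0, 1, NaN); (0, 1, 2) is kept unchanged.
        padded.append(tuple(point) + (math.nan,) * (3 - len(point)))
    return padded
```

The result has a single, predictable dimension, at the cost of readers having to treat a NaN Z value as "not provided".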

paleolimbot avatar Mar 31 '22 12:03 paleolimbot

I would advise against even trying that: it's a recipe for creating problems further down. The simple feature standard doesn't address the question of which properties geometries in a set should meaningfully share, but I believe CRS and dimension are the main two, if not type as well: GEOMETRY (a set of geometries of mixed type) and GEOMETRYCOLLECTION (a single geometry consisting of mixed types) are both rabbit holes IMO.

edzer avatar Mar 31 '22 13:03 edzer

That's a good point. I'll see if anybody complains about the error message before reconsidering ( https://github.com/paleolimbot/geoarrow/pull/8/files#diff-102b5c8c0ec12ebe5bdf068851ee034a08a387ed09c85f2722403584a4fadb7fR39-R42 )

paleolimbot avatar Mar 31 '22 13:03 paleolimbot

I think our current spec handles nullability via Arrow's built-in handling: the outermost array can contain nulls (but any inner levels cannot). A user can declare a field non-nullable if the concept of a null geometry does not match what a producer needs to represent. Feel free to open an issue if I missed any unresolved discussion!

paleolimbot avatar Sep 26 '23 16:09 paleolimbot