pyogrio icon indicating copy to clipboard operation
pyogrio copied to clipboard

How to handle NaNs in the input when writing? (as NaN or as Null)

Open jorisvandenbossche opened this issue 2 years ago • 6 comments

See https://github.com/geopandas/pyogrio/pull/118/files/530b5d4c68a3380d2a43caa3a9fe45b3ca717aa3..cb28a2e379874bf1f5614b309527bbf24236ea96#r890345534 where this came up, copying the relevant comment here:

np.nan values in an OFTReal (float) column are automatically treated correctly (as null) by GDAL (OGR_F_SetFieldDouble), so no need to explicitly call OGR_F_SetFieldNull for OFTReal columns. Only for object columns this is needed.

Actually, that's not fully correct I think. They are written as NaN and not as Null. Since numpy/pandas only support NaN in float arrays, that's not an issue for correct rountrip for geopandas->gdal->geopandas, though.

Small illustrations:

# write file with GDAL + pyogrio
import geopandas
import pyogrio
gdf = geopandas.GeoDataFrame({'col': [0.1, np.nan]}, geometry=geopandas.points_from_xy([0, 1], [0, 1]))
pyogrio.write_dataframe(gdf, "test_nulls_pyogrio.arrow", driver="Arrow")

# write file with pyarrow that includes both NaN and Null (Arrow distinguishes both)
import pyarrow as pa
from pyarrow import feather
feather.write_feather(pa.table({"col": [0.1, np.nan, None]}), "test_nulls_pyarrow.arrow")

And check both using GDAL's ogrinfo:

(gdal-dev) $ ogrinfo test_nulls_pyogrio.arrow -al
INFO: Open of `test_nulls_pyogrio.arrow'
      using driver `Arrow' successful.

Layer name: test_nulls_pyogrio
Geometry: Point
Feature Count: 2
Extent: (0.000000, 0.000000) - (1.000000, 1.000000)
Layer SRS WKT:
(unknown)
Geometry Column = geometry
col: Real (0.0)
OGRFeature(test_nulls_pyogrio):0
  col (Real) = 0.1
  POINT (0 0)

OGRFeature(test_nulls_pyogrio):1
  col (Real) = nan
  POINT (1 1)

(gdal-dev) $ ogrinfo test_nulls_pyarrow.arrow -al
INFO: Open of `test_nulls_pyarrow.arrow'
      using driver `Arrow' successful.

Layer name: test_nulls_pyarrow
Geometry: None
Feature Count: 3
Layer SRS WKT:
(unknown)
col: Real (0.0)
OGRFeature(test_nulls_pyarrow):0
  col (Real) = 0.1

OGRFeature(test_nulls_pyarrow):1
  col (Real) = nan

OGRFeature(test_nulls_pyarrow):2
  col (Real) = (null)

jorisvandenbossche avatar Jun 07 '22 13:06 jorisvandenbossche

So when we are writing float columns that contain NaNs, we have two options: write those as NaN or as Null.

Since the input data uses NaN, writing them as NaN (as we do currently for float data) might look the most correct. But semantically, NaNs in pandas are used to indicate missing data, and thus might be more correctly represented as Null.

As reference, in pyarrow (which distinguishes NaN and Null in float data, similar to GDAL), the conversion layer will actually convert NaNs to Nulls when the input comes from pandas, but preserve the NaNs when the data comes from numpy / pure python list.

jorisvandenbossche avatar Jun 07 '22 13:06 jorisvandenbossche

I'm not yet clear on what we should be doing here. Given that we are writing from numpy arrays to GDAL, we don't have the ability to represent both np.nan and None (Null) as distinct values. So we either write all np.nan or set all np.nan values to Null, but can't do some mix like Arrow, right?

For float fields, we convert Null values back to np.nan, so I think in either case we can roundtrip data and the caller would see them as the same values. The only thing that would be different is if they are stored as NaN or Null in the underlying dataset consumed by potentially other tools.

brendan-ward avatar Jun 07 '22 16:06 brendan-ward

It would indeed be either all NaN or all Null, not a mix of them (which makes sense, because also pandas and numpy cannot represent a mix and so can't pass that to pyogrio)

The only thing that would be different is if they are stored as NaN or Null in the underlying dataset consumed by potentially other tools.

Yes, I think this should be the main consideration (as the roundtrip case for pyogrio/geopandas itself works fine either way). Other libraries using GDAL to read data might expect Nulls instead of NaNs for missing data (no idea how eg QGIS handles this, though)

jorisvandenbossche avatar Jun 07 '22 18:06 jorisvandenbossche

Sidenote: this also relates to potentially handling the experimental nullable dtypes of pandas more natively in pyogrio. For example the nullable integer data is represented with a data array + mask array, and in theory we could pass down those mask to ogr_write as well and use that to check whether to use OGR_F_SetFieldNull

jorisvandenbossche avatar Jun 07 '22 18:06 jorisvandenbossche

At least GDAL makes an effort to deal properly with nan vs NULL. Because different filetypes treat them differently, the GDAL drivers do so as well. Some examples:

  • in GeoJson nan is not supported according to the standard and so GDAL doesn't write nan values. FYI, using a layer creation option you can change this behaviour (https://gdal.org/drivers/vector/geojson.html#layer-creation-options).
  • in geopackage nan values are saved as NULL

theroggy avatar Jun 07 '22 20:06 theroggy

It would indeed be either all NaN or all Null, not a mix of them (which makes sense, because also pandas and numpy cannot represent a mix and so can't pass that to pyogrio)

The only thing that would be different is if they are stored as NaN or Null in the underlying dataset consumed by potentially other tools.

Yes, I think this should be the main consideration (as the roundtrip case for pyogrio/geopandas itself works fine either way). Other libraries using GDAL to read data might expect Nulls instead of NaNs for missing data (no idea how eg QGIS handles this, though)

To be complete, it is possible to combine None and nan in one numpy array, but only if the dtype is "object". For a "float" column None is converted to nan.

But, GDAL (logically) doesn't support to write nan to a string column, hence the fix for this in #118

test_gdf
              geometry  float64 object_str
0  POINT (0.000 0.000)      1.0       test
1  POINT (0.000 0.000)      NaN       None
2  POINT (0.000 0.000)      NaN        NaN

theroggy avatar Jun 07 '22 20:06 theroggy

Because different filetypes treat them differently, the GDAL drivers do so as well. Some examples:

Shapefile seems to support both Nulls and NaNs (based on looking at the output of ogrinfo, I am only not fully sure how to assert this from Python).

I opened https://github.com/geopandas/pyogrio/issues/122 to switch to write NaN as null by default (I think that is the most logical option, certainly for writing starting from a (geo)pandas dataframe), but added an option to control it (I am not sure how useful it is to expose this as a public option, but it is also easy to add).

jorisvandenbossche avatar Jan 02 '23 09:01 jorisvandenbossche