How to handle NaNs in the input when writing? (as NaN or as Null)
See https://github.com/geopandas/pyogrio/pull/118/files/530b5d4c68a3380d2a43caa3a9fe45b3ca717aa3..cb28a2e379874bf1f5614b309527bbf24236ea96#r890345534 where this came up, copying the relevant comment here:
> np.nan values in an OFTReal (float) column are automatically treated correctly (as null) by GDAL (OGR_F_SetFieldDouble), so there is no need to explicitly call OGR_F_SetFieldNull for OFTReal columns. This is only needed for object columns.
Actually, that's not fully correct, I think. They are written as NaN and not as Null. Since numpy/pandas only support NaN in float arrays, that's not an issue for a correct roundtrip for geopandas -> gdal -> geopandas, though.
Small illustrations:
# write file with GDAL + pyogrio
import numpy as np
import geopandas
import pyogrio
gdf = geopandas.GeoDataFrame({'col': [0.1, np.nan]}, geometry=geopandas.points_from_xy([0, 1], [0, 1]))
pyogrio.write_dataframe(gdf, "test_nulls_pyogrio.arrow", driver="Arrow")
# write file with pyarrow that includes both NaN and Null (Arrow distinguishes both)
import pyarrow as pa
from pyarrow import feather
feather.write_feather(pa.table({"col": [0.1, np.nan, None]}), "test_nulls_pyarrow.arrow")
And check both using GDAL's ogrinfo:
(gdal-dev) $ ogrinfo test_nulls_pyogrio.arrow -al
INFO: Open of `test_nulls_pyogrio.arrow'
using driver `Arrow' successful.
Layer name: test_nulls_pyogrio
Geometry: Point
Feature Count: 2
Extent: (0.000000, 0.000000) - (1.000000, 1.000000)
Layer SRS WKT:
(unknown)
Geometry Column = geometry
col: Real (0.0)
OGRFeature(test_nulls_pyogrio):0
col (Real) = 0.1
POINT (0 0)
OGRFeature(test_nulls_pyogrio):1
col (Real) = nan
POINT (1 1)
(gdal-dev) $ ogrinfo test_nulls_pyarrow.arrow -al
INFO: Open of `test_nulls_pyarrow.arrow'
using driver `Arrow' successful.
Layer name: test_nulls_pyarrow
Geometry: None
Feature Count: 3
Layer SRS WKT:
(unknown)
col: Real (0.0)
OGRFeature(test_nulls_pyarrow):0
col (Real) = 0.1
OGRFeature(test_nulls_pyarrow):1
col (Real) = nan
OGRFeature(test_nulls_pyarrow):2
col (Real) = (null)
So when we are writing float columns that contain NaNs, we have two options: write those as NaN or as Null.
Since the input data uses NaN, writing them as NaN (as we do currently for float data) might look the most correct. But semantically, NaNs in pandas are used to indicate missing data, and thus might be more correctly represented as Null.
As a reference: in pyarrow (which distinguishes NaN and Null in float data, similar to GDAL), the conversion layer will actually convert NaNs to Nulls when the input comes from pandas, but preserve the NaNs when the data comes from numpy or a pure Python list.
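To illustrate that difference (a small sketch of the pyarrow conversion semantics as I understand them):

# convert the same float data to pyarrow from different sources
import numpy as np
import pandas as pd
import pyarrow as pa

# coming from pandas: NaN is interpreted as missing and becomes Null
pa.array(pd.Series([0.1, np.nan])).null_count   # 1
# coming from numpy or a plain Python list: NaN is preserved as NaN
pa.array(np.array([0.1, np.nan])).null_count    # 0
pa.array([0.1, np.nan]).null_count              # 0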
I'm not yet clear on what we should be doing here. Given that we are writing from numpy arrays to GDAL, we don't have the ability to represent both np.nan and None (Null) as distinct values. So we either write all np.nan or set all np.nan values to Null, but can't do some mix like Arrow, right?
For float fields, we convert Null values back to np.nan, so I think in either case we can roundtrip data and the caller would see them as the same values. The only thing that would be different is if they are stored as NaN or Null in the underlying dataset consumed by potentially other tools.
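A small sketch of that roundtrip from the caller's side (writing to GeoPackage here; whichever way the value ends up stored, the float field is read back as NaN):

import numpy as np
import geopandas
import pyogrio

gdf = geopandas.GeoDataFrame(
    {"col": [0.1, np.nan]}, geometry=geopandas.points_from_xy([0, 1], [0, 1])
)
pyogrio.write_dataframe(gdf, "test_roundtrip.gpkg")

# the missing value comes back as np.nan, regardless of whether it was
# stored as NaN or as Null in the file
result = pyogrio.read_dataframe("test_roundtrip.gpkg")
result["col"].isna().tolist()   # [False, True]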
It would indeed be either all NaN or all Null, not a mix of them (which makes sense, because also pandas and numpy cannot represent a mix and so can't pass that to pyogrio)
> The only thing that would be different is if they are stored as NaN or Null in the underlying dataset consumed by potentially other tools.
Yes, I think this should be the main consideration (as the roundtrip case for pyogrio/geopandas itself works fine either way). Other libraries using GDAL to read data might expect Nulls instead of NaNs for missing data (no idea how e.g. QGIS handles this, though).
Sidenote: this also relates to potentially handling the experimental nullable dtypes of pandas more natively in pyogrio. For example, nullable integer data is represented with a data array + a mask array, and in theory we could pass that mask down to ogr_write as well and use it to check whether to use OGR_F_SetFieldNull.
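A small sketch of what I mean with the data + mask representation (using plain pandas calls to get at both arrays; how that mask would actually be passed down to ogr_write is the open part):

import pandas as pd

s = pd.Series([1, None, 3], dtype="Int64")     # pandas nullable integer dtype

# boolean mask marking the missing entries
mask = s.isna().to_numpy()                     # array([False,  True, False])
# plain int64 data array (the fill value for masked entries is arbitrary)
data = s.to_numpy(dtype="int64", na_value=0)   # array([1, 0, 3])

# in theory ogr_write could receive data + mask and call OGR_F_SetFieldNull
# for the masked entries instead of writing a sentinel value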
At least GDAL makes an effort to deal properly with NaN vs Null. Because different file types treat them differently, the GDAL drivers do so as well. Some examples:
- in GeoJSON, NaN is not supported according to the standard, so GDAL doesn't write NaN values. FYI, using a layer creation option you can change this behaviour (https://gdal.org/drivers/vector/geojson.html#layer-creation-options).
- in GeoPackage, NaN values are saved as Null (a quick way to check this is sketched below)
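For GeoPackage this is easy to check by hand, because a GeoPackage is just a SQLite file (rough sketch, assuming the layer/table name defaults to the file stem):

import sqlite3

import numpy as np
import geopandas
import pyogrio

gdf = geopandas.GeoDataFrame(
    {"col": [0.1, np.nan]}, geometry=geopandas.points_from_xy([0, 1], [0, 1])
)
pyogrio.write_dataframe(gdf, "test_nulls.gpkg")

# look at the raw stored values: if the driver converts NaN to Null,
# the second value shows up as None
with sqlite3.connect("test_nulls.gpkg") as con:
    print(con.execute("SELECT col FROM test_nulls").fetchall())
# expected: [(0.1,), (None,)]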
To be complete, it is possible to combine None and NaN in one numpy array, but only if the dtype is "object". For a "float" column, None is converted to NaN. But GDAL (logically) doesn't support writing NaN to a string column, hence the fix for this in #118:
test_gdf
geometry float64 object_str
0 POINT (0.000 0.000) 1.0 test
1 POINT (0.000 0.000) NaN None
2 POINT (0.000 0.000) NaN NaN
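A sketch of how a frame like the one above can be constructed; the mix of None and NaN only survives because the column has "object" dtype:

import numpy as np
import geopandas

test_gdf = geopandas.GeoDataFrame(
    {
        "float64": [1.0, None, np.nan],        # inferred as float64: None becomes NaN
        "object_str": ["test", None, np.nan],  # object dtype keeps None and NaN distinct
    },
    geometry=geopandas.points_from_xy([0, 0, 0], [0, 0, 0]),
)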
> Because different file types treat them differently, the GDAL drivers do so as well. Some examples:
Shapefile seems to support both Nulls and NaNs (based on looking at the output of ogrinfo; I am just not fully sure how to assert this from Python).
I opened https://github.com/geopandas/pyogrio/issues/122 to switch to writing NaN as Null by default (I think that is the most logical option, certainly when writing starting from a (geo)pandas dataframe), but added an option to control it (I am not sure how useful it is to expose this as a public option, but it is also easy to add).
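A sketch of how such an option could look from the caller's side (the keyword name nan_as_null here is only illustrative, not the settled API; see the issue above for the actual discussion):

import numpy as np
import geopandas
import pyogrio

gdf = geopandas.GeoDataFrame(
    {"col": [0.1, np.nan]}, geometry=geopandas.points_from_xy([0, 1], [0, 1])
)

# proposed default: NaN values are written as Null
pyogrio.write_dataframe(gdf, "out_null.gpkg")
# hypothetical opt-out to keep writing NaN as NaN
pyogrio.write_dataframe(gdf, "out_nan.gpkg", nan_as_null=False)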