How to properly extend pandas_engine.DataType to support geopandas
Hi! I'm trying to use pandera with GeoPandas - I think all I should need to do to make it work is add support for the geometry column by registering BaseGeometry as a DataType. However, I'm struggling to get it to work - any suggestions?
import dataclasses
import shapely.geometry.base
from pandera import SchemaModel, dtypes, Field
from pandera.engines import pandas_engine
from pandera.engines.pandas_engine import DataType
from pandera.typing import Series, DataFrame
from shapely.geometry import box
@pandas_engine.Engine.register_dtype
@dtypes.immutable
class BaseGeometry(DataType):
    type: shapely.geometry.base.BaseGeometry = dataclasses.field(default=None, init=False)


class GeoDataFrameSchema(SchemaModel):
    geometry: Series[BaseGeometry] = Field()


df = DataFrame[GeoDataFrameSchema]({'geometry': [box(0, 0, 100, 100), box(200, 0, 400, 200)]})
File "typed_dataframes.py", line 21, in <module>
df = DataFrame[GeoDataFrameSchema]({'geometry': [box(0, 0, 100, 100), box(200, 0, 400, 200)]})
File "/usr/local/Cellar/python@3.8/3.8.12_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/typing.py", line 731, in __call__
result.__orig_class__ = self
File "/venv/lib/python3.8/site-packages/pandera/typing/common.py", line 118, in __setattr__
self = schema_model.validate(self)
File "/venv/lib/python3.8/site-packages/pandera/model.py", line 261, in validate
cls.to_schema().validate(
File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 485, in validate
return self._validate(
File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 659, in _validate
error_handler.collect_error("schema_component_check", err)
File "/venv/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
raise schema_error from original_exc
File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 651, in _validate
result = schema_component(
File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 1986, in __call__
return self.validate(
File "/venv/lib/python3.8/site-packages/pandera/schema_components.py", line 223, in validate
validate_column(check_obj, column_name)
File "/venv/lib/python3.8/site-packages/pandera/schema_components.py", line 196, in validate_column
super(Column, copy(self).set_name(column_name)).validate(
File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 1919, in validate
error_handler.collect_error(
File "/venv/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'geometry' to have type None, got object
Hi @roshcagra, interesting question !
pandera.errors.SchemaError: expected series 'geometry' to have type None, got object
It's a little confusing to trace back the source of the error, but here is the gist of it. Your BaseGeometry.type is None by default and your SchemaModel does not provide a type. During validation, pandera will call BaseGeometry.check(), inherited from pandas_engine.DataType, which leverages the type argument. The type is supposed to be something understood by pandas. You can pass arguments to the dtype with the following syntax:
class GeoDataFrameSchema(SchemaModel):
    geometry: Series[BaseGeometry] = Field(dtype_kwargs={"type": APPROPRIATE_TYPE})  # APPROPRIATE_TYPE = ?
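To make the failure mode concrete, here is a minimal standard-library sketch of the comparison described above. ToyDataType and ToyGeometry are hypothetical stand-ins invented for illustration; they are not pandera's classes and only mimic the idea that the dtype check compares an expected type against the observed one.

```python
import dataclasses
from typing import Any


@dataclasses.dataclass(frozen=True)
class ToyDataType:
    """Hypothetical stand-in for a data type wrapper (not pandera's real class)."""

    # Mirrors the original snippet: the type defaults to None and
    # cannot be supplied through __init__.
    type: Any = dataclasses.field(default=None, init=False)

    def check(self, observed_dtype: str) -> bool:
        # Compare the expected type against the dtype found on the column.
        return str(self.type) == observed_dtype


@dataclasses.dataclass(frozen=True)
class ToyGeometry(ToyDataType):
    # A concrete default gives the check something meaningful to compare with.
    type: Any = dataclasses.field(default="geometry", init=False)


print(ToyDataType().check("object"))    # False: expected None, got object
print(ToyGeometry().check("geometry"))  # True
```

With the type left as None, no real column dtype can ever match, which is consistent with the "expected series 'geometry' to have type None, got object" message.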
I'm not familiar with geopandas or shapely, and I couldn't see any usage of subclasses of shapely.geometry.base.BaseGeometry in the geopandas getting started tutorial. Geopandas seems to have only a single dtype, GeometryDtype. I would use it to benefit from the work that has already been done by geopandas.
import geopandas
import pandera as pa
from pandera.engines import pandas_engine
@pandas_engine.Engine.register_dtype(
    equivalents=[  # Let pandera know how to translate this data type from other objects
        "geometry",
        geopandas.array.GeometryDtype,
        geopandas.array.GeometryDtype(),
    ]
)
@pa.dtypes.immutable
class Geometry(pandas_engine.DataType):
    type = geopandas.array.GeometryDtype()


class GeoDataFrameSchema(pa.SchemaModel):
    geometry: pa.typing.Series[Geometry]
    BoroCode: pa.typing.Series[Geometry]  # should fail (contains int)
    BoroName: pa.typing.Series[Geometry]  # should fail (contains object)


gdf = geopandas.read_file(geopandas.datasets.get_path("nybb"))
gdf.info()
#> <class 'geopandas.geodataframe.GeoDataFrame'>
#> RangeIndex: 5 entries, 0 to 4
#> Data columns (total 5 columns):
#> # Column Non-Null Count Dtype
#> --- ------ -------------- -----
#> 0 BoroCode 5 non-null int64
#> 1 BoroName 5 non-null object
#> 2 Shape_Leng 5 non-null float64
#> 3 Shape_Area 5 non-null float64
#> 4 geometry 5 non-null geometry
#> dtypes: float64(2), geometry(1), int64(1), object(1)
#> memory usage: 328.0+ bytes
# verify that pandera recognizes geometry
print(repr(pandas_engine.Engine.dtype(gdf["geometry"].dtype)))
#> DataType(geometry)
GeoDataFrameSchema.validate(gdf, lazy=True)
#> Traceback (most recent call last):
#> ...
#> SchemaErrors: A total of 2 schema errors were found.
#> Error Counts
#> - schema_component_check: 2
#> Schema Error Summary
#> failure_cases n_failure_cases
#> schema_context column check
#> Column BoroCode dtype('geometry') [int64] 1
#> BoroName dtype('geometry') [object] 1
#> Usage Tip
#> Directly inspect all errors by catching the exception:
#> ```
#> try:
#> schema.validate(dataframe, lazy=True)
#> except SchemaErrors as err:
#> err.failure_cases # dataframe of schema errors
#> err.data # invalid dataframe
#> ```
^ The draft above is not tested but works in this basic example.
@roshcagra Would you be interested in extending this snippet and contributing proper geopandas support? I'm sure other geopandas users would benefit from schema validation !
Thank you so much @jeffzi !
I would love to expand on this and contribute. What would be the best way to do that? Maybe build another package that depends on pandera and adds support for geopandas? I'm guessing we wouldn't want to add it directly here because you wouldn't want to take geopandas and shapely on as dependencies.
I would love to expand on this and contribute.
Awesome, thanks !
I think a module inside the core pandera repo should suffice. We have done this before by adding optional dependencies. See setup.py and the strategies module, which requires hypothesis to function.
Besides the GeometryDtype, are there GeoDataFrame specificities that are relevant to schema validation? If it's only the dtype, we could have the class in engines.pandas_engine and tests in tests/geopandas/test_geopandas.py (easier for CI to install appropriate dependencies).
Pinging @cosmicBboy to confirm the approach.
thanks for your help @roshcagra!
yes the approach described by @jeffzi is the way to go.
Out of curiosity, I have a few questions (as someone who hasn't used geopandas before):
- is there any meaningful way in which a GeometryDtype is coerced from some other raw format, for e.g. as "1" -> int("1") as "some raw value" -> GeometryDtype("some raw value")?
- Similarly, does this operation happen geodataframe.astype({"geometry": GeometryDtype}) as a user of the library?
- I also see there are specific data types in shapely like Point and Polygon... would it make sense to have those as types as well?
I think for now a GeometryDtype that doesn't do any type coercion and simply does a type check would be a good first pass for geopandas support. Then, any additional checks on the geometry dtype column would be done via custom checks.
Let us know if you have any other questions re: contributing!
Hey @cosmicBboy, I'm a contributor to geopandas and coincidentally just started looking at using geopandas with pandera today. But I might be able to give some clarity on those questions (sorry this is a longer write up than I thought it would be).
is there any meaningful way in which a GeometryDtype is coerced from some other raw format, for e.g. as "1" -> int("1") as "some raw value" -> GeometryDtype("some raw value")
First, there are Well Known Text (WKT) and Well Known Binary (WKB), which I suppose are analogues of raw formats, but these are not coerced with astype or convert_dtypes. Instead they're covered by classmethods: geopandas.GeoSeries.from_wkt, where GeoSeries is the geopandas subclass of pandas.Series for geometry data.
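As a small illustration of that classmethod route, the sketch below parses WKT strings into a GeoSeries. It assumes a geopandas version that provides GeoSeries.from_wkt; the sample points are made up.

```python
import geopandas as gpd
import pandas as pd

# Raw WKT strings, as they might arrive from a CSV file (made-up sample data).
raw = pd.Series(["POINT (1 1)", "POINT (2 2)"])

# Parsing goes through the dedicated classmethod rather than astype().
geoms = gpd.GeoSeries.from_wkt(raw)

print(type(geoms).__name__)      # GeoSeries
print(geoms.dtype)               # geometry
print(geoms.geom_type.tolist())  # ['Point', 'Point']
```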
For the second point,
does this operation happen geodataframe.astype({"geometry": GeometryDtype}) as a user of the library
there is a case of casting with astype like this, with an array-like of shapely geometries (and potentially also pygeos geometries, but that's tangential):
In [1]: from shapely.geometry import Point
In [2]: gdf = gpd.GeoDataFrame({'foo':[1,2], 'bar':[Point(1,1), Point(2,2)]}, geometry=[Point(1,1), Point(2,1)])
In [3]: gdf
Out[3]:
foo bar geometry
0 1 POINT (1 1) POINT (1.00000 1.00000)
1 2 POINT (2 2) POINT (2.00000 1.00000)
In [4]: gdf.dtypes
Out[4]:
foo int64
bar object
geometry geometry
dtype: object
'bar' has object dtype (geometry has been converted properly because it is the designated "geometry column", which has special casting checks applied to it), which can be fixed with astype:
In [5]: gdf.astype({'bar':'geometry'})
Out[5]:
foo bar geometry
0 1 POINT (1.00000 1.00000) POINT (1.00000 1.00000)
1 2 POINT (2.00000 2.00000) POINT (2.00000 1.00000)
In [6]: gdf.astype({'bar':'geometry'}).dtypes
Out[6]:
foo int64
bar geometry
geometry geometry
dtype: object
In [7]: gdf.astype({'bar':gpd.array.GeometryDtype()}).dtypes
Out[7]:
foo int64
bar geometry
geometry geometry
dtype: object
The other important thing this does is convert 'bar' from being a Series to a GeoSeries, which has properties like e.g. area.
In [8]: type(gdf['bar'])
Out[8]: pandas.core.series.Series
In [9]: type(gdf.astype({'bar':gpd.array.GeometryDtype()})['bar'])
Out[9]: geopandas.geoseries.GeoSeries
So this can happen as a user of the library, but I would say it is possible rather than common. Usually one is better off doing something like this:
srs = gpd.GeoSeries([Point(1,1), Point(2,2)], crs='epsg:4326')
gdf = gpd.GeoDataFrame({'foo':[1,2], 'bar':srs}, geometry=[Point(1,1), Point(2,1)])
the advantage of which is that bar can be specified with a coordinate reference system (CRS), which encodes information about the projection of geometry on the earth's surface to a cartesian plane for e.g. area and distance calculations.
I also see there are specific data types in shapely like Point and Polygon... would it make sense to have those as types as well?
Geopandas can store Points, Polygons, Multipoints, ... all in the same GeoSeries with the same extension array GeometryDtype.
Perhaps there is value in validating those types explicitly for certain workloads. There is GeoSeries.geom_type, which returns an object array where each row holds the geometry type name (Point, Polygon, MultiPoint, ...), but I've never needed to do this - I didn't really know that attribute existed until writing this. Usually the limiting factor in this aspect is the underlying geometry data source: most GIS file formats only support geometry columns of homogeneous types, so validating this on the geopandas side doesn't tend to come up.
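For workloads that do want this, a homogeneity test on geom_type could look roughly like the following. The helper name all_of_geom_type is hypothetical, and wiring it into a pandera check is left out because that is not exercised here.

```python
import geopandas as gpd
from shapely.geometry import Point, box


def all_of_geom_type(series: gpd.GeoSeries, expected: str) -> bool:
    """Return True if every geometry in the series has the expected type name."""
    return bool((series.geom_type == expected).all())


points = gpd.GeoSeries([Point(1, 1), Point(2, 2)])
mixed = gpd.GeoSeries([Point(1, 1), box(0, 0, 1, 1)])  # a Point and a Polygon

print(all_of_geom_type(points, "Point"))  # True
print(all_of_geom_type(mixed, "Point"))   # False
```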
I'd be quite keen to see this in pandera, happy to help if I can - @roshcagra seems keen to get started so I won't duplicate effort there.
Also, just wanted to say that pandera has been a really useful tool, thanks for developing and improving it.
Just adding on to the above question by @jeffzi
Besides the GeometryDtype, are there GeoDataFrame specificities that are relevant to schema validation?
There are two special NDFrame._metadata fields, _metadata = ["_crs", "_geometry_column_name"], which perhaps could warrant special handling (but to be honest I don't really know how that would work on the pandera side). I also don't feel it's essential: for my use case, I want to specify a geometry column in my schema and that's probably enough (although knowing the crs isn't None would be nice).
- "_geometry_column_name" stores the column name of the "active" geometry column, which is the column that spatial operations e.g. buffer, area, spatial join are performed with respect to. This column is aliased as gdf.geometry. So I could see value in this being checked in some way - this is also something that could be "coerced" via a call to GeoDataFrame.set_geometry.
- "_crs" (accessed from the public api as gdf.crs) stores the coordinate reference system of the active geometry column on the geodataframe itself. Individual GeoSeries also have their own crs attribute, which are not necessarily always the same. So this could also potentially be validated on the series level, but would also add complexity.
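The two pieces of metadata can be inspected through the public API, as in this rough sketch. The frame, column names and CRS codes are made up for illustration, and only the public accessors (gdf.geometry, gdf.crs, set_geometry) are used rather than the private _metadata fields.

```python
import geopandas as gpd
from shapely.geometry import Point

bar = gpd.GeoSeries([Point(1, 1), Point(2, 2)], crs="EPSG:4326")
gdf = gpd.GeoDataFrame(
    {"foo": [1, 2], "bar": bar},
    geometry=[Point(1, 1), Point(2, 1)],
    crs="EPSG:4326",
)

# Name of the "active" geometry column and the frame-level CRS.
print(gdf.geometry.name)    # geometry
print(gdf.crs is not None)  # True

# Switching the active geometry column is the "coercion" mentioned above.
switched = gdf.set_geometry("bar")
print(switched.geometry.name)  # bar

# A standalone GeoSeries carries its own CRS, which may differ from the frame's.
other = gpd.GeoSeries([Point(0, 0)], crs="EPSG:3857")
print(other.crs == gdf.crs)  # False
```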
Thanks for the detailed analysis @m-richards, and I'm glad you're finding pandera useful!
It seems like GeoSeries and GeoDataFrame already do a lot of the heavy lifting in terms of checking types of GeometryDtype arrays, and adding support for GeometryDtype, which is @roshcagra's use case, would cover a majority of the type-checking use cases.
Once there's a pandera.Geometry data type, adding a coerce method that does astype("geometry") would be pretty straightforward.
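As a rough idea of what such a coercion might do, here is a standalone function rather than an actual pandera method. The name coerce_to_geometry is hypothetical, and the sketch assumes importing geopandas registers the "geometry" extension dtype with pandas.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point


def coerce_to_geometry(series: pd.Series) -> gpd.GeoSeries:
    """Cast a series of shapely objects to the geometry dtype (illustrative only)."""
    # astype("geometry") resolves to geopandas' extension dtype;
    # wrapping in GeoSeries exposes spatial properties such as area.
    return gpd.GeoSeries(series.astype("geometry"))


raw = pd.Series([Point(1, 1), Point(2, 2)])  # shapely objects in a plain Series
coerced = coerce_to_geometry(raw)

print(type(coerced).__name__)  # GeoSeries
print(coerced.dtype)           # geometry
```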
Since pandera allows for parameterized dtypes, some nice future work would be to support this kind of syntax:
Geometry # array can contain any type of geometry
Geometry("Point", crs="epsg:4326") # only points of a specific crs
Geometry("Polygon", crs=...) # only polygons
Geometry("Multipoint", crs=...) # only multipoints
# - or specific dtype classes -
Geometry
Point(crs=...)
Polygon(crs=...)
Multipoint(crs=...)
And then use custom pa.Checks / registering custom checks to implement more specific validation rules.
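To show what a parameterized geometry type might validate, here is a sketch that is independent of pandera. GeometrySpec is a hypothetical name invented here, not the proposed pandera API. It assumes the CRS object on a GeoSeries can be compared against a string such as "EPSG:4326", which relies on pyproj's equality handling.

```python
import dataclasses
from typing import Optional

import geopandas as gpd
from shapely.geometry import Point


@dataclasses.dataclass(frozen=True)
class GeometrySpec:
    """Hypothetical parameterized geometry type: optional geom type and CRS."""

    geom_type: Optional[str] = None  # e.g. "Point"; None accepts any geometry
    crs: Optional[str] = None        # e.g. "EPSG:4326"; None skips the CRS check

    def check(self, series: gpd.GeoSeries) -> bool:
        if self.geom_type is not None and not (series.geom_type == self.geom_type).all():
            return False
        if self.crs is not None and series.crs != self.crs:
            return False
        return True


pts = gpd.GeoSeries([Point(1, 1), Point(2, 2)], crs="EPSG:4326")

print(GeometrySpec().check(pts))                          # True: anything goes
print(GeometrySpec("Point", crs="EPSG:4326").check(pts))  # True
print(GeometrySpec("Polygon").check(pts))                 # False
```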
Let me know if you have any more thoughts @roshcagra @jeffzi @m-richards !
@cosmicBboy @jeffzi @m-richards
This is my first pass:
from typing import Union
import pandas as pd
import geopandas as gpd
import pandera as pa
from pandera.engines import pandas_engine
from pandera.typing import DataFrame
from pandera.typing.common import SeriesBase
from pandera.typing.pandas import T
GeoPandasObject = Union[gpd.GeoSeries, pd.Index, gpd.GeoDataFrame]
@pandas_engine.Engine.register_dtype(
equivalents=[ # Let pandera know how to translate this data type from other objects
"geometry",
gpd.array.GeometryDtype,
gpd.array.GeometryDtype(),
]
)
@pa.dtypes.immutable
class Geometry(pandas_engine.DataType):
type = gpd.array.GeometryDtype()
    def coerce(self, data_container: pd.Series) -> gpd.GeoSeries:
        return gpd.GeoSeries.from_wkt(data_container)


class GeoSeries(SeriesBase[gpd.array.GeometryDtype], gpd.GeoSeries):
    """Representation of geopandas.GeoSeries, only used for type annotation."""

    pass


class GeoDataFrame(DataFrame[T], gpd.GeoDataFrame):
    """Representation of geopandas.GeoDataFrame, only used for type annotation."""

    pass
Let me know what you think!
@m-richards Thanks, I appreciate you taking the time to explain in detail!
And then use custom pa.Checks / registering custom checks to implement more specific validation rules.
Yes, my question was about whether we'd need to add built-in checks to better support geopandas but it does not seem to be necessary.
Let me know if you have any more thoughts @roshcagra @jeffzi @m-richards !
It will be important to test that validate does return a geopandas GeoDataFrame if presented with one, and preserves the geopandas metadata attributes kindly listed by @m-richards.
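A test along those lines might rely on a small helper like the one below. The helper name assert_geo_metadata_preserved is hypothetical. In a real test, the second argument would be the output of the schema's validate call; a plain copy stands in for it here so the sketch stays independent of pandera.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point


def assert_geo_metadata_preserved(before: gpd.GeoDataFrame, after: pd.DataFrame) -> None:
    """Fail if `after` lost its GeoDataFrame type, active geometry column or CRS."""
    assert isinstance(after, gpd.GeoDataFrame), "result is not a GeoDataFrame"
    assert after.geometry.name == before.geometry.name, "active geometry column changed"
    assert after.crs == before.crs, "CRS changed"


gdf = gpd.GeoDataFrame(
    {"foo": [1, 2]}, geometry=[Point(1, 1), Point(2, 1)], crs="EPSG:4326"
)

# Stand-in for `validated = schema.validate(gdf)` in an actual test.
validated = gdf.copy()
assert_geo_metadata_preserved(gdf, validated)
```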
@roshcagra Looking good so far! Tbh, tests will reveal potential issues. You can add a mapping dtype: string alias in test_dtypes:
https://github.com/pandera-dev/pandera/blob/7664092b020288b245071c18b07e1df356ae1515/tests/core/test_dtypes.py#L86-L92
and examples here https://github.com/pandera-dev/pandera/blob/7664092b020288b245071c18b07e1df356ae1515/tests/core/test_dtypes.py#L152-L153
That will add your new Geometry data type to the test suite. test_dtypes.py is rather complicated, don't hesitate to let me know if you need help at any point.
I'm curious about SeriesBase, is that a rename of pandera.typing.Series?
@cosmicBboy We should probably factor out the basic data type tests to facilitate testing of new data types and even use for koalas/modin testing.
@jeffzi @cosmicBboy PR for this up here: https://github.com/pandera-dev/pandera/pull/698