SeriesModel -- support for defining an index on a series.
**Is your feature request related to a problem? Please describe.**
We have `SchemaModel`s, and we have inline types like `P.Series[float]`, but we don't have a way to specify the kind of index a series has. Consider this example function:
```python
import pandas as pd
import pandera.typing as P

def is_positive_datetime_series(x: P.Series[P.Int32]) -> P.Series[bool]:
    if not isinstance(x.index, pd.DatetimeIndex):
        # NotImplemented is not an exception class; NotImplementedError is
        raise NotImplementedError
    return x > 0
```
**Describe the solution you'd like**
I'd like to be able to specify the index on a series, for places in my codebase that pass series with specific index types between functions.
**Describe alternatives you've considered**
I've thought of two solutions that would be acceptable:
**Idea 1:**
A schema model that borrows the idea of `__root__` from pydantic:
```python
import pandera as pa
import pandera.typing as P

class DatetimeAmountSeries(pa.SchemaModel):
    index: P.Index[P.DateTime]
    __root__: P.Series[P.Int32]
```
**Idea 2:**
More annotated type options for `P.Series`:
```python
from typing import Annotated, TypeAlias

import pandera.typing as P

DatetimeAmountSeries: TypeAlias = Annotated[P.Series[P.Int32], P.Index[P.DateTime]]
```
**Additional context**
Oh, forgot to mention: both of the proposed ideas are already valid code, but they don't validate how I'd like them to.
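For concreteness, here's roughly the behavior I'd want from either idea (hypothetical; neither proposal validates like this today):

```python
import pandas as pd

# a series with the right dtype but a plain RangeIndex instead of a DatetimeIndex
s = pd.Series([1, 2, 3], dtype="int32")

# desired: raises a SchemaError because the index is not a DatetimeIndex
DatetimeAmountSeries.validate(s)
```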
Hi @zevisert, thanks for proposing this, it's a great idea!
I think the syntax of idea #1 will be more useful and expressive, since you can also provide `pa.Field` metadata and custom checks via the `@pa.check` decorator.
```python
class DatetimeAmountSeries(pa.SchemaModel):
    index: P.Index[P.DateTime]
    __root__: P.Series[P.Int32]
```
After reading the pydantic docs on custom root types, one question I have about the `__root__` keyword in pydantic is what specific use case it addresses. Basically, I want to make sure the semantics in pydantic and pandera match up.
A slight alternative to consider is that pandera should implicitly understand that a `SchemaModel` with a single `Series` attribute can validate series objects.
```python
class DatetimeAmountSeries(pa.SchemaModel):
    # check_name=True validates the Series name; if False, don't check it
    name: P.Series[P.Int32] = pa.Field(check_name=True)
    index: P.Index[P.DateTime]
```
We could introduce a `pa.FieldModel` to separate concerns and make single-field validation more explicit... I'm concerned that conflating the purpose of `SchemaModel` to validate both dataframes and series might lead to confusion.
On the other hand, it would be convenient to be able to reuse `SchemaModel`s for both data structures.
Do you have any thoughts @jeffzi ?
Idea 2 is very verbose and not intuitive. You need to remember the order of arguments, since the `Annotated` mechanism does not allow naming arguments.
> one question I have about the `__root__` keyword in pydantic is what specific use case it addresses.
I think the use case is JSON (schema) output. Consider:
```python
from typing import List

from pydantic import BaseModel

class Pets(BaseModel):
    species: List[str]

print(Pets(species=["dog", "cat"]).json())
#> {"species": ["dog", "cat"]}

class Pets(BaseModel):
    __root__: List[str]

print(Pets(__root__=["dog", "cat"]).json())
#> ["dog", "cat"]
```
The semantics do match up: `__root__` indicates the type of the modeled pandas object.
I guess the default `__root__` for regular `SchemaModel`s should be `__root__=DataFrame`, so that you can inherit a Series model and transform it into a dataframe model. `__root__=DataFrame[Schema]` seems dangerous though. Suppose your model inherits a base model `A` and you specify another model `B` in root: `__root__=DataFrame[B]` — it then becomes ambiguous which model's columns should be validated.
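To make the hazard concrete, a quick sketch (invented `__root__` syntax, mirroring the proposal above):

```python
# hypothetical syntax: which columns should C validate?
class A(pa.SchemaModel):
    col_a: P.Series[P.Int32]

class B(pa.SchemaModel):
    col_b: P.Series[P.Float64]

class C(A):
    __root__: P.DataFrame[B]  # col_a (inherited from A), col_b (from B), or both?
```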
> A slight alternative to consider is that pandera should implicitly understand that a `SchemaModel` with a single `Series` attribute can validate series objects.
Users will be required to name the unique "column" of the series even if they don't care about it. On the other hand, it circumvents the above problems and makes the model API consistent. We could introduce a `SeriesModel` to solve the dilemma:
```python
class DatetimeAmountSeries(pa.SeriesModel):
    name: P.Series[P.Int32] = pa.Field(check_name=True, ge=0)
    index: P.Index[P.DateTime]

class Schema(pa.SchemaModel):
    dttm_new_syntax: P.Series[DatetimeAmountSeries]  # ignore index validation
    dttm: P.Series[P.Int32] = pa.Field(ge=0)  # equivalent to dttm_new_syntax
```
The above syntax makes Series validation more reusable. At the moment, you can reuse a pre-defined `Field`, but you still have to specify the dtype in the annotation. You could say it's similar to pydantic custom types. That would also give us a better way to introduce non-native "dtypes": emails, paths, etc.
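For instance, an "email" series could look something like this under the proposed syntax (a sketch; `SeriesModel` doesn't exist yet, though `str_matches` is an existing `Field` argument):

```python
# hypothetical: a reusable "email" series type
class EmailSeries(pa.SeriesModel):
    name: P.Series[str] = pa.Field(str_matches=r"[^@\s]+@[^@\s]+\.[^@\s]+")

class Users(pa.SchemaModel):
    email: P.Series[EmailSeries]  # reuse the email checks without repeating the regex
```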
I'm down for introducing a new Model base class that handles this case, although I'd like to propose a slightly different name: `ArrayModel`.
```python
class DatetimeAmount(pa.ArrayModel):
    name: P.Int32 = pa.Field(check_name=True, ge=0)  # no need to specify `Series` type
    index: P.Index[P.DateTime]  # optional index, only for pandas.Series

class Schema(pa.SchemaModel):
    dttm_new_syntax: P.Series[DatetimeAmount]  # ignore index validation
    dttm: P.Series[P.Int32] = pa.Field(ge=0)  # equivalent to dttm_new_syntax
```
Then the array model can be used as a Series like so:
```python
from pandera.typing.pandas import Series

def function(series: Series[DatetimeAmount]): ...

# and eventually:
from pandera.typing.numpy import Array
from pandera.typing.pytorch import Tensor

def function(np_array: Array[DatetimeAmount]): ...
def function(torch_tensor: Tensor[DatetimeAmount]): ...
```
I think this strikes a nice balance of being specific enough to the pandas domain while being able to model all sorts of array-like data structures like numpy arrays, pytorch tensors, xarray.DataArray, and pandas.Series.
Basically the pattern I want to explore here is to:
- have `ArrayModel` encapsulate properties of a semantic (potentially n-dimensional) array
- have `SchemaModel` encapsulate properties of a dict-like mapping of keys to n-dimensional arrays
`pandas.DataFrame` and `xarray.Dataset` are basically a mapping of keys to "alignable" arrays according to some type of coordinate system (`pandas.Index`, or `coords` in xarray).
This might be a little ambitious, i.e. premature abstraction, but I do want to see how far we can take the whole idea of "define a schema once, use it to validate a bunch of different data container types".
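To make that concrete, here's what "define once, validate many containers" might look like (entirely hypothetical API, nothing here exists yet):

```python
import numpy as np
import pandas as pd
import pandera as pa
import pandera.typing as P

# one definition of a semantic array...
class Temperature(pa.ArrayModel):
    name: P.Float64 = pa.Field(ge=-273.15)

# ...hypothetically validating several container types
Temperature.validate(pd.Series([20.5, 21.0]))
Temperature.validate(np.array([20.5, 21.0]))
```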
thoughts @zevisert @jeffzi ?
I agree with what you laid out. I don't think it's premature. Pandera has started opening up to new data containers. I'd rather explore the ArrayModel idea before consolidating support for non-pandas libraries.
One nitpick though: I agree it's nice not having to specify `Series` typing for `ArrayModel`, but for consistency I think we shouldn't have to specify index typing either. I suggest an argument `pa.Field(index: bool)` that would only apply to "arrays" supporting an index.
Ditto on that @jeffzi! I think it's good timing to explore how we want to model your two bullet points @cosmicBboy.
Sure, a lower-level `ArrayModel` (maybe `pa.Matrix`??) makes sense to me.
I think the nitpick is reasonable. Pandas at least lets you get away with a default `RangeIndex` if not specified. Come to think of it, given that `pd.Series()` with no arguments produces the warning `The default dtype for empty Series will be 'object' instead of 'float64' in a future version.`, perhaps `P.Series[P.ArrayModel]` could be an allowable, albeit not that useful, way to express a series with no dtype or index, sort of like `class Lax(pandera.SchemaModel): pass` does.
> I suggest an argument `pa.Field(index: bool)` that would only apply to "arrays" supporting an index.
Cool! This sounds good to me... it's also nice because it doesn't shoehorn `ArrayModel` into using `pandera.typing.pandas.Index` as the index annotation.
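One possible reading of that suggestion, just to make it concrete (invented syntax):

```python
# hypothetical: the index is annotated with a plain dtype and marked via pa.Field(index=True)
class DatetimeAmount(pa.ArrayModel):
    name: P.Int32 = pa.Field(check_name=True, ge=0)
    index: P.DateTime = pa.Field(index=True)  # no pandera.typing.pandas.Index required
```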
@zevisert any interest in contributing a PR for this? @jeffzi is the expert when it comes to the SchemaModel stuff, but I can also help out with guidance if needed.
I'm not that versed in the `SchemaModel` classes, but taking a further step back, would it make sense to have a more granular level, i.e. a "value check" over primitive data types?
Rationale:

- checks can be composed into `map` operations or `aggregation` operations
- `map` operations check single values at a time (e.g. `is > 10`)
- `aggregation` operations use multiple values from a list/tensor/dataframe (e.g. `sum() > 10`)

So why not start at the most granular thing and build from there? (Both kinds are shown below with today's API.)
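For reference, this is roughly how that distinction already looks with pandera's existing `Check` API:

```python
import pandera as pa

# "map" check: element_wise=True applies the function to each value
map_check = pa.Check(lambda x: x > 10, element_wise=True)

# "aggregation" check: the function receives the whole series
agg_check = pa.Check(lambda s: s.sum() > 10)
```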
Example use case context: right now with Hamilton, people can return primitive types and they can't use pandera to express checks on them, e.g. a function that returns the mean of some series.
Just spitballing here, but this is what I believe I'm suggesting:
```python
class SpendAmount(pa.ValueModel):
    value: P.Int32 = pa.Field(ge=0, nan=False, le=1000)  # can only have a `value` field?

class DatetimeAmount(pa.ArrayModel):
    name: SpendAmount = pa.Field(mean=dict(ge=20, le=30))  # making this aggregation check syntax up
    index: P.Index[P.DateTime]  # optional index, only for pandas.Series

class MyDataFrameSchema(pa.SchemaModel):
    ...
```
Alternatively, if this doesn't fit here, then maybe a `ValueSchema` class analogous to `DataFrameSchema` and `SeriesSchema`?
Hi. Glad I found this, since I need a `SeriesModel`.
We have `DataFrameModel` plus `DataFrameSchema`, but we only have `SeriesSchema` and no `SeriesModel`.
My suggestion:
- The now-deprecated `SchemaModel` is undeprecated and contains all current code common between `DataFrameModel` and `SeriesModel`
- The new `DataFrameModel` is just `class DataFrameModel(SchemaModel): pass`
- The new `SeriesModel` follows the same principle, `class SeriesModel(SchemaModel): ...`, with some differences in behavior
Now, `SeriesModel` does some things differently from `DataFrameModel` (a sketch follows the list):
1. It forces exactly one column for the Model.
2. The column can be overridden with inheritance, to specify new metadata, but the name must be the same.
3. The default `Field(check_name=None)` for the single column in a Series is assumed as `False` during validation, since often we're not too concerned about a series name.
4. Following on 3, `check_name=...` can be set on the `pa.Field()` or as a `SeriesModel` Config value, e.g. `column_check_name=False|True`.
5. `SeriesModel.to_schema()` obviously returns a `SeriesSchema` object.
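A minimal sketch of the proposed `SeriesModel` in use (hypothetical, following the points above):

```python
import pandera as pa
import pandera.typing as P

# hypothetical: SeriesModel does not exist yet
class AmountSeries(pa.SeriesModel):
    amount: P.Series[P.Int32] = pa.Field(ge=0)  # name check off by default (point 3)
    index: P.Index[P.DateTime]

schema = AmountSeries.to_schema()  # would return a SeriesSchema (point 5)
```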
So, this matches a bit what @cosmicBboy said, e.g.
> A slight alternative to consider is that pandera should implicitly understand that a `SchemaModel` with a single `Series` attribute can validate series objects.
However, I don't entirely agree with:
> I'm down for introducing a new Model base class that handles this case, although I'd like to propose a slightly different name: `ArrayModel`.
Since everyone who works with pandas knows what a `DataFrame` and a `Series` are, those names should be reused.
To conclude, the advantage of this proposal is that the changes needed to implement it are minimal and reuse much of the existing classes.