Dataframe validation check passing when it shouldn't
I'm finding that schema validation is intermittently passing in cases where I would expect it to fail. This only seems to occur when I construct the dataframe directly as a pandera.DataFrame. Weirdly, a copy of the invalid dataframe that evades the validation does fail. See example below...
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample
import pandas as pd
import pandera as pa
import pandera.typing as pat

class Schema(pa.DataFrameModel):
    x: pat.Series[int] = pa.Field(isin=[0, 1], coerce=True)

@pa.check_types
def f(df: pat.DataFrame[Schema]) -> None:
    print(df)

# create a valid pandas DF then break it
df = pd.DataFrame(data={"x": [0, 0, 0]})
df.loc[2, "x"] = 3

# (1) throws as expected, but mypy needs silencing
# f(df)  # type: ignore[arg-type]

# now create a DF using the pandera type, then break it
df_with_schema = pat.DataFrame[Schema](data={"x": [0, 0, 0]})
df_with_schema.loc[2, "x"] = 3

# (2) no mypy errors but this doesn't throw...
f(df_with_schema)

# (3) ... yet a copy of it does, because it's not the original type, as mypy confirms
f(df_with_schema.copy())  # type: ignore[arg-type]
Expected behaviour
1. `df` is invalid and the checked function raises a `SchemaError`. This works.
2. `typed_df` is invalid and the checked function should raise a `SchemaError`, but it doesn't.
3. The copy of `typed_df` is of course also invalid, and the checked function raises a `SchemaError`. This works.
Environment
- OS: linux
- Python: 3.9.16
- Pandas: 1.5.3
- Pandera: 0.14.5
The observed behaviour in (2) seems incorrect to me. I suspect that the reason (3) works is that the copy has been sliced down to a pd.DataFrame. Do you agree?
PS: I really like this package, BTW.
My colleague has pointed out that adding the with_pydantic=True parameter to the decorator gives the expected behaviour in (2) for inputs to the checked function, but the problem persists for return values.
Hi @andrew-infogrid, this is indeed a bug! Very strange... it may have to do with check_types not understanding pat.DataFrame objects; will need to investigate.
Oh, I see what's happening. So this is a gotcha due to in-place mutation of data. When you define a pandera-typed dataframe with pat.DataFrame[Schema](...), pandera injects a df.pandera.schema into the dataframe.
Pandera takes care not to re-validate already-validated data, but it has no way of knowing whether you've mutated the dataframe in-place. It determines whether a dataframe is already schema-validated in check_types here by checking for the presence of a pandera.schema object on the pandas dataframe and whether it's the same as the schema in the underlying annotation.
The reason it's raising an error when you copy the dataframe is that the df.pandera.schema is erased on copy.
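To make the mechanics concrete, here is a minimal sketch of the behavior described above (written against pandera 0.14.x / pandas 1.5.x; the accessor internals are implementation details and may change):

```python
import pandera as pa
import pandera.typing as pat

class Schema(pa.DataFrameModel):
    x: pat.Series[int] = pa.Field(isin=[0, 1], coerce=True)

# validated once, at construction time
df = pat.DataFrame[Schema](data={"x": [0, 0, 0]})
print(df.pandera.schema)         # the schema injected by pandera

df.loc[2, "x"] = 3               # in-place mutation leaves the schema in place
print(df.pandera.schema)         # still set, so check_types skips re-validation

print(df.copy().pandera.schema)  # None: the schema is erased on copy
```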
I'd be open to having a discussion about changing this behavior to make it less surprising, but I also don't want to incur an additional runtime hit in check_types by always validating dataframes, even when they've previously been validated by the exact same schema.
Two solutions to this I can think of, not mutually exclusive:
- Better documentation on "pandera gotchas". In this case, document the behavior of in-place mutation.
- Patch the `pandera.typing.DataFrame` methods, including `loc` and perhaps others, such that when they are called the `pandera.schema` object is removed from the dataframe (a rough sketch follows below).
Open to other ideas too!
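To make the second option concrete, here is a rough sketch of what such a patch might look like. This is only an illustration, not pandera code: it assumes pandas caches the pandera accessor (and with it the schema) in the dataframe's instance __dict__, and it only covers plain df[...] = ... assignment; loc/iloc assignment goes through separate indexer objects that would need the same treatment.

```python
import pandera.typing as pat

_original_setitem = pat.DataFrame.__setitem__

def _setitem_and_invalidate(self, key, value):
    # dropping the cached accessor discards the schema recorded at
    # construction, so check_types re-validates on the next call
    # (assumes the accessor is cached in the instance __dict__)
    self.__dict__.pop("pandera", None)
    _original_setitem(self, key, value)

pat.DataFrame.__setitem__ = _setitem_and_invalidate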
I hope I haven't opened a can of worms here.
First off, I now understand that validation happens only on construction of the pandera DataFrame. Perhaps this could be more clearly documented, but it makes sense and avoids extra runtime checks. Pydantic also behaves this way (I checked). I don't, however, understand why with_pydantic=True changes this behaviour - surely the validation should always happen on entry/exit from a checked function?
The background is that I wanted to use the pandera type when constructing the object as a way of avoiding littering my code with # type: ignore[arg-type] every time I call a checked function.
My big concern is that any method inherited from pd.DataFrame that returns self (or a copy) is going to slice the object and remove the schema, and potentially produce misleading behaviour. This leads me to thinking that the pd.DataFrame should be a member of, rather than a superclass of, pandera.DataFrame. What do you think?
> I don't, however, understand why with_pydantic=True changes this behaviour
Yeah, not sure either, will need to investigate.
Btw, can you share a snippet of the code you'd typically write that gives you # type: ignore[arg-type], and how using pandera.typing.DataFrame gets rid of that error? (Are you using mypy?)
@cosmicBboy I've edited my example code to include the mypy directives
Thanks for the edited snippet, I'll play around with this over the next few days.
> My big concern is that any method inherited from pd.DataFrame that returns self (or a copy) is going to slice the object and remove the schema, and potentially produce misleading behaviour.
I think this is okay in the copying case... when a user calls a method on pat.DataFrame[Schema](...), I think the expectation is that this will likely cause the transformed data to no longer be valid under the schema (hence it actually is a pd.DataFrame and not a pat.DataFrame[Schema]). I think it's okay that pandera will then need to re-validate the data with potentially the same, or another, schema.
The issue here is with in-place mutation. When the user does df_with_schema.loc[2, "x"] = 3, pa.check_types still considers the data valid against the schema.
> This leads me to thinking that the pd.DataFrame should be a member of, rather than a superclass of, pandera.DataFrame.
Can you elaborate?
In any case, I think the simplest thing to do here is actually to have check_types always check incoming data. This may have been premature optimization on my part, as it's really up to the user whether they want to do:
df_with_schema = pat.DataFrame[Schema](data={"x": [0, 0, 0]})
... # random in-place stuff could happen to df_with_schema
# f is a check_types-guarded function; it should just revalidate the data
f(df_with_schema)
So instead of the surprising behavior you're seeing, the simplest solution is for check_types to always validate the data, with the user incurring the run-time cost.
As a sidenote, with_pydantic=True checks the data because it then uses pydantic's validation scheme (via the __get_validators__ yield method) rather than pandera's, which bypasses the logic we've been discussing here (of not re-validating what pandera thinks is valid data).
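For reference, a small sketch of that path, reusing Schema and df_with_schema from the original snippet; per the report above this catches mutated inputs, though return values are still affected:

```python
@pa.check_types(with_pydantic=True)
def g(df: pat.DataFrame[Schema]) -> None:
    print(df)

# now fails validation, since pydantic's hook re-runs the schema check
g(df_with_schema)
```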
> This leads me to thinking that the pd.DataFrame should be a member of, rather than a superclass of, pandera.DataFrame.
>
> Can you elaborate?
I mean implementing the validated DataFrame in terms of a "has-a" rather than an "is-a" relationship.
By subclassing pd.DataFrame you automatically inherit all its methods, but by making the pd.DataFrame a member you can selectively expose some or all of them. Might it be sufficient to provide a single method that returns the pd.DataFrame (and invalidates the pa.DataFrame)?
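A rough sketch of what that "has-a" design could look like. The names here (ValidatedFrame, to_pandas) are hypothetical, not pandera API; the schema comes from DataFrameModel.to_schema():

```python
import pandas as pd
import pandera as pa

class ValidatedFrame:
    """Holds a pd.DataFrame as a member instead of subclassing it."""

    def __init__(self, df: pd.DataFrame, schema: pa.DataFrameSchema):
        # validated exactly once, on construction; raises SchemaError if invalid
        self._df = schema.validate(df)

    def to_pandas(self) -> pd.DataFrame:
        # the single escape hatch: return a plain pd.DataFrame copy, so any
        # mutation happens outside the validated wrapper
        return self._df.copy()

# usage, reusing Schema from the original snippet:
# vf = ValidatedFrame(pd.DataFrame({"x": [0, 1]}), Schema.to_schema())
```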
I think the default behaviour of @check_types should be to always validate (see #1199, where I think the current behaviour is unintuitive). On the other hand, I think in specialized cases, the current behaviour (which basically assumes dataframes are immutable) is still useful to keep around.
An optional argument could be added to provide the current behaviour, which defaults to False. Perhaps @check_types(assume_immutable=True), although that argument name is not compelling to me.
Another option would be to add support for something like StaticFrame (https://github.com/static-frame/static-frame). These are not totally immutable, they are grow-only, i.e. you would only need to validate new columns and not the contents of those columns. I think this would be safer, although it would be more work, of course.
Hi! I recently stumbled across this strange effect you described while writing tests. After an hour of debugging I found out that pandera silently adds a pandera attribute to the pd.DataFrame object and then uses it as a caching indicator.
I managed to reproduce this phenomenon with the following code:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame

class PipeOut(pa.DataFrameModel):
    int_field: int = pa.Field()

@pa.check_types(lazy=True)
def pipeline() -> DataFrame[PipeOut]:
    return pd.DataFrame.from_dict({"int_field": [1]})

@pa.check_types(lazy=True)
def my_func(df: DataFrame[PipeOut]):
    pass

df = pipeline()
df["int_field"] = "abc"

# Does not fail validation
my_func(df)
And then I found this thread 😄
I believe such behaviour can be a surprise for those who wish to write tests in the manner of "let's break the dataframe and check that my function raises a validation error". Could you please elaborate on this gotcha in the documentation? It could save time for those writing tests. Thanks in advance!