pandera
strict=True is not very strict on the index
Hi, first of all thanks for this great library!
Today I found out that one of my dataframes was not validated as thoroughly as I expected, even though I was using strict=True. This is because strict=True does not imply any kind of check on the index.
Here is an example:
import pandas as pd
import pandera as pa

class FooModel(pa.DataFrameModel):
    a: pa.typing.Series[int]

    class Config:
        strict = True
As a user, when I run FooModel.validate(df) with strict=True, I would expect any error or missing aspect in FooModel to raise an exception. Conversely, if I do not see any exception, I conclude that my FooModel is correct.
Yet,
df = pd.DataFrame(index=["hello"], data={"a": [1]})
df.index.name = "foo"
FooModel.validate(df)
does not raise any error. In my opinion this somewhat breaks the semantics of strict=True, because it leaves room for flexibility in the dataframe being validated: in this example, the non-None name on the index of df and the fact that the index has dtype object. Do you agree?
I would suggest changing strict=True to do the following: when the schema does not contain any specification about the index, validate that the index is the default pandas index (a RangeIndex with no name).
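To make the proposed rule concrete, here is a minimal sketch of what "default pandas index" could mean in plain pandas. The helper name has_default_index is hypothetical and not part of pandera, and requiring start 0 and step 1 is an assumption on top of "a RangeIndex with no name".

```python
import pandas as pd


def has_default_index(df: pd.DataFrame) -> bool:
    """Return True if df has an unnamed RangeIndex starting at 0 with step 1.

    Hypothetical helper, not a pandera API. The start/step conditions are an
    assumption about what "default index" should mean.
    """
    idx = df.index
    return (
        isinstance(idx, pd.RangeIndex)
        and idx.name is None
        and idx.start == 0
        and idx.step == 1
    )


# A dataframe built without an explicit index gets the default RangeIndex.
default_df = pd.DataFrame({"a": [1]})

# The dataframe from the example above: object-dtype index with a name.
custom_df = pd.DataFrame(index=["hello"], data={"a": [1]})
custom_df.index.name = "foo"

# Naming an otherwise default index also makes it non-default under this rule.
named_df = pd.DataFrame({"a": [1]})
named_df.index.name = "foo"
```

Until something like this exists in pandera, such a check could be run by hand before or after FooModel.validate(df).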
I noticed this too. Regardless of this flag, Pandera raises a column_in_dataframe check error only if a non-nullable column is missing. However, a column missing altogether is a different issue, separate from nullability, and of a different severity.
Currently, strict only operates on columns: https://github.com/unionai-oss/pandera/blob/main/pandera/backends/pandas/container.py#L480C9-L531
> I would suggest changing strict=True to do the following: when the schema does not contain any specification about the index, validate that the index is the default pandas index (a RangeIndex with no name).
Feel free to open up a PR for this! @smarie @daniel-ene-heni