pandera
Mixing `drop_invalid_rows` at the `DataFrameSchema` and `Column` levels produces non-intuitive behavior
Describe the bug
When `drop_invalid_rows` is set at both the `DataFrameSchema` and the `Column` level, the resulting behavior is non-intuitive.
- If you set `drop_invalid_rows` as a `DataFrameSchema` parameter and do not set it on any column, all rows that fail validation are dropped. This works as expected.
- If you set `drop_invalid_rows` as a `Column` parameter but not as a `DataFrameSchema` parameter, rows that fail validation are not dropped and no error is raised. See Listing [1].
- If you set `drop_invalid_rows=True` on both the `DataFrameSchema` and a `Column`, rows that fail on columns with `drop_invalid_rows=True` are not dropped and no error is raised, while rows that fail on columns with `drop_invalid_rows=False` are dropped. See Listing [2].
If this behavior is intended, we should document it; otherwise, see the expected behavior below.
Code Sample
Listing [1]
```python
import pandas as pd
from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame(
    {
        "counter": [1, 2, 3, 4],
        "text": ["abc", "def", "ghi", None],
    }
)

# drop_invalid_rows is set on each Column only, not on the DataFrameSchema.
schema = DataFrameSchema(
    {
        "counter": Column(
            int,
            checks=[Check(lambda x: x >= 3)],
            drop_invalid_rows=True,
        ),
        "text": Column(
            str,
            nullable=False,
            drop_invalid_rows=True,
        ),
    },
)

schema.validate(df, lazy=True)
```
| counter | text |
|---|---|
| 1 | abc |
| 2 | def |
| 3 | ghi |
| 4 | None |
Listing [2]
```python
import pandas as pd
from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame(
    {
        "counter": [1, 2, 3, 4],
        "text": ["abc", "def", "ghi", None],
    }
)

# drop_invalid_rows is set on the DataFrameSchema and, with mixed values,
# on the individual columns.
schema = DataFrameSchema(
    {
        "counter": Column(
            int,
            checks=[Check(lambda x: x >= 3)],
            drop_invalid_rows=True,
        ),
        "text": Column(
            str,
            nullable=False,
            drop_invalid_rows=False,
        ),
    },
    drop_invalid_rows=True,
)

schema.validate(df, lazy=True)
```
| counter | text |
|---|---|
| 1 | abc |
| 2 | def |
| 3 | ghi |
Expected behavior
For Listing [1], I would expect rows that fail a check on a column with `drop_invalid_rows=True` to be dropped, or to get a warning that `drop_invalid_rows=True` must also be set as a `DataFrameSchema` parameter.
For Listing [2], I would expect rows that fail on columns with `drop_invalid_rows=True` to be dropped, and failures on the other columns to raise an error.
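To make the expected semantics concrete, here is a minimal pandas-only sketch. It is not pandera's implementation, and `validate_with_per_column_drop` is a hypothetical helper invented for illustration. Each column gets a check and its own `drop_invalid_rows` flag: failing rows are dropped when the flag is `True` and reported as an error otherwise. The sketch evaluates every check on the full frame before dropping anything, which is an assumption about ordering that the issue does not specify.

```python
import pandas as pd


def validate_with_per_column_drop(df, rules):
    """Hypothetical helper, NOT part of pandera.

    `rules` maps a column name to (check, drop_invalid_rows), where `check`
    takes a Series and returns a boolean Series (True means the row is valid).
    """
    keep = pd.Series(True, index=df.index)
    errors = {}
    for name, (check, drop_invalid_rows) in rules.items():
        valid = check(df[name]).astype(bool)
        if drop_invalid_rows:
            # Rows failing this column's check are removed from the result.
            keep &= valid
        elif not valid.all():
            # Rows failing this column's check are reported, not dropped.
            errors[name] = valid[~valid].index.tolist()
    if errors:
        raise ValueError(f"invalid rows in non-dropping columns: {errors}")
    return df[keep]


df = pd.DataFrame(
    {
        "counter": [1, 2, 3, 4],
        "text": ["abc", "def", "ghi", None],
    }
)

# Listing [1] semantics: both columns drop their own invalid rows.
listing_1 = validate_with_per_column_drop(
    df,
    {
        "counter": (lambda s: s >= 3, True),
        "text": (lambda s: s.notna(), True),
    },
)
print(listing_1)

# Listing [2] semantics: "text" does not drop, so its null row is an error.
listing_2_error = None
try:
    validate_with_per_column_drop(
        df,
        {
            "counter": (lambda s: s >= 3, True),
            "text": (lambda s: s.notna(), False),
        },
    )
except ValueError as exc:
    listing_2_error = str(exc)
print(listing_2_error)
```

Under these assumptions, the Listing [1] configuration keeps only the row with `counter == 3`, and the Listing [2] configuration raises because the null in `text` sits in a column that does not drop invalid rows.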
Desktop
- OS: Windows 10
- Python 3.10.6
- Pandera: 0.20.1
Thanks for reporting this @jherrmannNetfonds. This is definitely a bug; the two cases you listed should work as you expect. Will look into this.
Hi, first of all, thank you for pandera 🙏 ❤.
We are hitting this issue too, and I was wondering whether you plan to work on it or would rather welcome a PR? After a quick glance, it seems the bug is spread across a few places: https://github.com/search?q=repo%3Aunionai-oss%2Fpandera%20self.drop_invalid_rows(&type=code