pandera When mixing `drop_invalid_rows` on `DataFrameSchema` and `Column` level we get a non intuitive behavior

Open jherrmannNetfonds opened this issue 1 year ago • 2 comments

Describe the bug

When mixing `drop_invalid_rows` at the `DataFrameSchema` and `Column` levels, we get non-intuitive behavior.

If you set drop_invalid_rows=True as a DataFrameSchema parameter and do not set drop_invalid_rows on any column, all rows that fail validation are dropped. This works as expected.
If you set drop_invalid_rows=True as a Column parameter but not as a DataFrameSchema parameter, rows that fail validation are not dropped and no error is raised. See Listing [1].
If you set drop_invalid_rows=True on the DataFrameSchema and also set it per Column, rows that fail in columns with drop_invalid_rows=True are not dropped and no error is raised, while rows that fail in columns with drop_invalid_rows=False are dropped. See Listing [2].

If this behavior is intended, we should document it; otherwise, see the expected results below.

Code Sample

Listing [1]

import pandas as pd
from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame(
    {
        "counter": [1, 2, 3, 4],
        "text": ["abc", "def", "ghi", None],
    }
)
schema = DataFrameSchema(
    {
        "counter": Column(
            int,
            checks=[Check(lambda x: x >= 3)],
            drop_invalid_rows=True,
        ),
        "text": Column(
            str,
            nullable=False,
            drop_invalid_rows=True,
        ),
    },
)

schema.validate(df, lazy=True)

counter	text
1	abc
2	def
3	ghi
4	None

Listing [2]

import pandas as pd
from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame(
    {
        "counter": [1, 2, 3, 4],
        "text": ["abc", "def", "ghi", None],
    }
)
schema = DataFrameSchema(
    {
        "counter": Column(
            int,
            checks=[Check(lambda x: x >= 3)],
            drop_invalid_rows=True,
        ),
        "text": Column(
            str,
            nullable=False,
            drop_invalid_rows=False,
        ),
    },
    drop_invalid_rows=True,
)

schema.validate(df, lazy=True)

counter	text
1	abc
2	def
3	ghi

Expected behavior

For Listing [1], I would expect rows that fail in columns with drop_invalid_rows=True to be dropped, or to get a warning that I have to set drop_invalid_rows=True as a DataFrameSchema parameter.

For Listing [2], I would expect rows that fail in columns with drop_invalid_rows=True to be dropped, and failures in the other columns to raise an error.
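
To make the expected semantics concrete, here is a small toy model in plain Python. It is not pandera code and says nothing about how pandera is implemented: `ColumnRule` and `validate_rows` are hypothetical names, and the rule that any failure in a non-dropping column raises (even when the same row would also be dropped) is an assumption, since the issue does not spell out that case.

```python
# Toy model of the behavior requested above. This is NOT pandera code:
# ColumnRule and validate_rows are hypothetical names used for illustration.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ColumnRule:
    name: str
    is_valid: Callable[[Any], bool]
    drop_invalid_rows: bool = False


def validate_rows(rows, rules):
    """Return the rows that survive; raise if a non-dropping column fails."""
    to_drop = set()
    errors = []
    for index, row in enumerate(rows):
        for rule in rules:
            value = row[rule.name]
            if rule.is_valid(value):
                continue
            if rule.drop_invalid_rows:
                to_drop.add(index)  # failure in a "drop" column: remove the row
            else:
                errors.append((index, rule.name, value))  # must be reported
    if errors:
        # Assumption: any failure in a column without drop_invalid_rows raises,
        # even if the same row would also be dropped by another column.
        raise ValueError(f"validation failed: {errors}")
    return [row for index, row in enumerate(rows) if index not in to_drop]


def counter_ok(value):
    return value >= 3


def text_ok(value):
    return isinstance(value, str)


# Same data as in the listings above.
rows = [
    {"counter": 1, "text": "abc"},
    {"counter": 2, "text": "def"},
    {"counter": 3, "text": "ghi"},
    {"counter": 4, "text": None},
]

# Listing [1]: both columns ask for invalid rows to be dropped.
listing_1_rules = [
    ColumnRule("counter", counter_ok, drop_invalid_rows=True),
    ColumnRule("text", text_ok, drop_invalid_rows=True),
]
kept = validate_rows(rows, listing_1_rules)
print(kept)  # [{'counter': 3, 'text': 'ghi'}]

# Listing [2]: "counter" drops invalid rows, "text" must raise instead.
listing_2_rules = [
    ColumnRule("counter", counter_ok, drop_invalid_rows=True),
    ColumnRule("text", text_ok, drop_invalid_rows=False),
]
try:
    validate_rows(rows, listing_2_rules)
except ValueError as exc:
    print(exc)  # validation failed: [(3, 'text', None)]
```

Under this model, Listing [1] keeps only the row that passes both columns, and Listing [2] raises because the null value sits in a column that did not opt in to dropping.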

Desktop

OS: Windows 10
Python 3.10.6
Pandera: 0.20.1

Jul 11 '24 10:07 jherrmannNetfonds

Thanks for reporting this @jherrmannNetfonds. This is definitely a bug; the two cases you listed should work as you expect. Will look into this.

Jul 17 '24 16:07 cosmicBboy

Hi, first of all, thank you for pandera 🙏 ❤.

We are hitting this issue too, and I was wondering whether you plan on working on it or would rather welcome a PR? After a quick glance, it seems the bug is spread across a few places: https://github.com/search?q=repo%3Aunionai-oss%2Fpandera%20self.drop_invalid_rows(&type=code

Nov 25 '24 14:11 Timost
