Dask DataFrame filter fails

Open gg314 opened this issue 3 years ago • 0 comments

Describe the bug

Dask DataFrames validated with strict='filter' do not drop extraneous columns.

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandera.

Code Sample

import pandas as pd
import pandera as pa
import dask.dataframe as dd

df1 = pd.DataFrame([[1, 1], [3, 2], [5, 3]], columns=["col1", "col2"])
df2 = dd.from_pandas(df1, npartitions=1)

my_schema = pa.DataFrameSchema(
    {
        "col2": pa.Column(int),
    },
    strict="filter",
)

new_df1 = my_schema(df1)  # pandas: col1 is dropped, only col2 remains
new_df2 = my_schema(df2)  # Dask: col1 is not dropped, both columns remain

Expected behavior

Both DataFrames should be filtered so that col2 remains and col1 is dropped. The validated pandas DataFrame new_df1 behaves as expected, but the validated Dask DataFrame new_df2 retains both columns.
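One possible workaround, sketched here under assumptions, is to select the schema's columns explicitly after validation. `keep_schema_columns` is a hypothetical helper and not part of pandera. It relies only on `.columns` and list-based column selection, which pandas and Dask DataFrames both support. A plain pandas frame stands in for the validated Dask frame so the snippet runs without Dask installed.

```python
import pandas as pd


def keep_schema_columns(df, schema_columns):
    """Return df restricted to the columns named in schema_columns.

    Hypothetical helper, not part of pandera. It should work on any frame
    that exposes `.columns` and supports list-based column selection.
    """
    present = set(df.columns)
    # Preserve the schema's column order and skip names missing from the frame.
    wanted = [name for name in schema_columns if name in present]
    return df[wanted]


# Stand-in for a validated frame that still carries the extra column.
validated = pd.DataFrame([[1, 1], [3, 2], [5, 3]], columns=["col1", "col2"])

# With pandera, the names could come from `my_schema.columns`,
# which is a dict keyed by column name.
filtered = keep_schema_columns(validated, ["col2"])
print(list(filtered.columns))
```

With the objects from the code sample, the call would presumably look like `keep_schema_columns(my_schema(df2), my_schema.columns)`. This only trims the columns after validation; it does not change how pandera handles strict="filter" for Dask.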

Additional context

Apologies if this falls under the wider net of #119. I am interpreting that issue as pertaining to more complex memory management problems. Thanks for your help.

Jun 01 '22 14:06 gg314