pandera
Dask DataFrame filter fails
Describe the bug
Dask DataFrames validated against a schema with strict="filter" keep columns that are not declared in the schema instead of dropping them.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
Code Sample
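import pandas as pd
import pandera as pa
import dask.dataframe as dd

# Same data as a pandas DataFrame and as a single-partition Dask DataFrame
df1 = pd.DataFrame([[1, 1], [3, 2], [5, 3]], columns=["col1", "col2"])
df2 = dd.from_pandas(df1, npartitions=1)

# strict="filter" should drop any column not declared in the schema
my_schema = pa.DataFrameSchema(
    {
        "col2": pa.Column(int),
    },
    strict="filter",
)

new_df1 = my_schema(df1)  # pandas: only col2 remains (expected)
new_df2 = my_schema(df2)  # Dask: col1 and col2 both remain (the bug)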
Expected behavior
Both DataFrames should be filtered so that col2 remains and col1 is dropped. The validated pandas DataFrame new_df1 behaves as expected, with only col2 left. However, the validated Dask DataFrame new_df2 retains both col1 and col2.
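A minimal sketch of how the mismatch could be checked, and worked around by hand, using only column names. The helpers `unexpected_columns` and `columns_to_keep` are hypothetical and not part of pandera. The sketch assumes `my_schema.columns` is a mapping keyed by column name, represented here by a plain dict stand-in.

```python
def unexpected_columns(frame_columns, schema_columns):
    """Columns present in the frame but not declared in the schema (frame order)."""
    declared = set(schema_columns)
    return [name for name in frame_columns if name not in declared]


def columns_to_keep(frame_columns, schema_columns):
    """Declared columns that are actually present in the frame (frame order)."""
    declared = set(schema_columns)
    return [name for name in frame_columns if name in declared]


# Column names from the report: the frame has col1 and col2, the schema only col2.
frame_cols = ["col1", "col2"]
schema_cols = {"col2": None}  # stand-in for my_schema.columns (assumed dict keyed by name)

leftover = unexpected_columns(frame_cols, schema_cols)  # columns strict="filter" should drop
keep = columns_to_keep(frame_cols, schema_cols)         # columns that should survive
```

With the objects from the code sample, `new_df2[columns_to_keep(new_df2.columns, my_schema.columns)]` should give a Dask DataFrame with only col2, since both pandas and Dask accept a list of names for column selection. This is a stopgap, not a substitute for the schema doing the filtering itself.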
Additional context
Apologies if this falls under the wider net of #119. I am interpreting that issue as pertaining to more complex memory management problems. Thanks for your help.