
check_output : option to turn down checks if output is empty

Open ClaireGouze opened this issue 3 years ago • 10 comments

I'm using the check_output function to check the columns and datatypes of the DataFrameSchema. My function's output can sometimes be an empty dataframe, which then raises a SchemaError, though I would want no error in that case.

Would it be possible to have an option in the check_output function, or in the DataFrameSchema, so that no error is raised if the output is empty?

Thank you!

ClaireGouze avatar Nov 24 '20 20:11 ClaireGouze

thanks for submitting this feature request @ClaireGouze!

I think this use case should be supported, and here are a few potential solutions:

  1. add an allow_empty property to the DataFrameSchema and SeriesSchema initializers, such that empty dataframes can pass through without raising a SchemaError. This is nice because it would then cover the check_input case as well.
  2. add an optional option to the check_* decorators, resulting in the same behavior.

I'm leaning toward (1), mainly because (2) somewhat conflicts with the semantics of Optional[<TYPE>] in the typing module, which implies that the value can be either None or the <TYPE> specified. allow_empty, on the other hand, would hold a pandas-specific meaning, which is conceptually cleaner than overloading the "optional" terminology.

Let me know what you think!

cosmicBboy avatar Nov 25 '20 04:11 cosmicBboy

I'm using the check_output function to check column & datatypes of the DataFrameSchema

If you don't have explicit checks, i.e. just checking column names and types, you could set coerce=True. Obviously, whether that's acceptable depends on your project.

import pandera as pa
import pandas as pd


schema = pa.DataFrameSchema({"A": pa.Column(int)})


@pa.check_output(schema)
def make_empty() -> pd.DataFrame:
    return pd.DataFrame({"A": []})


try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_output decorator of function 'make_empty': expected series 'A' to have type int64, got float64

schema_coerced = pa.DataFrameSchema({"A": pa.Column(int)}, coerce=True)


@pa.check_output(schema_coerced)
def make_empty_coerced() -> pd.DataFrame:
    return pd.DataFrame({"A": []})


make_empty_coerced()  # ok
#> Empty DataFrame
#> Columns: [A]
#> Index: []

Created on 2020-11-25 by the reprexpy package

If the DataFrame is empty, we can only validate names and types. I think an argument allow_empty should still validate types. Pandera could offer a helper method DataFrameSchema.coerce_dtypes() to let the user coerce locally when the DataFrame is empty. That way, coerce can be kept False globally if that's desirable.

Regarding solution 2, one problem is that empty DataFrames would be allowed locally, but later validations could fail if optional=True was not set further down the pipeline. Moreover, DataFrameSchema.validate() would also need an optional argument if we want to keep a 1:1 mapping with the decorator functionality.

jeffzi avatar Nov 25 '20 10:11 jeffzi

I think an argument allow_empty should still validate types.

👍

cosmicBboy avatar Nov 25 '20 14:11 cosmicBboy

Thanks for your reply, I think solution (1) you mentioned would be suitable.

If you don't have explicit checks, i.e. just checking column names and types, you could set coerce=True. Obviously, whether that's acceptable depends on your project.

This would be a good solution but if the output is just an empty dataframe with no column name, it will still fail.

ClaireGouze avatar Nov 26 '20 08:11 ClaireGouze

What you are asking for is actually to completely disable validation.

I propose to introduce both arguments:

  1. Argument allow_empty for DataFrameSchema/SeriesSchema, which still checks names and types on empty DataFrames. Example use cases are dry runs or reading from a source that can be empty. The semantics: we processed the data successfully, but the output is empty.

  2. Argument optional for all check decorators, which disables validation when a None object is passed. That behavior would be aligned with typing.Optional. The semantics are slightly different from (1): it would signal that we could not process the DataFrame, but that's within expectations, so we do not want to raise an error.

SchemaModel, coupled with the decorator check_types, already implements (2):

import pandera as pa
from pandera.typing import Series, DataFrame
import pandas as pd
from typing import Optional


class Schema(pa.SchemaModel):
    A: Series[int]


@pa.check_types()
def make_empty() -> Optional[DataFrame[Schema]]:
    return pd.DataFrame()


try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_types decorator of function 'make_empty': column 'A' not in dataframe
#> Empty DataFrame
#> Columns: []
#> Index: []


@pa.check_types()
def maybe_df() -> Optional[DataFrame[Schema]]:
    return None


maybe_df() # ok

Created on 2020-11-26 by the reprexpy package

jeffzi avatar Nov 26 '20 11:11 jeffzi

I think the allow_empty option at the schema-level and optional option for object-based API check_* decorators makes sense.

For the latter, I'm thinking something like this:

import pandas as pd
import pandera as pa

from typing import Optional


schema = pa.DataFrameSchema({
    "col": pa.Column(int)
})


@pa.check_input(schema, optional=True)
def check_input_transform(df):  # or None
    return df


@pa.check_output(schema, optional=True)
def check_output_transform(df):
    return df  # or None


@pa.check_io(df=schema, out=schema, optional={"df": True, "out": True})
def check_io_transform(df):
    return df  # or None


@pa.check_io(
    df=schema, out=(1, schema), optional={"df": True, "out": {1: True}}
)
def check_tuple_output_transform(df):  # or None
    return "foo", df  # or None


@pa.check_io(
    df=schema, out=("bar", schema), optional={"df": True, "out": {"bar": True}}
)
def check_mapping_output_transform(df):  # or None
    return {
        "foo": 1,
        "bar": df,  # or None
    }

This would be a good solution but if the output is just an empty dataframe with no column name, it will still fail.

@ClaireGouze can you provide example code for your use case? I'm trying to wrap my head around the case where a function returns an empty dataframe with no columns, in which case my intuition is that the function should return None instead of pd.DataFrame()

cosmicBboy avatar Nov 26 '20 15:11 cosmicBboy

Going to work on this after the 0.6.0 release, which should be out by next week.

cosmicBboy avatar Dec 03 '20 01:12 cosmicBboy

What's the status of this issue? At my work, we have a data manipulation function that returns a dataframe which should follow a schema, and we use check_types to validate the dataframe against the schema. However, the validator fails when the dataframe is empty, even though an empty dataframe is a valid output from the function: a column that's typically typed as float gets the pandas dtype object when the dataframe is empty. We can work around this in the short term by coercing the type on that column, but this will continue to cause issues for us going forward.

ndepaola avatar Mar 01 '23 05:03 ndepaola

+1 on this; I'm also facing this issue when empty dataframes are being used. Is the suggested solution in the current version of pandera to use the required keyword and set it to False for all columns? https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#required

einarjohnson avatar Jan 16 '24 07:01 einarjohnson