
Add section about `ignore_na` in the Checks user guide

cosmicBboy opened this issue 3 years ago · 3 comments

Location of the documentation

https://pandera.readthedocs.io/en/stable/checks.html

Documentation problem

An important piece of functionality is that `ignore_na=True` in a `Check` will ~~drop~~ edit: ignore elements with nulls (for column checks) and rows with any nulls (for dataframe checks).

Suggested fix for documentation

Add documentation for this in the user guide so that it's highlighted to users.
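
For example, the new section could anchor on a small example like this (a minimal sketch assuming the behavior described above; the data and column names are made up):

import numpy as np
import pandas as pd
import pandera as pa

# toy data: column "a" contains a null, column "b" does not
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [2.0, 5.0, 6.0],
})

schema = pa.DataFrameSchema(
    columns={
        # column check: with ignore_na=True (the default), the null
        # element in "a" is excluded before the check runs
        "a": pa.Column(float, nullable=True, checks=pa.Check.gt(0)),
        "b": pa.Column(float),
    },
    # dataframe check: the row containing the null in "a" is excluded
    # entirely before the check runs
    checks=pa.Check(lambda d: d["b"] > d["a"]),
)

schema(df)  # passes: nulls are allowed by nullable=True and ignored by the checks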

cosmicBboy · Oct 30 '20, 13:10

I'll work on this.

KyleRConway · May 15 '21, 11:05

My current understanding

Trying to replicate the functionality, I feel like I must be misunderstanding something. My core understanding is the following:

  1. For a pandera `Check`, `ignore_na=True` is the default (though you can still pass it explicitly with `Check(..., ignore_na=True)`)
  2. "By default, Pandera **drops null values** before passing the objects to validate into the check function." (bolding mine)
  3. Adding a null value to a dataframe and running a check should therefore not result in an error from the presence of that null value, because it should be dropped before being passed to the check (see the sketch just below this list)
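
To make (3) concrete, this is roughly the kind of minimal case I expected to pass (toy example, names made up):

import numpy as np
import pandas as pd
import pandera as pa

s = pd.Series([1.0, np.nan, 3.0])
# my expectation: the NaN is dropped before the check sees the data
schema = pa.SeriesSchema(float, checks=pa.Check(lambda x: x > 0, ignore_na=True))
schema(s)  # I expected this to pass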

The issue I'm having

When I add a null value to the example pandera "Quick Start" dataframe from the project README, I get the following error:

SchemaError: non-nullable series 'column1' contains null values: {0: nan}

This happens whether I hardcode a None into a column, or import numpy as np and run the following before the # define schema section; either way I get an error on validated_df = schema(df):

# Replace some elements with NaN values.
df = df.replace(df.iloc[0][0:], np.NaN)

In both cases I've verified that the dataframe we're passing to schema(df) does drop the null values when I manually run dropna() on a specific column (e.g. df['column1'].dropna()).

The code I ran

The above is a good shorthand for what I ran, but for reference here's the messier version, in two sections (run in JupyterLab), that was the last failure point before this comment:

import pandas as pd
import pandera as pa
import numpy as np


# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})

print(f"Original Dataframe:\n\n {df}")

# Replace some elements with NaN values.
df = df.replace(df.iloc[0][0:], np.NaN)
print(df)
df['column1'].dropna()

and finally, the code that errors: [Note: I've explicitly set ignore_na=True in the checks below, but the same failure occurs without it]

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10,ignore_na=True)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2,ignore_na=True)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_",ignore_na=True),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2, ignore_na=True)
    ]),
})

validated_df = schema(df)
print(validated_df)

Here's the entire error:

---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
<ipython-input-71-0a24200d3318> in <module>
     11 })
     12 
---> 13 validated_df = schema(df)
     14 print(validated_df)

~/.local/lib/python3.9/site-packages/pandera/schemas.py in __call__(self, dataframe, head, tail, sample, random_state, lazy, inplace)
    644             otherwise creates a copy of the data.
    645         """
--> 646         return self.validate(
    647             dataframe, head, tail, sample, random_state, lazy, inplace
    648         )

~/.local/lib/python3.9/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    590                 check_results.append(isinstance(result, pd.DataFrame))
    591             except errors.SchemaError as err:
--> 592                 error_handler.collect_error("schema_component_check", err)
    593             except errors.SchemaErrors as err:
    594                 for schema_error_dict in err.schema_errors:

~/.local/lib/python3.9/site-packages/pandera/error_handlers.py in collect_error(self, reason_code, schema_error, original_exc)
     30         """
     31         if not self._lazy:
---> 32             raise schema_error from original_exc
     33 
     34         # delete data of validated object from SchemaError object to prevent

~/.local/lib/python3.9/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    582         for schema_component in schema_components:
    583             try:
--> 584                 result = schema_component(
    585                     df_to_validate,
    586                     lazy=lazy if schema_component.has_subcomponents else None,

~/.local/lib/python3.9/site-packages/pandera/schemas.py in __call__(self, check_obj, head, tail, sample, random_state, lazy, inplace)
   1883     ) -> Union[pd.DataFrame, pd.Series]:
   1884         """Alias for ``validate`` method."""
-> 1885         return self.validate(
   1886             check_obj, head, tail, sample, random_state, lazy, inplace
   1887         )

~/.local/lib/python3.9/site-packages/pandera/schema_components.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    209                     )
    210             else:
--> 211                 validate_column(check_obj, column_name)
    212 
    213         return check_obj

~/.local/lib/python3.9/site-packages/pandera/schema_components.py in validate_column(check_obj, column_name)
    182 
    183         def validate_column(check_obj, column_name):
--> 184             super(Column, copy(self).set_name(column_name)).validate(
    185                 check_obj,
    186                 head,

~/.local/lib/python3.9/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
   1773                     series[nulls].head(constants.N_FAILURE_CASES).to_dict(),
   1774                 )
-> 1775                 error_handler.collect_error(
   1776                     "series_contains_nulls",
   1777                     errors.SchemaError(

~/.local/lib/python3.9/site-packages/pandera/error_handlers.py in collect_error(self, reason_code, schema_error, original_exc)
     30         """
     31         if not self._lazy:
---> 32             raise schema_error from original_exc
     33 
     34         # delete data of validated object from SchemaError object to prevent

SchemaError: non-nullable series 'column1' contains null values: {0: nan}

KyleRConway · May 15 '21, 12:05

hey @KyleRConway thanks for taking a closer look at this!

This issue definitely needs some clarification/edits. The behavior you're running into is pretty much the motivation for improving the docs to explain it better.

In a nutshell, there are two options related to null values in pandera:

  1. `Column`'s `nullable` argument
  2. `Check`'s `ignore_na` argument

The error you're seeing, `SchemaError: non-nullable series 'column1' contains null values: {0: nan}`, is due to (1), since `nullable=False` by default.

So the schema that would pass with the modifications you provided would be:

schema = pa.DataFrameSchema({
    "column1": pa.Column(int, nullable=True, checks=pa.Check.le(10,ignore_na=True)),
    "column2": pa.Column(float, nullable=True, checks=pa.Check.lt(-1.2,ignore_na=True)),
    "column3": pa.Column(str, nullable=True, checks=[
        pa.Check.str_startswith("value_",ignore_na=True),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2, ignore_na=True)
    ]),
})

The reason there are two independent mechanisms here is to support the following use cases:

  • I want to assert that a column is not nullable: Column(nullable=False) (default)
  • I want to assert that a column can be nullable: Column(nullable=True)
  • I want to check the properties of non-null entries in a nullable column: Column(nullable=True, checks=pa.Check(lambda s: s > 0, ignore_na=True))
  • I want to check the properties of null entries in a nullable column: Column(nullable=True, checks=pa.Check(lambda s: s.isna().mean() < 0.1, ignore_na=False)) (note ignore_na=False here, so the nulls are actually passed through to the check)

In the last two cases you could do something like

  • Column(nullable=True, checks=pa.Check(lambda s: s.dropna() > 0))
  • Column(nullable=True, checks=pa.Check(lambda s: s[s.isna()] < 0.1))

But that would prevent pandera from providing a granular error report because the index of the boolean Series output of the check function needs to align with the index of the original validated Series in order to report on where exactly the checks failed.
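
To illustrate with toy data (names made up), an element-wise check plus ignore_na keeps that alignment, so the failure cases come back with the original index:

import numpy as np
import pandas as pd
import pandera as pa

df = pd.DataFrame({"x": [1.0, np.nan, -3.0]})

schema = pa.DataFrameSchema({
    "x": pa.Column(
        float,
        nullable=True,
        # the element-wise boolean output keeps the original index, so
        # the error report can point at exactly which rows failed
        checks=pa.Check(lambda s: s > 0, ignore_na=True),
    ),
})

try:
    schema(df)
except pa.errors.SchemaError as err:
    print(err.failure_cases)  # reports the failing value at index 2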

One potential improvement that would make this behavior more intuitive would be to infer nullable=True whenever at least one of a column's checks has ignore_na=True, which I'd totally be on board with. Let me know what you think and we can open another issue with a more detailed description + implementation plan.
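
Roughly, the rule would be something like this (a hypothetical helper sketching the idea, not pandera's actual internals):

import pandera as pa

def infer_nullable(column: pa.Column) -> bool:
    # treat a column as nullable if the user set it explicitly, or if
    # any of its checks opted into ignoring nulls
    return column.nullable or any(check.ignore_na for check in column.checks)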

For the scope of this issue, though, I think it's important to document the current behavior: the docs at https://pandera.readthedocs.io/en/stable/checks.html#handling-null-values need to be updated to say that null entries are ignored rather than dropped.

cosmicBboy · May 16 '21, 19:05