pandera
Add section about `ignore_na` in the Checks user guide
Location of the documentation
https://pandera.readthedocs.io/en/stable/checks.html
Documentation problem
An important piece of functionality is that `ignore_na=True` in a Check will ~~drop~~ (edit: ignore) elements with nulls (for columns) and rows with any nulls (for dataframe checks).
Suggested fix for documentation
Add documentation for this in the user guide so that it's highlighted to users.
I'll work on this.
My current understanding
Trying to replicate the functionality I feel like I must be misunderstanding something. My core understanding is the following:
- For a Pandera `Check`, `ignore_na=True` by default (but you can specify it anyway with `Check(..., ignore_na=True)`).
- "By default, Pandera drops null values before passing the objects to validate into the check function." (bolding mine)
- Adding a null value to a dataframe and running a check should not result in an error based on the presence of a null value because it should be dropped before passing for the check.
The issue I'm having
When adding a null value to the example Pandera "Quick Start" dataframe from the README for the project I get the following error:
SchemaError: non-nullable series 'column1' contains null values: {0: nan}
This happens whether I hardcode a `None` into a column or `import numpy as np` and run the following before the `# define schema` section. In both cases the error is raised on `validated_df = schema(df)`:
# Replace some elements with NaN values.
df = df.replace(df.iloc[0][0:], np.nan)
In both cases I've verified that the null values in the dataframe we're passing to `schema(df)` are correctly dropped when manually running `dropna()` on a specific column (e.g. `df['column1'].dropna()`).
The code I ran
The above is a good shorthand for what I ran, but here's a messy version in two sections (I was running this in JupyterLab) that was the last failure point before this comment for reference:
import pandas as pd
import pandera as pa
import numpy as np
# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})
print(f"Original Dataframe:\n\n {df}")
# Replace some elements with NaN values.
df = df.replace(df.iloc[0][0:], np.nan)
print(df)
df['column1'].dropna()
And finally, the code that raises the error:
[Note: I've manually forced `ignore_na=True` in the checks below, but the same failure results without it.]
# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10, ignore_na=True)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2, ignore_na=True)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_", ignore_na=True),
        # define custom checks as functions that take a series as input and
        # output a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2, ignore_na=True)
    ]),
})
validated_df = schema(df)
print(validated_df)
Here's the entire error:
---------------------------------------------------------------------------
SchemaError Traceback (most recent call last)
<ipython-input-71-0a24200d3318> in <module>
11 })
12
---> 13 validated_df = schema(df)
14 print(validated_df)
~/.local/lib/python3.9/site-packages/pandera/schemas.py in __call__(self, dataframe, head, tail, sample, random_state, lazy, inplace)
644 otherwise creates a copy of the data.
645 """
--> 646 return self.validate(
647 dataframe, head, tail, sample, random_state, lazy, inplace
648 )
~/.local/lib/python3.9/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
590 check_results.append(isinstance(result, pd.DataFrame))
591 except errors.SchemaError as err:
--> 592 error_handler.collect_error("schema_component_check", err)
593 except errors.SchemaErrors as err:
594 for schema_error_dict in err.schema_errors:
~/.local/lib/python3.9/site-packages/pandera/error_handlers.py in collect_error(self, reason_code, schema_error, original_exc)
30 """
31 if not self._lazy:
---> 32 raise schema_error from original_exc
33
34 # delete data of validated object from SchemaError object to prevent
~/.local/lib/python3.9/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
582 for schema_component in schema_components:
583 try:
--> 584 result = schema_component(
585 df_to_validate,
586 lazy=lazy if schema_component.has_subcomponents else None,
~/.local/lib/python3.9/site-packages/pandera/schemas.py in __call__(self, check_obj, head, tail, sample, random_state, lazy, inplace)
1883 ) -> Union[pd.DataFrame, pd.Series]:
1884 """Alias for ``validate`` method."""
-> 1885 return self.validate(
1886 check_obj, head, tail, sample, random_state, lazy, inplace
1887 )
~/.local/lib/python3.9/site-packages/pandera/schema_components.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
209 )
210 else:
--> 211 validate_column(check_obj, column_name)
212
213 return check_obj
~/.local/lib/python3.9/site-packages/pandera/schema_components.py in validate_column(check_obj, column_name)
182
183 def validate_column(check_obj, column_name):
--> 184 super(Column, copy(self).set_name(column_name)).validate(
185 check_obj,
186 head,
~/.local/lib/python3.9/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
1773 series[nulls].head(constants.N_FAILURE_CASES).to_dict(),
1774 )
-> 1775 error_handler.collect_error(
1776 "series_contains_nulls",
1777 errors.SchemaError(
~/.local/lib/python3.9/site-packages/pandera/error_handlers.py in collect_error(self, reason_code, schema_error, original_exc)
30 """
31 if not self._lazy:
---> 32 raise schema_error from original_exc
33
34 # delete data of validated object from SchemaError object to prevent
SchemaError: non-nullable series 'column1' contains null values: {0: nan}
hey @KyleRConway thanks for taking a closer look at this!
This issue definitely needs some clarification/edits. The issue you're experiencing is pretty much the motivation behind improving the docs to explain this behavior better.
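In a nutshell, there are two options related to null values in pandera:
1. `nullable` on a `Column`: whether the column is allowed to contain null values at all (`nullable=False` by default)
2. `ignore_na` on a `Check`: whether null values are ignored when the check is evaluated
The error you're seeing, `SchemaError: non-nullable series 'column1' contains null values: {0: nan}`, is because of (1), since `nullable=False` by default.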
So the schema that would pass with the modifications that you provided would be:
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, nullable=True, checks=pa.Check.le(10, ignore_na=True)),
    "column2": pa.Column(float, nullable=True, checks=pa.Check.lt(-1.2, ignore_na=True)),
    "column3": pa.Column(str, nullable=True, checks=[
        pa.Check.str_startswith("value_", ignore_na=True),
        # define custom checks as functions that take a series as input and
        # output a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2, ignore_na=True)
    ]),
})
The reason there are two independent mechanisms here is to support the following use cases:
- I want to assert that a column is not nullable: `Column(nullable=False)` (default)
- I want to assert that a column can be nullable: `Column(nullable=True)`
- I want to check the properties of non-null entries in a nullable column: `Column(nullable=True, checks=pa.Check(lambda s: s > 0, ignore_na=True))`
- I want to check the properties of null entries in a nullable column: `Column(nullable=True, checks=pa.Check(lambda s: s.isna().mean() < 0.1, ignore_na=True))`
In the last two cases you could do something like
- `Column(nullable=True, checks=pa.Check(lambda s: s.dropna() > 0))`
- `Column(nullable=True, checks=pa.Check(lambda s: s[s.isna()] < 0.1))`
But that would prevent pandera from providing a granular error report, because the index of the boolean Series output of the check function needs to align with the index of the original validated Series in order to report on where exactly the checks failed.
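The alignment point can be seen with plain pandas. The sketch below does not show pandera's internals; it only compares the index of a check output computed after `dropna()` with one computed on the full Series. The expression `(s > 0) | s.isna()` is just one way to ignore nulls while keeping the index intact, chosen here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, -3.0, 4.0])

# Dropping nulls inside the check function shortens the output:
# label 1 is gone, so the result no longer lines up with s.
dropped = s.dropna() > 0

# Treating nulls as passing keeps one boolean per original row.
aligned = (s > 0) | s.isna()

# With an aligned result, failing rows can be located by label.
failing_rows = s.index[~aligned].tolist()

print(list(dropped.index), list(aligned.index), failing_rows)
```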
One potential improvement that would make this behavior more intuitive would be to infer that `nullable=True` if at least one of the checks has `ignore_na=True`, which I'd totally be on-board with. Let me know what you think about that and we can make another issue with a more detailed description + implementation plan.
For the scope of this issue though, I think it's important to update the docs with the current behavior, namely the docs in https://pandera.readthedocs.io/en/stable/checks.html#handling-null-values need to be updated. The behavior now is that null entries are ignored instead of being dropped.