Implement parsing functionality in dataframe schemas
Is your feature request related to a problem? Please describe.
Based on the discussion in https://github.com/ericmjl/pyjanitor/issues/703 and https://github.com/pandera-dev/pandera/issues/249 with @ericmjl and @UGuntupalli, one of the ideas to come out of it is that there is a use case for the following data cleaning workflow:
- check the properties of the values in a raw DataFrame
- get a boolean Series/DataFrame indicating where those checks passed/failed
- use the boolean Series/DataFrame for data cleaning/visualization of errors.
Describe the solution you'd like
`pandera` should be a utility for data validation, leaving any data manipulation to core pandas or packages like `pyjanitor`. Therefore, it's within the scope of this package to provide users with access to the results of the `Check`s that schemas perform during validation (without raising a `SchemaError`), for them to use for their own purposes.
The solution I'm leaning towards right now is a decorator whose name is still TBD; the functionality of the decorator (with the placeholder name `parser`) is illustrated in this code sketch:
import numpy as np
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "col1": pa.Column(checks=pa.Check.greater_than_or_equal_to(0, name="positive")),
    "col2": pa.Column(checks=pa.Check.isin(["a", "b", "c"], name="category_abc")),
})

@pa.parser(schema)
def clean_data(df, failed):
    """
    :param df: dataframe to clean
    :param failed: passed in by `pa.parser` decorator. A boolean dataframe with
        the same index as df, where columns are check names. True indicates
        failure cases.
    """
    clean_df = (
        # replace negative values with nans
        df.update_where(failed["positive"], "col1", np.nan)
        # filter out records with unknown categories
        .filter_on(failed["category_abc"], complement=True)
    )
    return clean_df

def load_data(file):
    return (
        pd.read_csv(file)
        .pipe(clean_data)
        # visualization, modeling, etc.
    )
What `pa.parser` does is basically combine `check_input` and `check_output` with some extra semantics:
- the schema is validated on the decorated function's input, producing the boolean vectors that @UGuntupalli needs to implement steps 3 and 4
- the parser decorator then passes the results in as a positional argument, a dataframe called `failed` in this example
- the function body of `clean_data` is responsible for cleaning the data so that those failure cases are amended somehow
- the parser decorator then re-executes data validation to make sure the function implements the correct data cleaning logic
- only `Check`s that output a Series/DataFrame matching the index of the raw dataframe would be included in `failed`
The `parser` decorator might even have a `check_output: bool` kwarg that makes checking the output of the function optional.
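To make the intended shape of `failed` concrete, here is a minimal hand-rolled sketch using only pandas operations that exist today (the decorator itself is still a proposal); the check names `positive` and `category_abc` come from the example schema above:

import pandas as pd

# raw data with one failing value per column
df = pd.DataFrame({"col1": [1, -2, 3], "col2": ["a", "z", "c"]})

# what the proposed decorator would pass in as `failed`: one boolean column per
# named check, aligned to df's index, where True marks a failure case
failed = pd.DataFrame({
    "positive": ~(df["col1"] >= 0),
    "category_abc": ~df["col2"].isin(["a", "b", "c"]),
})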
Describe alternatives you've considered
- Add a method to `DataFrameSchema`, like `get_check_results`, to get the boolean Series/DataFrame of passes/failures (see the sketch below).
- As part of this issue, rename the `transformer` argument to `parser` in `pa.DataFrameSchema`; it should have the same function signature and behavior as the `pa.parser` decorator.
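A hypothetical sketch of the first alternative in use; `get_check_results` does not exist in pandera, and the return format (a boolean frame keyed by check name, True meaning the check passed) is an assumption:

import numpy as np
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "col1": pa.Column(checks=pa.Check.greater_than_or_equal_to(0, name="positive")),
})
df = pd.DataFrame({"col1": [1, -2, 3]})

# hypothetical API: boolean DataFrame of check results, one column per named check
passed = schema.get_check_results(df)
df.loc[~passed["positive"], "col1"] = np.nan  # amend the failure cases manually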
For inspiration, perhaps have a look at pandas_schema. I also found this article providing a solution to the requirements above. What I like about this method is:
- You get a list of all errors
- You can retrieve indexes from rows that failed the validation
- You can see a validation error message
Some room for improvement that I see when tackling this requirement is to provide an extended version of what pandas-schema provides (`(0, '{row: 2, column: "dec3"}: "ee" is not decimal')`):
- Keep the index, e.g. `0`
- Extend the dictionary with more metadata about the error to improve usability, e.g. `{row: 2, column: "dec3", value: "ee", validation_error: "decimal_validation"}`
- Have a human-readable error message, e.g. `'"ee" is not decimal'`
This could help when building pipelines. Certain validation errors are good to know as warnings, while other validation errors impact the continuation of a pipeline. Being able to filter on those, and to export the validation results (errors), would greatly improve the use of validations in pipelines and traceability. E.g. a failure in decimal validation could result in either:
- Removing these records as further processing will fail otherwise
- Imputing a default to continue if feasible (sidetrack in a pipeline)
def impute_age():
    average_age = 45  # or a function based on other values
    return average_age

checks = [
    # a check that removes the record upon failure
    pa.Check(lambda s: decimal_validation(s), element_wise=False, error_msg="value is not a decimal", filter=True),
    # a check that updates the value upon failure
    pa.Check(lambda s: age_validation(s), element_wise=False, error_msg="age is not considered within normal ranges", filter=False, update=impute_age()),
]
Thanks for the feedback @Tankske, I'll consider it when working out the specific implementation for this issue.
The `Check` object has an `error` argument that serves the same purpose as `error_msg` in your code example, so that attribute can be pulled out of a check object.
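For reference, the existing `error` argument on a check looks like this (this is current pandera API, not part of the proposal):

import pandera as pa

# `error` attaches a human-readable message to a check, surfaced in error reports
positive_check = pa.Check(lambda s: s > 0, error="value must be positive")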
The main problem this issue is trying to tackle is to help users, during the development/debugging process, to apply pandas data transformation logic to the rows/columns that don't fulfill the assumptions made in the schema. This activity would be facilitated by having a data structure that's isomorphic to the thing you want to operate on (the invalid dataframe), which is why I'm initially thinking that the `failed` data structure should, at least in part, be a dataframe.
For error reporting and human readability, I think something close to the lazy validation output would be nice (which, in fact, I modeled a little after pandas-schema, but in a tabular format), so perhaps `failed` should be a namedtuple containing the boolean dataframe and the error-reporting dataframe:
@pa.parser(schema)
def clean_data(df, failed):
    """
    :param df: dataframe to clean
    :param failed: passed in by `pa.parser` decorator.
    """
    check_results = failed.check_results  # boolean dataframe where checks failed
    failure_cases = failed.cases  # dataframe of failure cases with human-readable output
    clean_df = (
        # replace negative values with nans
        df.update_where(check_results["positive"], "col1", np.nan)
        # filter out records with unknown categories
        .filter_on(check_results["category_abc"], complement=True)
    )
    return clean_df
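A minimal sketch of what that namedtuple container might look like; `FailedCases` and its field names are placeholders for illustration, not pandera API:

from typing import NamedTuple
import pandas as pd

class FailedCases(NamedTuple):
    """Hypothetical container that the decorator would pass in as `failed`."""
    check_results: pd.DataFrame  # boolean frame aligned to df's index; True marks a failure
    cases: pd.DataFrame          # tabular, human-readable failure report (like lazy validation output)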
Re: filtering and imputation, I'd be open to expanding `pandera`'s scope from data validation to data parsing (filtering, imputation, value replacement, etc.), but I'd like to design it carefully so as not to add too much complexity to the user experience. The reason I've scoped this issue the way I did is that it maintains the separation of concerns between validation and data manipulation, delegating the latter to core pandas or tools built on top of it like `pyjanitor`.
I think the code sketch that you provided is a good start, and if you'd like to pursue this direction, I'd encourage you to open a new issue articulating the problem and solution design that you have in mind.
Hello,
just checking whether this functionality can be implemented using available APIs in pandera, or whether we have to wait for this PR to land.
TIA, Joy
hi @JoyMonteiro, this functionality won't be available for another few releases... supporting parsing is something I want to design carefully, and the code sketches above are likely not what the final implementation will look like.
To help with the design process, can you describe what your parsing use case is?
I see. thanks for letting me know.
We are trying to build a tool to assist in cleaning/curating data. It would consist of a UI (probably made with panel) where the user uploads an Excel file. This file will be parsed and cleaned (to some extent) in the backend, and all rows/columns that don't pass validation checks will be highlighted and displayed to the user.
This would likely be an iterative process until the dataframe reaches a certain data quality (meaning it obeys a detailed schema).
This is where I hoped to use this functionality. Having it in pandera would be nice because the final validation of the cleaned dataframe would be via pandera. Right now we are doing the validation using a bunch of functions, and writing a more maintainable version of the same would imply duplicating things that pandera already does.
Pandera Parsing
As referenced in #542, imo pydantic's `validator` decorator is a bit of a misnomer, as what it's doing is (i) parsing raw data values and (ii) emitting an error in the case of invalid ones (the validation bit).
We can map (ii) easily onto pandera's concept of checks, which return booleans mainly to indicate which elements in the dataframe failed the check... ultimately a `SchemaError` is raised (or `SchemaErrors` in the case of lazy validation), which contains the data on which row/column indexes contained failing values.
pandera is already sort of a parsing tool, because `schema(df)` or `schema.validate(df)` is guaranteed to return a valid dataframe or raise an error for values that are not in the domain of the schema specification... what it currently lacks is a way of specifying transformations that can convert potentially invalid raw values into valid values; `coerce=True` is currently the only transformation that pandera schemas perform.
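To illustrate that last point with the API that exists today, dtype coercion is already a small parsing step: raw values are transformed before checks run.

import pandas as pd
import pandera as pa

# coerce=True is currently the only transformation a schema performs
schema = pa.DataFrameSchema({"col1": pa.Column(int, coerce=True)})
raw = pd.DataFrame({"col1": ["1", "2", "3"]})  # raw string values
validated = schema.validate(raw)               # col1 is coerced to int before validation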
Here are a few ways to go about implementing the parsing functionality.
Proposal 1: Parsing as a function
As hinted at in this issue's description (https://github.com/pandera-dev/pandera/issues/252#issue-667327244), this proposal would provide a functional interface for users to parse a raw dataframe given the failure cases produced by a schema:
import numpy as np
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "col1": pa.Column(checks=pa.Check.ge(0)),
    "col2": pa.Column(checks=pa.Check.isin(["a", "b", "c"])),
})

@pa.parse(schema)
def clean_data(df, failure_cases):
    """
    :param df: dataframe to clean
    :param failure_cases: passed in by `pa.parse` decorator. A boolean dataframe with
        the same index as df, where columns are check names. True indicates
        failure cases.
    """
    clean_df = (
        # replace negative values with nans
        df.update_where(failure_cases["col1"]["ge"], "col1", np.nan)
        # filter out records with unknown categories
        .filter_on(failure_cases["col2"]["isin"], complement=True)
    )
    return clean_df
# - SchemaModel syntax -
class Schema(pa.SchemaModel):
    col1 = pa.Field(ge=0)
    col2 = pa.Field(isin=["a", "b", "c"])

@pa.parse
def clean_data(df: pa.typing.DataFrame[Schema], failure_cases):
    ...

def load_data(file):
    return (
        pd.read_csv(file)
        .pipe(clean_data)
        # visualization, modeling, etc.
    )
Proposal 2: A single global parser function supplied to a schema
Very similar to proposal 1, but baked into a schema object (also similar to the now-deprecated `transformer` kwarg):
schema = pa.DataFrameSchema(
    columns={
        "col1": pa.Column(checks=pa.Check.ge(0)),
        "col2": pa.Column(checks=pa.Check.isin(["a", "b", "c"])),
    },
    parser=lambda df, failure_cases: ...
)

class Schema(pa.SchemaModel):
    col1 = pa.Field(ge=0)
    col2 = pa.Field(isin=["a", "b", "c"])

    # the parser method might be a reserved name in SchemaModel
    @classmethod
    def parser(cls, df, failure_cases):
        ...
Proposal 3: Column- and Dataframe-level parsers
Closer in spirit to pydantic:
schema = pa.DataFrameSchema(
    columns={
        # a single parser that replaces negative values with 0
        "column": pa.Column(int, parsers=pa.Parser(lambda series: series.mask(series <= 0, 0)))
    },
)

class Schema(pa.SchemaModel):
    column: int

    @pa.parser("column")
    def column_gt_zero(cls, series):
        return series.mask(series <= 0, 0)

# dataframe-level parsing
schema = pa.DataFrameSchema(
    parsers=pa.Parser(lambda df: df.mask(df < 0, 0))
)

class Schema(pa.SchemaModel):
    class Config:
        gt = 0

    @pa.dataframe_parser
    def dataframe_gt_zero(cls, df, check_results):
        return df.mask(~check_results["gt"], 0)
Proposal 4: Parsers as a Special Type of Check
Similar to 3, but instead of introducing a new keyword, introduce `Parser` as a special type of `Check`. This also harkens back to something @jeffzi proposed here.
schema = pa.DataFrameSchema(
    columns={
        # a single parser that replaces negative values with 0
        "column": pa.Column(int, checks=pa.Parser(lambda series: series.mask(series <= 0, 0)))
    },
)
The parser function would have the same semantics as pydantic validators, so users can also define parsers that are equivalent to checks:
# parsers can be equivalent to checks
def gt_0(series):
    failure_cases = series <= 0
    if failure_cases.any():
        raise SchemaError(..., failure_cases=failure_cases)
    return series

gt_0_parser = pa.Parser(gt_0, ...)
And, similar to pydantic, parser functions might also support depending on checks/parsers that come before them:
schema = pa.DataFrameSchema(
    columns={
        # a parser that replaces values failing the preceding check with 0
        "column": pa.Column(
            int,
            checks=[
                pa.Check.gt(0),
                pa.Parser(lambda series, check_results: series.mask(~check_results["gt"], 0)),
            ]
        )
    },
)
Pros and Cons
(1) and (2)
Pros
- Simple to implement and reason about a single function that parses the dataframe
- Don't have to worry about check and parsing order
Cons
- Off-loads the complexity of parsing to the user. The parser is basically a user-defined function (or set of functions) that will look like pandas data-cleaning code.
(3) and (4)
Pros
- Each parser has a well-defined scope
- Less complex for the user to implement
- More familiar to pydantic users
Cons
- Many open questions and new ideas potentially need to be introduced into pandera to achieve feature-parity with pydantic, e.g.:
- should parsers be an alternative way of validating dataframes, or should they interact with checks somehow, like having access to check boolean outputs?
- if parsers and checks do interact, should checks be re-run after parsing? Or is the user responsible for making sure the parsers output valid values?
- should there be multiple parsers per column? How about multiple parsers at the dataframe level? Are there unnecessary complications resulting from multiple parsers per column?
- what is the execution order of parsers: column-level -> dataframe-level, or vice versa?
Right now I'm leaning towards (4), but I'm curious what your thoughts are @jeffzi @d-chambers @JoyMonteiro
I think your proposals are not incompatible:
- (1) and (2) are called after validation and hand in the failed cases. The latter is a desirable feature missing from (3) and (4); `SchemaErrors.failure_cases` itself is a bit rough imo.
- (3) and (4) are called before validation.
At my work, I created a function that splits out failed cases after validation so that they can be stored and debugged later. `SchemaError` could have a method to separate failed cases with their associated errors. That would be a third alternative for post-validation processing, without introducing a new argument or decorator.
try:
    events = schema.validate(events, lazy=True)
except pa.errors.SchemaErrors as err:
    events = extract_bad_records(err)  # split failed cases and append a column "error"
    is_bad_event = ~events["error"].isnull()
    bad_events = events[is_bad_event]
    events = events[~is_bad_event].drop(columns="error")
    ...
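A rough sketch of what a helper like `extract_bad_records` might do, assuming `err.failure_cases` (the tabular error report `SchemaErrors` exposes) and `err.data` (the dataframe that failed validation) are joined back together by index; the message aggregation and the `error` column format are assumptions:

import pandas as pd
import pandera as pa

def extract_bad_records(err: pa.errors.SchemaErrors) -> pd.DataFrame:
    """Attach a human-readable "error" column to the dataframe that failed validation."""
    data = err.data.copy()
    # err.failure_cases has an "index" column pointing back into the original
    # dataframe (NaN for dataframe-wide errors, dropped here)
    failure_cases = err.failure_cases.dropna(subset=["index"])
    messages = (
        failure_cases.groupby("index")["check"]
        .apply(lambda checks: "; ".join(checks.astype(str)))
    )
    data["error"] = data.index.to_series().map(messages)
    return data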
I would pick one solution each for pre- and post-validation, since they do not have the same purpose. My preference:
- pre: The choice between (3) and (4) depends on the direction of the `Check` api. Specialized `Check` classes were discussed in #429. Is that option back on the table?
- post: I'd rather have failed cases returned from `validate()`, or improve the UX of the exception (see proposal above).
Re: naming. attrs uses `validator` for `check` and `converter` for what you called `parser`. Basically, `attrs` implements (3). attrs converters are called before validators.
My order of preference would be `transformer` >= `converter` > `parser`. `transformer` is a well-known concept, `converter` says what it does, and `parser` is often associated with loading data from a particular format (e.g. json -> dataframe).
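For comparison, the attrs pattern referenced above (real attrs API): the converter runs first to normalize the raw value, then the validator checks the result, mirroring the parse-then-check ordering discussed here.

import attr

@attr.s
class Record:
    # converter runs before validator: the raw string is parsed to an int,
    # then the validator confirms the final type
    age = attr.ib(converter=int, validator=attr.validators.instance_of(int))

record = Record(age="42")
assert record.age == 42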
Cool, thanks for your feedback @jeffzi. I think it makes sense to distinguish between pre-processing and post-processing. Re: naming, I think `parse` is a perfectly suitable name for converting data from a raw form into a clean form... in fact, pydantic even makes that distinction:
pydantic is primarily a parsing library, not a validation library. Validation is a means to an end: building a model which conforms to the types and constraints provided.
And there was even an issue discussing renaming `validate` -> `parse`, though I think for maintainability reasons this wasn't done.
To clarify my thinking around this feature, here's a higher-level proposal for the pandera parsing/validation pipeline order of execution (inspired by the way `attrs` does things):
1. check the data type, or cast the column into that type if `coerce=True` (perhaps `coerce=True` should become the default in a future version)
2. parse the column/dataframe into some normalized set of values (solutions 3 or 4 above)
3. check that the constraints are actually met after parsing
4. postprocess the dataframe for debugging/reporting purposes. Users can also handle check failure cases by whatever means they deem necessary, but this is outside of the correctness guarantees that pandera provides.
Note that the user interaction model this implies is that pandera intends step 4 only as a way for the user to further refine the parsing functionality in step 2, in order to fulfill the constraints established by step 3.
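A hypothetical end-to-end sketch of that order of execution; the `parsers` argument in step 2 does not exist in pandera and is shown commented out, while `coerce`, `Check`, and lazy validation (steps 1, 3, and 4) are existing API:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "amount": pa.Column(
        float,
        coerce=True,                                     # 1. dtype check/coercion
        # parsers=pa.Parser(lambda s: s.clip(lower=0)),  # 2. parse (proposed, not yet implemented)
        checks=pa.Check.ge(0),                           # 3. check constraints
    )
})

try:
    validated = schema.validate(pd.DataFrame({"amount": ["1.5", "-2"]}), lazy=True)
except pa.errors.SchemaErrors as err:
    report = err.failure_cases                           # 4. postprocess for debugging/reporting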
I like solution (3), which keeps parsing and checking separate: this way we get nice continuity with the data synthesis strategies, which remain coupled with `Check`s and don't have to change as a result of introducing this feature. This means that parsers basically serve as an extension of the `dtype` + `coerce=True` combination, allowing users to further refine the contents of a dataframe beyond the primitive data types.
Totally agreed on better UX for handling schema errors, though I think your proposal needs a little more refinement: the `error` column in the code snippet would need to be denormalized, since there can be multiple errors for a single index/column location... I think we can discuss further in a separate issue.
I understand your reasoning about the `parser` name, thanks for clarifying. I completely agree with the flow you described :+1:
Totally agreed on better UX for handling schema errors, though I think your proposal needs a little more refinement,
Agreed. The error
columns actually contains all errors related to the row, in a json line format. If an error is schema-wide, the error is duplicated in every row... That's a bit hacky and very specific to my use case. A pandera-native solution should not have so strong assumptions. I'll think about a generalized approach.