pandera
Soft schema check decorators which log schema errors rather than raising an exception
Is your feature request related to a problem? Please describe.
A pandera schema check that fails raises an exception. This is useful for catching data errors early that would otherwise cause downstream code to crash. However, in some cases I would like to use pandera to state assumptions about the data that need not hold for the code to run. I want to be alerted when these assumptions are violated, but instead of an exception that I need to handle, I would prefer e.g. a warning in my log file.
Describe the solution you'd like
I wrote the following decorator and am using it productively in a project. I believe a decorator with this functionality would be a useful addition to pandera.
from pandera import (
    DataFrameSchema,
    Check,
    Index,
    Column,
    Int,
    String,
    Float,
    Bool,
    Category,
)
import pandera
import pandas
import inspect
import logging
import functools
from pathlib import Path
import os


def check(strict=False, export_errors=True, **kwargs):
    """
    A drop-in replacement for pandera.check_io which catches and logs
    the SchemaError.
    """

    def _log_schema_error(
        func_name: str, data_name: str, err: pandera.errors.SchemaError
    ):
        message = (
            f"Schema validation in '{func_name}' with '{data_name}' raised error: {err}\n"
            + f"Schema failure cases:\n{err.failure_cases}\n"
        )
        logging.error(message)

    def _export_schema_error(
        schema_name: str,
        func_name: str,
        data_name: str,
        err: pandera.errors.SchemaError,
    ):
        """Export schema error to .xlsx file."""
        try:
            error_dir = Path("data/09_validation/schema_errors")
            file_name = f"{schema_name}-{func_name}-{data_name}.xlsx"
            # the context manager saves and closes the workbook on exit
            with pandas.ExcelWriter(error_dir / file_name) as writer:
                # store the error as text so the Excel engine can write it
                pandas.DataFrame({"error": [str(err)]}).to_excel(
                    writer, sheet_name="error message"
                )
                err.failure_cases.to_excel(writer, sheet_name="failure cases")
        except Exception as ex:
            logging.error("Schema failure cases cannot be exported")
            logging.exception(ex)

    check_args = kwargs

    def decorator(transform):
        @functools.wraps(transform)
        def transform_wrapper(*args, **kwargs):
            argspec = inspect.getfullargspec(transform)
            named_inputs = dict(zip(argspec.args, args))
            # also cover dataframes passed as keyword arguments
            named_inputs.update(kwargs)
            # validate input schemas
            for (df_name, df_schema) in check_args.items():
                if df_name == "out":
                    continue
                try:
                    df_schema.validate(named_inputs[df_name])
                except pandera.errors.SchemaError as err:
                    _log_schema_error(
                        func_name=transform.__name__, data_name=df_name, err=err
                    )
                    if export_errors:
                        _export_schema_error(
                            schema_name=df_schema.name,
                            func_name=transform.__name__,
                            data_name=df_name,
                            err=err,
                        )
            # apply data transform
            out = transform(*args, **kwargs)
            # validate output schema
            if "out" in check_args:
                try:
                    check_args["out"].validate(out)
                except pandera.errors.SchemaError as err:
                    _log_schema_error(
                        func_name=transform.__name__, data_name="out", err=err
                    )
                    if export_errors:
                        _export_schema_error(
                            schema_name=check_args["out"].name,
                            func_name=transform.__name__,
                            data_name="out",  # was df_name, which named the last input
                            err=err,
                        )
                    if strict:
                        raise err
            return out

        return transform_wrapper

    return decorator
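One detail of the decorator above that is worth isolating is how call arguments get mapped to parameter names so that each schema can find its dataframe. As a rough sketch, the standard library's inspect.signature(...).bind can do this mapping and also resolves keyword arguments and defaults. The helper name bind_named_inputs and the example function are hypothetical and not part of pandera.

```python
import inspect


def bind_named_inputs(func, *args, **kwargs):
    """Return a dict mapping the parameter names of `func` to the values passed."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # include parameters left at their default value
    return dict(bound.arguments)


def data_transform(data, reference=None, scale=1.0):
    # hypothetical transform, used only to demonstrate the binding
    return data


# positional and keyword arguments both end up under their parameter names
named = bind_named_inputs(data_transform, [1, 2, 3], reference=[4, 5])
```

With a mapping like this, a lookup such as named["reference"] works regardless of whether the caller passed the argument positionally or by keyword.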
This is awesome @clstaudt !
I think this use case is probably pretty common and I'd love to fold this into the pandera codebase.
Here's a proposal for how to do this:
- create a private decorator decorators._log_error, which implements the logging and exporting logic in your above code snippet. The user needs to provide an error_exporter, since we don't really want to be opinionated about where and in what format these error logs should be written (maybe eventually we think of reasonable defaults for this)
- use this decorator in the check_{input, output, io, types} functions to catch and handle those errors correctly
Something like:
from typing import Callable, Optional

import wrapt
from pandera.errors import SchemaError, SchemaErrors


def _log_error(
    fn: Callable,
    error_exporter: Optional[Callable] = None,
    # other arguments here, e.g. error directory and file name format?
):
    @wrapt.decorator
    def _wrapper(wrapped, instance, args, kwargs):
        # wrapt passes the call arguments as a tuple and a dict
        try:
            return wrapped(*args, **kwargs)
        except (SchemaError, SchemaErrors) as exc:
            # your error handling logic here
            if error_exporter is not None:
                error_exporter(exc)

    return _wrapper(fn)


def check_io(..., raise_error: bool = True, error_exporter: Optional[Callable] = None):
    @wrapt.decorator
    def _wrapper(...):
        ...

    if raise_error:
        return _wrapper
    return _log_error(_wrapper, error_exporter)
What do you think @clstaudt ?
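To make the raise_error / error_exporter idea concrete without depending on pandera internals, here is a minimal standard-library sketch. It is not the pandera implementation: StandInSchemaError stands in for pandera.errors.SchemaError, and soft_check, require_non_empty, and total are hypothetical names chosen for illustration.

```python
import functools
import logging
from typing import Callable, Optional


class StandInSchemaError(Exception):
    """Hypothetical stand-in for pandera.errors.SchemaError."""


def soft_check(
    validator: Callable[[object], None],
    raise_error: bool = True,
    error_exporter: Optional[Callable[[Exception], None]] = None,
):
    """Validate the first argument of the decorated function with `validator`."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(data, *args, **kwargs):
            try:
                validator(data)
            except StandInSchemaError as exc:
                if raise_error:
                    raise
                # soft mode: report the failure and carry on
                logging.getLogger(__name__).warning("validation failed: %s", exc)
                if error_exporter is not None:
                    error_exporter(exc)
            return fn(data, *args, **kwargs)

        return wrapper

    return decorator


def require_non_empty(data):
    # hypothetical validator: raises when the input is empty
    if len(data) == 0:
        raise StandInSchemaError("input is empty")


exported = []  # collects exported errors for the demonstration


@soft_check(require_non_empty, raise_error=False, error_exporter=exported.append)
def total(data):
    return sum(data)
```

In this sketch the wrapped function still runs after a failed check in soft mode, which mirrors the behaviour of the decorator in the original post; whether pandera should do the same is a design decision left open in this thread.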
@cosmicBboy
So usage would look like this?
@check_io(
    data=schema.foo,
    out=schema.bar,
    raise_error=False,
    error_exporter=my_error_logger,
)
def data_transform(data: DataFrame) -> DataFrame:
    ...
    return data_transformed
I think that would be great to have in pandera. Can you integrate this feature based on my code example?
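The usage example above leaves my_error_logger unspecified. As one possible illustration, an exporter could append each failure to a JSON Lines file. The sketch assumes only that the exporter is called with the exception; the factory name make_jsonl_exporter, the record layout, and the use of getattr to pick up an optional failure_cases attribute are assumptions, not pandera API.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def make_jsonl_exporter(path: Path) -> Callable[[Exception], None]:
    """Build an exporter that appends one JSON record per error to `path`."""

    def export(exc: Exception) -> None:
        record = {
            "time": datetime.now(timezone.utc).isoformat(),
            "error_type": type(exc).__name__,
            "message": str(exc),
            # stringified because failure cases may not be JSON-serializable
            "failure_cases": str(getattr(exc, "failure_cases", None)),
        }
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")

    return export


# demonstration with a plain exception and a temporary directory
log_path = Path(tempfile.mkdtemp()) / "schema_errors.jsonl"
my_error_logger = make_jsonl_exporter(log_path)
my_error_logger(ValueError("column 'a' failed check"))
records = [
    json.loads(line)
    for line in log_path.read_text(encoding="utf-8").splitlines()
]
```

Building the exporter through a factory keeps the output location a caller decision, in line with the maintainer's wish not to be opinionated about where and in what format errors are written.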
@clstaudt yep! that would be the user-facing API.
> Can you integrate this feature based on my code example?
If you have the capacity to contribute a PR, that would be great! If not, I'll get to it eventually, but I can't say when it'll be done.
I'll also add a help wanted tag to this issue in case anyone else in the community wants to take it on... with your code example it would be pretty straightforward.
@cosmicBboy I had a look but the integration wasn't straightforward for me at this point. I'd like to leave the PR to someone who already knows the pandera codebase.