
Soft schema check decorators which log schema errors rather than raising an exception

Open clstaudt opened this issue 3 years ago • 4 comments

Is your feature request related to a problem? Please describe.

A pandera schema check that fails raises an exception. This is useful for catching data errors early, before they crash code downstream. In some cases, however, I would like to use pandera to state assumptions about the data that need not hold for the code to run. I want to be alerted when these assumptions are violated, but I do not want an exception that I have to handle; a warning in my log file would be enough.

Describe the solution you'd like

I wrote the following decorator and am using it productively in a project. I believe a decorator with this functionality would be a useful addition to pandera.

from pandera import (
    DataFrameSchema,
    Check,
    Index,
    Column,
    Int,
    String,
    Float,
    Bool,
    Category,
)
import pandera
import pandas
import inspect
import logging
import functools
from pathlib import Path
import os


def check(strict=False, export_errors=True, **kwargs):
    """
    A drop-in replacement for pandera.check_io which catches and logs
    the SchemaError.
    """

    def _log_schema_error(
        func_name: str, data_name: str, err: pandera.errors.SchemaError
    ):
        message = (
            f"Schema validation in '{func_name}' with '{data_name}' raised error: {err}\n"
            + f"Schema failure cases:\n{err.failure_cases}\n"
        )
        logging.error(message)

    def _export_schema_error(
        schema_name: str,
        func_name: str,
        data_name: str,
        err: pandera.errors.SchemaError,
    ):
        """Export schema error to .xlsx file."""
        try:
            error_dir = Path("data/09_validation/schema_errors")
            # create the target directory if it does not exist yet
            error_dir.mkdir(parents=True, exist_ok=True)
            file_name = f"{schema_name}-{func_name}-{data_name}.xlsx"
            # the context manager saves and closes the workbook on exit
            with pandas.ExcelWriter(error_dir / file_name) as writer:
                pandas.DataFrame({"error": [str(err)]}).to_excel(
                    writer, sheet_name="error message"
                )
                err.failure_cases.to_excel(writer, sheet_name="failure cases")
        except Exception as ex:
            logging.error("Schema failure cases cannot be exported")
            logging.exception(ex)

    check_args = kwargs

    def decorator(transform):
        @functools.wraps(transform)
        def transform_wrapper(*args, **kwargs):
            # map positional and keyword arguments to parameter names
            bound = inspect.signature(transform).bind(*args, **kwargs)
            named_inputs = bound.arguments
            # validate input schemas
            for (df_name, df_schema) in check_args.items():
                if df_name == "out":
                    continue
                try:
                    df_schema.validate(named_inputs[df_name])
                except pandera.errors.SchemaError as err:
                    _log_schema_error(
                        func_name=transform.__name__, data_name=df_name, err=err
                    )
                    if export_errors:
                        _export_schema_error(
                            schema_name=df_schema.name,
                            func_name=transform.__name__,
                            data_name=df_name,
                            err=err,
                        )
            # apply data transform
            out = transform(*args, **kwargs)
            # validate output schema
            if "out" in check_args:
                try:
                    check_args["out"].validate(out)
                except pandera.errors.SchemaError as err:
                    _log_schema_error(
                        func_name=transform.__name__, data_name="out", err=err
                    )
                    if export_errors:
                        _export_schema_error(
                            schema_name=check_args["out"].name,
                            func_name=transform.__name__,
                            data_name="out",
                            err=err,
                        )
                    if strict:
                        raise err
            return out

        return transform_wrapper

    return decorator
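
Mapping call arguments to parameter names is the delicate part of a wrapper like this: pairing `argspec.args` with the positional `args` misses anything passed by keyword, so a later lookup by name can fail. The sketch below shows how the standard library's `inspect.signature(...).bind` resolves both kinds of argument. The `transform` function and the `bind_by_name` helper are hypothetical names used only for illustration.

```python
import inspect


def transform(data, lookup, scale=1.0):
    """Hypothetical transform; only its signature matters here."""
    return data


def bind_by_name(func, *args, **kwargs):
    """Map every argument of a call to its parameter name."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # include parameters left at their defaults
    return dict(bound.arguments)


# Positional-only pairing misses `lookup`, which is passed by keyword.
positional_only = dict(zip(inspect.getfullargspec(transform).args, ["df"]))
# Binding against the signature resolves positional and keyword arguments alike.
by_signature = bind_by_name(transform, "df", lookup="table")
```

Here `positional_only` contains only `data`, while `by_signature` also contains `lookup` and the default for `scale`.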

clstaudt, Jan 21 '22 14:01

This is awesome @clstaudt !

I think this use case is probably pretty common and I'd love to fold this into the pandera codebase.

Here's a proposal for how to do this:

  1. create a private decorator decorators._log_error, which implements the logging and exporting logic in your above code snippet. The user needs to provide an error_exporter, since we don't really want to be opinionated about where and in what format these error logs should be written (maybe eventually we think of reasonable defaults for this)
  2. use this decorator in the check_{input, output, io, types} functions to catch and handle those errors correctly

Something like:

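from typing import Callable, Optional

import wrapt
from pandera.errors import SchemaError, SchemaErrors


def _log_error(
    fn: Callable,
    error_exporter: Optional[Callable] = None,
    # other arguments here, e.g. error directory and file name format?
):
    # wrapt passes the call's arguments as a tuple (args) and a dict (kwargs)
    @wrapt.decorator
    def _wrapper(wrapped, instance, args, kwargs):
        try:
            return wrapped(*args, **kwargs)
        except (SchemaError, SchemaErrors) as exc:
            # your error handling logic here
            if error_exporter is not None:
                error_exporter(exc)

    return _wrapper(fn)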


def check_io(..., raise_error: bool = True, error_exporter: Callable = None):
    @wrapt.decorator
    def _wrapper(...):
        ...

    if raise_error:
        return _wrapper

    return _log_error(_wrapper, error_exporter)
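
To make the `raise_error` / `error_exporter` idea concrete without depending on pandera internals, here is a minimal standard-library sketch of the pattern. It is not pandera's implementation: `soft_check`, `ValidationError`, and `non_negative` are hypothetical stand-ins (the exception plays the role of `SchemaError`), and a plain `functools.wraps` decorator is used instead of `wrapt`. One design point it illustrates is that the wrapped function's result is still returned when a soft check fails.

```python
import functools
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class ValidationError(Exception):
    """Stand-in for pandera's SchemaError / SchemaErrors (assumption)."""


def soft_check(
    validate: Callable,
    raise_error: bool = True,
    error_exporter: Optional[Callable] = None,
):
    """Validate a function's output; raise or merely report failures."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            try:
                validate(out)
            except ValidationError as exc:
                if raise_error:
                    raise
                logger.warning("check failed in %r: %s", fn.__name__, exc)
                if error_exporter is not None:
                    error_exporter(exc)
            return out  # the result survives a failed soft check

        return wrapper

    return decorator


def non_negative(values):
    """Hypothetical validator: raise if any value is below zero."""
    bad = [v for v in values if v < 0]
    if bad:
        raise ValidationError(f"negative values: {bad}")


exported = []


@soft_check(non_negative, raise_error=False, error_exporter=exported.append)
def shift(values, by):
    return [v + by for v in values]
```

Calling `shift([1, 2, 3], by=-2)` returns the shifted list, logs a warning, and hands the exception to the exporter. With the default `raise_error=True` the same failure propagates as an exception.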

What do you think @clstaudt ?

cosmicBboy, Feb 10 '22 15:02

@cosmicBboy

So usage would look like this?

@check_io(
    data=schema.foo,
    out=schema.bar,
    raise_error=False,
    error_exporter=my_error_logger,
)
def data_transform(data: DataFrame) -> DataFrame:
    ...
    return data_transformed

I think that would be great to have in pandera. Can you integrate this feature based on my code example?

clstaudt, Feb 10 '22 15:02

@clstaudt yep! that would be the user-facing API.
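
For illustration, an `error_exporter` such as `my_error_logger` could be as small as the sketch below. It assumes the exporter is called with the exception as its only argument, as in the `_log_error` sketch earlier in this thread. The `make_error_exporter` factory, the file naming scheme, and the plain-text format are hypothetical choices, not part of pandera.

```python
import logging
import tempfile
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def make_error_exporter(error_dir: Path) -> Callable[[Exception], Path]:
    """Build an exporter that writes each error to a numbered text file."""
    error_dir.mkdir(parents=True, exist_ok=True)

    def export(exc: Exception) -> Path:
        # Number files by how many reports already exist in the directory.
        index = len(list(error_dir.glob("schema_error_*.txt")))
        path = error_dir / f"schema_error_{index:04d}.txt"
        lines = [f"{type(exc).__name__}: {exc}"]
        # pandera schema errors carry a `failure_cases` attribute; other
        # exceptions are exported with their message only.
        failure_cases = getattr(exc, "failure_cases", None)
        if failure_cases is not None:
            lines.append(str(failure_cases))
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        logger.warning("schema error exported to %s", path)
        return path

    return export


# Usage with a temporary directory and a plain exception as a stand-in.
my_error_logger = make_error_exporter(Path(tempfile.mkdtemp()) / "schema_errors")
report = my_error_logger(ValueError("column 'a' failed check"))
```

Keeping the exporter a plain callable leaves the destination and format entirely to the user.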

> Can you integrate this feature based on my code example?

If you have the capacity to contribute a PR, that would be great! If not, I'll get to it eventually, but I can't say when it'll be done.

I'll also add a help wanted tag to this issue in case anyone else in the community wants to take it on. With your code example it would be pretty straightforward.

cosmicBboy, Feb 15 '22 03:02

@cosmicBboy I had a look but the integration wasn't straightforward for me at this point. I'd like to leave the PR to someone who already knows the pandera codebase.

clstaudt, Feb 23 '22 18:02