
Improved error message for `SchemaErrors`

Open kykyi opened this issue 1 year ago • 7 comments

Is your feature request related to a problem? Please describe.
Printed SchemaErrors do not provide much value on their own (see below). When reading such a message in a debugger, REPL, or monitoring tool, it is hard to understand what exactly is wrong. In this case there is 1 failing check, and it fails because report_date is not in the schema. This takes some practice to spot, and even then you see a seemingly contradictory column_in_schema, which actually means SchemaErrorReason.COLUMN_NOT_IN_SCHEMA, i.e. the column is not present in the schema.

Schema None: A total of 1 schema errors were found.

Error Counts
------------
- SchemaErrorReason.COLUMN_NOT_IN_SCHEMA: 1

Schema Error Summary
--------------------
                                         failure_cases  n_failure_cases
schema_context  column check                                           
DataFrameSchema <NA>   column_in_schema  [report_date]                1
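
For reference, a minimal reproduction that triggers a message like the one above (the schema and column values here are illustrative, not taken from the original report):

```python
import pandas as pd
import pandera as pa

# a strict schema that does not declare `report_date`
schema = pa.DataFrameSchema({"amount": pa.Column(float)}, strict=True)

df = pd.DataFrame({"amount": [1.0], "report_date": ["2023-07-25"]})

try:
    schema.validate(df, lazy=True)  # lazy=True collects all errors into SchemaErrors
except pa.errors.SchemaErrors as err:
    print(err)  # prints a summary like the one shown above
```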

Usage Tip
---------

Directly inspect all errors by catching the exception:

```
try:
    schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
    err.failure_cases  # dataframe of schema errors
    err.data  # invalid dataframe
```

For context, here is what a printed SchemaError (singular) looks like:

<Schema Column(name=amount_currency, type=DataType(str))> failed element-wise validator 0:
<Check isin: isin(['GBP'])>
failure cases:
     index failure_case
0        0          TRY
1        1          TRY
2        2          TRY
3        3          TRY
4        4          TRY
..     ...          ...
154    154          TRY
155    155          TRY
156    156          TRY
157    157          TRY
158    158          TRY

[159 rows x 2 columns]

Here we are able to see an example of what went wrong, and because it is an element-wise check, we can see the failing indices.
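
(For completeness, the singular SchemaError raised by non-lazy validation can be inspected the same way; this is just a sketch, assuming some `schema` and dataframe `df` to validate:)

```python
try:
    schema.validate(df)  # default lazy=False raises on the first error
except pa.errors.SchemaError as err:
    print(err.failure_cases)  # dataframe of failure cases for that one error
```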

Describe the solution you'd like
I'd like to open a PR to clean up this messaging. Some things I would like to change:

  1. Remove the Usage Tip
  2. Provide a snippet for each of the SchemaErrors (possibly behind a verbose=True error message)
  3. Explain the nature of each error in plain text -> "report_date is not specified in the FooBar schema", for example (see the sketch below)

This would only change the error messaging and not add new error classes (as that would break existing interfaces)
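
As a purely hypothetical sketch of item 3 (the helper name and message wording are mine, not existing pandera API), each record of `failure_cases` could be turned into a plain-text sentence:

```python
def describe_failure(row, schema_name: str) -> str:
    # `row` is assumed to be one record of SchemaErrors.failure_cases
    if row["check"] == "column_in_schema":
        return f"{row['failure_case']} is not specified in the {schema_name} schema"
    return (
        f"column {row['column']!r} failed check {row['check']!r} "
        f"with failure case {row['failure_case']!r}"
    )
```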

Describe alternatives you've considered
Creating some custom error messages in my project, but I think many others could benefit from updated error messaging.

Additional context

  • We need to avoid loading entire datasets into memory when producing these error messages. For large, out-of-memory dataframes this would be a real performance hit, possibly a catastrophic one.
  • Confine ourselves (at first) to the existing interfaces -> do not add new requirements to the SchemaError/SchemaErrors classes in terms of what they expect to be passed.

kykyi avatar Jul 25 '23 16:07 kykyi

@cosmicBboy happy to formulate some mockups if this sounds like something we should devote time to?

kykyi avatar Jul 25 '23 16:07 kykyi

@cosmicBboy still keen to improve these errors, shall I just open a PR?

kykyi avatar Oct 09 '23 09:10 kykyi

+1, improved readability of errors would be a massive quality of life improvement.

austinzhang1018 avatar Jan 15 '24 07:01 austinzhang1018

@cosmicBboy I am going to start working on a PR for this 🛩️

kykyi avatar Jan 18 '24 02:01 kykyi

+1, the current error message logic is hard to decipher. The usage tip advises that we should produce our own messages from the gathered information, but I don't believe there is any documentation on SchemaError. Also, handling the error report separately from the check object seems counter to the design philosophy of Pandera. I would like to see more detailed, context-aware error reports. For example, with this custom check:

import pandera.extensions as extensions

@extensions.register_check_method(statistics=['col_a', 'col_b'])
def col_a_less_than_col_b(df, *, col_a: str, col_b: str):
    # dataframe-level check: every value in col_a must be less than col_b
    return df[col_a] < df[col_b]
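
(For context, a check registered this way would typically be attached to the schema as a dataframe-level check, roughly as below; the column names mirror my example further down and are only illustrative.)

```python
import pandera as pa

# registered check methods become available on the Check namespace
schema = pa.DataFrameSchema(
    columns={
        "whh_width": pa.Column(float),
        "pb_width": pa.Column(float),
    },
    checks=pa.Check.col_a_less_than_col_b(col_a="whh_width", col_b="pb_width"),
)
```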

The default error message is:

E           pandera.errors.SchemaError: <Schema DataFrameSchema(
E               columns={
E                   'p_idx': <Schema Column(name=p_idx, type=DataType(Int64))>
                      ...
E                   'pb_right': <Schema Column(name=pb_right, type=DataType(Float64))>
E               },
E               checks=[
E                   <Check col_a_less_than_col_b>
E               ],
E               coerce=False,
E               dtype=None,
E               index=<Schema Index(name=idx, type=DataType(int64))>,
E               strict=True,
E               name=None,
E               ordered=False,
E               unique_column_names=False,
E               metadata=None, 
E               add_missing_columns=False
E           )> failed element-wise validator 0:
E           <Check col_a_less_than_col_b>
E           failure cases:
E                 column  index  failure_case
E           0      p_idx      0             0
E           1      p_idx      1             1
E           2      p_idx      2             2
E           3      p_idx      3             3
E           4   time_idx      0          1507
E           ..       ...    ...           ...
E           63   pb_left      3  10278.000332
E           64  pb_right      0        9932.0
E           65  pb_right      1   2018.912258
E           66  pb_right      2   9376.814997
E           67  pb_right      3  11755.987964
E           
E           [68 rows x 3 columns]

Which gives me no information on how the check failed. Ideally I would return a wide frame restricted to the rows and columns that failed the check. In my example, the columns checked by the registered check were 'whh_width' and 'pb_width', and I am expecting 'whh_width' to be less than 'pb_width'. On failure, I need the error report to contain the following at a bare minimum to understand the problem:

   whh_width    pb_width     col_a_less_than_col_b
0  9244.513493  9244.513493  False
1   282.912258   282.912258  False
2  2921.814997  2921.814997  False
3  1477.987631  1477.987631  False

Which can be produced with:

import pandas as pd

# pivot the long-format failure cases into a wide frame of the checked columns
fail_cases = schema_error.failure_cases.pivot(
    index='index', columns='column', values='failure_case'
)[list(schema_error.check.statistics.values())]

check_result = schema_error.check_output  # boolean result of the check
check_result.name = schema_error.check.name

report = pd.concat([fail_cases, check_result], axis=1)

within the context of SchemaErrorHandler. No idea how scalable that is though.

jonathanstathakis avatar Jan 18 '24 03:01 jonathanstathakis

How does this style of error messaging look (inspired by polars)? I'm planning on limiting output to 10 rows (this could be made configurable).

Schema MySchema: A total of 2 schema errors were found.
┌───────┬─────────┬──────────────────────┬──────────────┬────────────────┬──────────────┐
│ index ┆  column ┆        check         ┆ failure_case ┆ schema_context ┆ check_number │
╞═══════╪═════════╪══════════════════════╪══════════════╪════════════════╪══════════════╡
│ 0     │ flavour │ isin(['coke', '7up', │ pepsi        │ Column         │ 0            │
│       │         │ 'mountain_dew'])     │              │                │              │
│ 1     │ flavour │ isin(['coke', '7up', │ fanta        │ Column         │ 0            │
│       │         │ 'mountain_dew'])     │              │                │              │
└───────┴─────────┴──────────────────────┴──────────────┴────────────────┴──────────────┘
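
As a rough sketch of the row-limiting idea (illustrative only, not the final implementation), assuming some `schema` and dataframe `df` to validate:

```python
import pandera as pa

MAX_ROWS = 10  # could be made configurable

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # failure_cases carries schema_context, column, check, check_number,
    # failure_case and index columns
    print(err.failure_cases.head(MAX_ROWS).to_string(index=False))
```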

To be confirmed:

  1. How this works with custom or wide checks (such as when a column is not in the schema, so failure_case is NA)
  2. How this looks with verbose column names or check names

kykyi avatar Jan 19 '24 04:01 kykyi

> How does this style of error messaging look (inspired by polars)? I'm planning on limiting output to 10 rows (this could be made configurable). […]

That's exactly what I had in mind. I think for custom checks you could mask the check_obj with the boolean series returned by the check to present the failure cases. In the case of a dataframe-level custom check, I think a single False should be enough, relying instead on a descriptive check name that the boolean answers.
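
A minimal sketch of that masking idea (the function and argument names here are mine, just to illustrate):

```python
import pandas as pd

def failing_rows(check_obj: pd.DataFrame, check_result: pd.Series) -> pd.DataFrame:
    # keep only the rows where the custom check returned False
    return check_obj.loc[~check_result]
```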

jonathanstathakis avatar Jan 19 '24 13:01 jonathanstathakis

The above PR is merged @jonathanstathakis @austinzhang1018 if you want to take a look! It's not what was specified in the comments, but follows a more readable 'report' format building on what was implemented for pyspark schemas. A future contribution could add a table or something, but best to wait for some feedback to see what users think of the new format maybe?

kykyi avatar Mar 08 '24 21:03 kykyi

Cutting a 0.18.1 release this weekend to capture these changes

cosmicBboy avatar Mar 09 '24 05:03 cosmicBboy