pandera
Improved error message for `SchemaErrors`
Is your feature request related to a problem? Please describe.
Printed `SchemaErrors` do not provide much value on their own (see below). When reading such a message in a debugger, REPL, or monitoring tool, it is hard to understand what exactly is wrong. In this case there is one failing check, and it fails because the `report_date` column is not in the schema. That takes some practice to spot, and even then you see a contradictory `column_in_schema`, which actually means `SchemaErrorReason.COLUMN_NOT_IN_SCHEMA`, i.e. the column is *not* present in the schema.
```
Schema None: A total of 1 schema errors were found.

Error Counts
------------
- SchemaErrorReason.COLUMN_NOT_IN_SCHEMA: 1

Schema Error Summary
--------------------
                                                 failure_cases  n_failure_cases
schema_context  column check
DataFrameSchema <NA>   column_in_schema          [report_date]                1

Usage Tip
---------

Directly inspect all errors by catching the exception:

    try:
        schema.validate(dataframe, lazy=True)
    except SchemaErrors as err:
        err.failure_cases  # dataframe of schema errors
        err.data  # invalid dataframe
```
For context, here is what a printed `SchemaError` (singular) looks like:
```
<Schema Column(name=amount_currency, type=DataType(str))> failed element-wise validator 0:
<Check isin: isin(['GBP'])>
failure cases:
     index failure_case
0        0          TRY
1        1          TRY
2        2          TRY
3        3          TRY
4        4          TRY
..     ...          ...
154    154          TRY
155    155          TRY
156    156          TRY
157    157          TRY
158    158          TRY

[159 rows x 2 columns]
```
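The failing indices that pandera reports above can also be recovered with plain pandas, since `Check.isin` is an element-wise predicate. A minimal sketch (the column name and data here are illustrative, not the real dataset):

```python
import pandas as pd

# illustrative data: every currency should be "GBP"
df = pd.DataFrame({"amount_currency": ["GBP", "TRY", "GBP", "TRY"]})

# the same element-wise predicate that Check.isin(["GBP"]) applies
passed = df["amount_currency"].isin(["GBP"])

# rows that would show up as failure cases, keyed by their index
failure_cases = df.loc[~passed, "amount_currency"]
print(failure_cases)
```

This is exactly the information the printed `SchemaError` conveys: which values failed and at which indices.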
Here we are able to see an example of what went wrong, and because it is an element-wise check, we can see the failing indices.
Describe the solution you'd like

I'd like to open a PR to clean up this messaging. Some things I would like to change:

- Remove the Usage Tip
- Provide a snippet for each of the `SchemaError`s (possibly behind a `verbose=True` error message)
- Explain the nature of each error in plain text, e.g. "`report_date` is not specified in the `FooBar` schema"

This would only change the error messaging and not add new error classes (as that would break existing interfaces).
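The plain-text idea could be sketched as a template lookup keyed on the error reason. The reason names below mirror pandera's `SchemaErrorReason` enum members, but the `describe_error` function and its templates are hypothetical, not existing pandera API:

```python
# hypothetical mapping from error reason to a human-readable template
TEMPLATES = {
    "COLUMN_NOT_IN_SCHEMA": "{column} is not specified in the {schema} schema",
    "WRONG_DATATYPE": "{column} in the {schema} schema has the wrong datatype",
}


def describe_error(reason: str, column: str, schema: str) -> str:
    # fall back to the raw reason name if no template is registered
    template = TEMPLATES.get(reason, "{column} failed {reason} in schema {schema}")
    return template.format(column=column, schema=schema, reason=reason)


print(describe_error("COLUMN_NOT_IN_SCHEMA", "report_date", "FooBar"))
# -> report_date is not specified in the FooBar schema
```

Keeping the templates in one table would make the wording easy to review and extend without touching the error classes themselves.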
Describe alternatives you've considered

Creating some custom error messages in my project, but I think many others could benefit from updated error messaging.
Additional context

- We need to prevent loading entire datasets into memory when producing these error messages. For large out-of-memory dataframes this would be a real performance hit, possibly catastrophic!
- Confine ourselves (at first) to the existing interfaces, i.e. not add new requirements to the `SchemaError`/`SchemaErrors` classes in terms of what they expect to be passed.
@cosmicBboy happy to formulate some mockups if this sounds like something we should devote time to?
@cosmicBboy still keen to improve these errors, shall I just open a PR?
+1, improved readability of errors would be a massive quality of life improvement.
@cosmicBboy I am going to start working on a PR for this
+1, the current error message logic is hard to decipher. The usage tip advises that we should produce our own messages from the gathered information, but I don't believe there is any documentation on `SchemaError`. Also, handling the error report separately from the check object seems counter to the design philosophy of pandera. I would like to see more detailed, context-aware error reports. For example, with this custom check:
```python
@extensions.register_check_method(statistics=["col_a", "col_b"])
def col_a_less_than_col_b(df, *, col_a: str, col_b: str):
    return df[col_a] < df[col_b]
```
The default error message is:

```
E   pandera.errors.SchemaError: <Schema DataFrameSchema(
E       columns={
E           'p_idx': <Schema Column(name=p_idx, type=DataType(Int64))>
            ...
E           'pb_right': <Schema Column(name=pb_right, type=DataType(Float64))>
E       },
E       checks=[
E           <Check col_a_less_than_col_b>
E       ],
E       coerce=False,
E       dtype=None,
E       index=<Schema Index(name=idx, type=DataType(int64))>,
E       strict=True,
E       name=None,
E       ordered=False,
E       unique_column_names=False,
E       metadata=None,
E       add_missing_columns=False
E   )> failed element-wise validator 0:
E   <Check col_a_less_than_col_b>
E   failure cases:
E        column  index  failure_case
E   0     p_idx      0             0
E   1     p_idx      1             1
E   2     p_idx      2             2
E   3     p_idx      3             3
E   4  time_idx      0          1507
E   ..      ...    ...           ...
E   63  pb_left      3  10278.000332
E   64 pb_right      0        9932.0
E   65 pb_right      1   2018.912258
E   66 pb_right      2   9376.814997
E   67 pb_right      3  11755.987964
E
E   [68 rows x 3 columns]
```
This gives me no information on how the check failed. Ideally I would return a wide frame restricted to the rows and columns that failed the check. In my example, the columns checked by the registered check were 'whh_width' and 'pb_width', and I am expecting 'whh_width' to be less than 'pb_width'. On failure, I need the error report to contain the following at a bare minimum to understand the problem:
```
     whh_width     pb_width  col_a_less_than_col_b
0  9244.513493  9244.513493                  False
1   282.912258   282.912258                  False
2  2921.814997  2921.814997                  False
3  1477.987631  1477.987631                  False
```
Which can be produced with:

```python
fail_cases = schema_error.failure_cases.pivot(
    index="index", columns="column", values="failure_case"
)[list(schema_error.check.statistics.values())]
check_result = schema_error.check_output
check_result.name = schema_error.check.name
report = pd.concat([fail_cases, check_result], axis=1)
```

within the context of `SchemaErrorHandler`. No idea how scalable that is though.
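To illustrate the pivot mechanics with self-contained data (the column names and values below are made up, standing in for `schema_error.failure_cases` and `schema_error.check_output`):

```python
import pandas as pd

# stand-in for schema_error.failure_cases (long format: one row per failing cell)
failure_cases = pd.DataFrame({
    "column": ["whh_width", "whh_width", "pb_width", "pb_width"],
    "index": [0, 1, 0, 1],
    "failure_case": [9244.5, 282.9, 9244.5, 282.9],
})

# pivot long failure cases into a wide frame, one row per failing index
wide = failure_cases.pivot(index="index", columns="column", values="failure_case")

# stand-in for schema_error.check_output restricted to the failing rows
check_result = pd.Series([False, False], index=[0, 1], name="col_a_less_than_col_b")

report = pd.concat([wide, check_result], axis=1)
print(report)
```

The wide frame puts each checked column side by side with the boolean check result, which is exactly the shape requested above.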
How does this style of error messaging look (inspired by polars)? Planning on limiting output to 10 rows (it could perhaps be made configurable).
```
Schema MySchema: A total of 2 schema errors were found.
âââââââââŦââââââââââŦâââââââââââââââââââââââŦâââââââââââââââŦâââââââââââââââââŦâââââââââââââââ
â index â column  â check                â failure_case â schema_context â check_number â
âââââââââĒââââââââââĒâââââââââââââââââââââââĒâââââââââââââââĒâââââââââââââââââĒâââââââââââââââĄ
â 0     â flavour â isin(['coke', '7up', â pepsi        â Column         â 0            â
â       â         â 'mountain_dew'])     â              â                â              â
â 1     â flavour â isin(['coke', '7up', â fanta        â Column         â 0            â
â       â         â 'mountain_dew'])     â              â                â              â
âââââââââ´ââââââââââ´âââââââââââââââââââââââ´âââââââââââââââ´âââââââââââââââââ´âââââââââââââââ
```
To be confirmed:

- How this works with custom or wide checks (such as when a column is not in the schema, so `failure_case` is NA)
- How this looks with verbose column names or check names
That's exactly what I had in mind. I think for custom checks you could mask the `check_obj` with the boolean series returned by the check to present the failure cases. In the case of a dataframe-level custom check, I think a `False` should be enough, relying instead on a descriptive check name that the bool answers.
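That masking idea can be sketched in plain pandas (the dataframe and check below are illustrative, not pandera internals):

```python
import pandas as pd

# illustrative validated object and the element-wise result a custom check returns
check_obj = pd.DataFrame({"col_a": [1, 5, 3], "col_b": [2, 4, 6]})
check_result = check_obj["col_a"] < check_obj["col_b"]  # boolean series from the check

# present only the rows that failed the check, with all their columns intact
failing_rows = check_obj[~check_result]
print(failing_rows)
```

Because the mask selects whole rows, the report keeps every column the check touched, which addresses the "wide frame of failing rows" request above.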
The above PR is merged, @jonathanstathakis @austinzhang1018, if you want to take a look! It's not what was specified in the comments, but it follows a more readable "report" format, building on what was implemented for pyspark schemas. A future contribution could add a table or something, but it's best to wait for feedback to see what users think of the new format.
Cutting a 0.18.1 release this weekend to capture these changes.