pandera
Improve dataframe-wide checks and their error reporting (failure_cases)
Is your feature request related to a problem? Please describe.
Whenever a wide check fails, the generated SchemaErrors' failure_cases contains one record per column of the offending row.
This behavior is not incorrect per se, as there really is no way to deduce which columns are implicated without tracking how the check_obj is being accessed inside the check_fn.
However, such behavior is not helpful. At a glance, it suggests that much more data fails to validate than is actually the case. Moreover, from what I've seen online while searching for a solution, quite a few people have been confused by this behavior (and probably don't bother to say so once they finally realize what's going on).
Basically, one would probably expect only one record per row that fails the wide check (N.B.: this is not even the case for element-wise wide checks), and my understanding is that doing so would require substantially changing the format of failure_cases. In my opinion, completely separating out the failure_cases that have a DataFrameSchema context would not be an adequate solution, as it would needlessly complicate the SchemaErrors structure or require a rework of the way errors are reported.
Describe the solution you'd like
- Provide a way to specify which columns are involved in a specific dataframe-wide check (most probably at DataFrameSchema level).
- I would probably rename "checks" to "wide_checks", since it's a widely accepted term
- Allow "checks"/"wide_checks" to be a dict[Check, Iterable[str]], mapping checks to the involved columns
- Provide the same functionality to the dataframe check decorator
- In the resulting failure_cases, suppress the records pertaining to columns which are not marked as involved in the specific check. I don't know how hard this would be, since my exploration of the relevant code left me daunted.
- For completeness' sake, there should probably still be a way to get the current failure_case report format even without specified columns
- Clearly specify both wide check reporting behaviors (with and without specifying involved columns at DataFrameSchema level)
If this feature request is deemed to be of too little value, at the very least please explicitly explain the current behavior in the docs.
Describe alternatives you've considered
- Manually tracking which columns are involved in which checks
- Manually manipulating the failure_cases report to discard irrelevant data
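The two alternatives above can be combined into a small post-processing step. This is only a sketch of the workaround, not anything pandera provides: `INVOLVED_COLUMNS` and `drop_uninvolved` are hypothetical names, the sample report is hand-built, and I'm assuming that the report has `schema_context`, `check` and `column` fields and that the `check` field holds the name given to the check.

```python
import pandas as pd

# Hypothetical, hand-maintained mapping: check name -> columns it actually reads.
INVOLVED_COLUMNS = {"start_before_end": {"start", "end"}}


def drop_uninvolved(failure_cases, involved=INVOLVED_COLUMNS):
    """Drop wide-check records whose column is not listed for that check.

    Records from column-level checks, and from wide checks that have no
    entry in `involved`, are kept unchanged.
    """
    keep = []
    for context, check, column in zip(
        failure_cases["schema_context"],
        failure_cases["check"],
        failure_cases["column"],
    ):
        listed = involved.get(check)
        is_wide = context == "DataFrameSchema"
        keep.append(not is_wide or listed is None or column in listed)
    mask = pd.Series(keep, index=failure_cases.index, dtype=bool)
    return failure_cases.loc[mask]


# Hand-built stand-in for a failure_cases report (illustrative values only).
report = pd.DataFrame(
    {
        "schema_context": ["DataFrameSchema"] * 3 + ["Column"],
        "column": ["start", "end", "priority", "priority"],
        "check": ["start_before_end"] * 3 + ["greater_than(0)"],
        "failure_case": [5, 3, 1, -2],
        "index": [1, 1, 1, 4],
    }
)

filtered = drop_uninvolved(report)
print(filtered)
```

This still leaves one record per involved column rather than one per failing row, and the mapping has to be kept in sync with the checks by hand, which is exactly the bookkeeping I'd like the library to take over.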
Additional context
Yes, I'm lazy.
Agreed, error messages could be improved. @splatpope, see my issue: https://github.com/unionai-oss/pandera/issues/1276
@cosmicBboy could it be worth jumping on a quick call at some point to outline some enhancement PRs around errors?
I am running into this. I agree that the reporting is really not helpful in the case of a wide check. I'd rely on my check to provide that information back. Additionally, scoping could apply to wide checks, which may request only a subset of columns. This would help remove the assumption that the check accesses every column in the dataframe for each row, and further improve reporting.