pandera
pandera copied to clipboard
Feature/922 add other ways to report unique errors as an argument
This implements feature request #902 by allowing the unique parameter to take in more arguments, like "first", "last", "all". These control how unique values are reported (all unique values? all except for the first occurrence? all except for the last occurrence?).
Codecov Report
Base: 96.55% // Head: 96.56% // Increases project coverage by +0.01% :tada:
Coverage data is based on head (
d115973) compared to base (d6538a5). Patch coverage: 100.00% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## dev #914 +/- ##
==========================================
+ Coverage 96.55% 96.56% +0.01%
==========================================
Files 43 43
Lines 4179 4198 +19
==========================================
+ Hits 4035 4054 +19
Misses 144 144
| Impacted Files | Coverage Δ | |
|---|---|---|
| pandera/strategies.py | 98.25% <ø> (ø) |
|
| pandera/dtypes.py | 94.64% <100.00%> (+0.04%) |
:arrow_up: |
| pandera/model_components.py | 95.61% <100.00%> (+0.07%) |
:arrow_up: |
| pandera/schema_components.py | 99.07% <100.00%> (+<0.01%) |
:arrow_up: |
| pandera/schemas.py | 99.26% <100.00%> (+0.01%) |
:arrow_up: |
| pandera/decorators.py | 98.80% <0.00%> (ø) |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
thanks @ng-henry ! this will be the last PR before the 0.12.0 release.
Please take a look at https://github.com/unionai-oss/pandera/pull/913, which is a WIP PR for overhauling pandera internals to be more extensible.
I think the other features you'd like to support (like partitioning coercible and uncoercible cells) will be easier to reason about and implement with the new internals https://github.com/unionai-oss/pandera/issues/906, although I think there's still a bit of design work to figure out how to implement it.
@ng-henry after considering this PR a little more, I think conflating the unique property with the keep argument is confusing. In pandera, options should intuitively map to properties that are validated at runtime.
I think it would make more sense to either:
- introduce a new kwarg like
report_duplicatesor something like this, where options are "all", "exclude_first", "exclude_last" to make it super clear what's happening. - only have
series.duplicated(keep=False)to report all duplicates... I might be missing something, but I think there's a strong use case for identifying all duplicates and a pretty weak use case for only reporting all except the first/last duplicate value.
We can merge this as part of the 0.13.0 release, before the major 0.14.0 overhaul of the backend.
What do you think?
Yep I think that's a good idea. BTW we are not setting keep = False when using unique = True on the Column constructor. We are only doing that in the DataFrameSchema constructor. See issue #902 for reference.
I'll just continue work on the report_duplicates argument on this branch, and make a new branch to fix the duplicate issue when using the Column constructor.
@cosmicBboy codecov says there's zero coverage on pandera/mypy.py but not quite sure why that is. I don't think I changed anything on the mypy side of things.
codecov says there's zero coverage on pandera/mypy.py but not quite sure why that is. I don't think I changed anything on the mypy side of things.
That's not a problem.
introduce a new kwarg like report_duplicates or something like this, where options are "all", "exclude_first", "exclude_last" to make it super clear what's happening.
Can we stick to this naming? I think unique_keep_setting is too pandas-specific (in a way that's confusing). report_duplicates with the "all", "exclude_first", "exclude_last" options are a lot clearer in what's happening: pandera will report "all" duplicates, "exclude the first" or "exclude the last" duplicate.
@cosmicBboy implemented but codecov is lower. Do you want to add more tests?
awesome, thanks @ng-henry !