pandera icon indicating copy to clipboard operation
pandera copied to clipboard

Feature/922 add other ways to report unique errors as an argument

Open ng-henry opened this issue 3 years ago • 8 comments

This implements feature request #902 by allowing the unique parameter to take in more arguments, like "first", "last", "all". These control how unique values are reported (all unique values? all except for the first occurrence? all except for the last occurrence?).

ng-henry avatar Aug 12 '22 22:08 ng-henry

Codecov Report

Base: 96.55% // Head: 96.56% // Increases project coverage by +0.01% :tada:

Coverage data is based on head (d115973) compared to base (d6538a5). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #914      +/-   ##
==========================================
+ Coverage   96.55%   96.56%   +0.01%     
==========================================
  Files          43       43              
  Lines        4179     4198      +19     
==========================================
+ Hits         4035     4054      +19     
  Misses        144      144              
Impacted Files Coverage Δ
pandera/strategies.py 98.25% <ø> (ø)
pandera/dtypes.py 94.64% <100.00%> (+0.04%) :arrow_up:
pandera/model_components.py 95.61% <100.00%> (+0.07%) :arrow_up:
pandera/schema_components.py 99.07% <100.00%> (+<0.01%) :arrow_up:
pandera/schemas.py 99.26% <100.00%> (+0.01%) :arrow_up:
pandera/decorators.py 98.80% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Aug 12 '22 23:08 codecov[bot]

thanks @ng-henry ! this will be the last PR before the 0.12.0 release.

Please take a look at https://github.com/unionai-oss/pandera/pull/913, which is a WIP PR for overhauling pandera internals to be more extensible.

I think the other features you'd like to support (like partitioning coercible and uncoercible cells) will be easier to reason about and implement with the new internals https://github.com/unionai-oss/pandera/issues/906, although I think there's still a bit of design work to figure out how to implement it.

cosmicBboy avatar Aug 15 '22 18:08 cosmicBboy

@ng-henry after considering this PR a little more, I think conflating the unique property with the keep argument is confusing. In pandera, options should intuitively map to properties that are validated at runtime.

I think it would make more sense to either:

  1. introduce a new kwarg like report_duplicates or something like this, where options are "all", "exclude_first", "exclude_last" to make it super clear what's happening.
  2. only have series.duplicated(keep=False) to report all duplicates... I might be missing something, but I think there's a strong use case for identifying all duplicates and a pretty weak use case for only reporting all except the first/last duplicate value.

We can merge this as part of the 0.13.0 release, before the major 0.14.0 overhaul of the backend.

What do you think?

cosmicBboy avatar Aug 16 '22 16:08 cosmicBboy

Yep I think that's a good idea. BTW we are not setting keep = False when using unique = True on the Column constructor. We are only doing that in the DataFrameSchema constructor. See issue #902 for reference.

ng-henry avatar Aug 17 '22 04:08 ng-henry

I'll just continue work on the report_duplicates argument on this branch, and make a new branch to fix the duplicate issue when using the Column constructor.

ng-henry avatar Aug 17 '22 04:08 ng-henry

@cosmicBboy codecov says there's zero coverage on pandera/mypy.py but not quite sure why that is. I don't think I changed anything on the mypy side of things.

ng-henry avatar Aug 20 '22 15:08 ng-henry

codecov says there's zero coverage on pandera/mypy.py but not quite sure why that is. I don't think I changed anything on the mypy side of things.

That's not a problem.

introduce a new kwarg like report_duplicates or something like this, where options are "all", "exclude_first", "exclude_last" to make it super clear what's happening.

Can we stick to this naming? I think unique_keep_setting is too pandas-specific (in a way that's confusing). report_duplicates with the "all", "exclude_first", "exclude_last" options are a lot clearer in what's happening: pandera will report "all" duplicates, "exclude the first" or "exclude the last" duplicate.

cosmicBboy avatar Aug 22 '22 20:08 cosmicBboy

@cosmicBboy implemented but codecov is lower. Do you want to add more tests?

ng-henry avatar Aug 26 '22 21:08 ng-henry

awesome, thanks @ng-henry !

cosmicBboy avatar Sep 22 '22 03:09 cosmicBboy