pandera
Add support for dropping invalid rows for pyspark backend
Solves issue - https://github.com/unionai-oss/pandera/issues/1540
Tasks to be completed as per this comment:
- [x] Introduce `PANDERA_FULL_TABLE_VALIDATION` configuration. By default, it should be `None` and should be set depending on the validation backend. It should be `True` for the pandas check backend but `False` for the pyspark backend.
- [x] Modify all of the pyspark builtin checks to have two execution modes:
  - `PANDERA_FULL_TABLE_VALIDATION=False` is the current behavior
  - `PANDERA_FULL_TABLE_VALIDATION=True` should return a boolean column indicating which element in the column passed the check.
- [ ] Make any additional changes to the pyspark backend to support a boolean column as the output of a check (we can take inspiration from the polars check backend on how to do this).
- [ ] Add support for the `drop_invalid_rows` option
- [ ] Add info logging at validation time to let the user know if full table validation is happening or not
- [ ] Add documentation discussing the performance implications of turning on full table validation.
- [ ] Add unit test cases in the testing pipeline to support the `PANDERA_FULL_TABLE_VALIDATION` config and `drop_invalid_rows` option
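
To make the task list concrete, here is a rough, backend-agnostic sketch in plain Python (no Spark needed). It is not the code in this PR: `BACKEND_DEFAULTS`, `resolve_full_table_validation`, `run_check`, and `drop_invalid_rows` are hypothetical names, plain lists stand in for DataFrame columns, and the "current behavior" is assumed to be a single pass/fail result.

```python
from typing import Any, Callable, List, Optional, Union

# Hypothetical per-backend defaults, taken from the task description:
# True for the pandas backend, False for the pyspark backend.
BACKEND_DEFAULTS = {"pandas": True, "pyspark": False}


def resolve_full_table_validation(configured: Optional[bool], backend: str) -> bool:
    """Use the explicit setting if one was given, else the backend default."""
    if configured is not None:
        return configured
    return BACKEND_DEFAULTS[backend]


def run_check(
    values: List[Any],
    predicate: Callable[[Any], bool],
    full_table: bool,
) -> Union[bool, List[bool]]:
    """Run a check in one of the two execution modes.

    full_table=True  -> one boolean per element (the "boolean column").
    full_table=False -> a single pass/fail result (assumed current behavior).
    """
    mask = [bool(predicate(v)) for v in values]
    return mask if full_table else all(mask)


def drop_invalid_rows(rows: List[Any], mask: List[bool]) -> List[Any]:
    """Keep only the rows whose mask entry is True."""
    return [row for row, ok in zip(rows, mask) if ok]


full_table = resolve_full_table_validation(None, "pyspark")  # backend default
mask = run_check([1, -2, 3], lambda v: v > 0, full_table=True)
valid_rows = drop_invalid_rows([1, -2, 3], mask)
```

The sketch also shows why `drop_invalid_rows` depends on the full-table mode: without a per-element mask there is nothing to filter on.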
PS: New to the repo 😄 , so please call out if I am not following repo guidelines or code style. Appreciate your help!
thanks @nk4456542, this is awesome!
Looks like some of the tests are broken
=========================== short test summary info ============================
FAILED tests/core/test_pandas_config.py::TestPandasDataFrameConfig::test_disable_validation - AssertionError: assert {'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>, 'cache_dataframe': False, 'keep_cached_dataframe': False, 'full_table_validation': None} == {'cache_dataframe': False, 'keep_cached_dataframe': False, 'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>}
see https://github.com/unionai-oss/pandera/actions/runs/9054025532/job/24909236981?pr=1639.
You can run these tests locally with `pytest tests/pyspark`.
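
For context on the assertion above: the actual config dict now carries the extra `full_table_validation` key, while the expected dict in the test does not. A minimal sketch of that mismatch, using a hypothetical `ExampleConfig` dataclass as a stand-in for the real config object (with `validation_depth` left out for brevity):

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExampleConfig:
    # Hypothetical stand-in for the config object, not pandera's actual class.
    validation_enabled: bool = True
    cache_dataframe: bool = False
    keep_cached_dataframe: bool = False
    full_table_validation: Optional[bool] = None  # the newly added field


# What the existing test expects (no full_table_validation key).
old_expected = {
    "validation_enabled": False,
    "cache_dataframe": False,
    "keep_cached_dataframe": False,
}

actual = asdict(ExampleConfig(validation_enabled=False))

# The only difference is the new key, so the dict equality check fails.
missing_keys = set(actual) - set(old_expected)

# One possible fix: add the new key, with its default, to the expected dict.
new_expected = {**old_expected, "full_table_validation": None}
```

Under this reading, updating the expected dictionaries in the config tests should be enough to make the comparison pass again.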
Was a bit busy for the past two weeks, will continue working on this from this week
hi @nk4456542 friendly ping on progress here, let me know if you need any help!
@cosmicBboy - Apologies for dropping this, will pick this up this week. I work at a startup 😅, so I had my work cut out for one of the feature launches.
I will contact you in the comments if I need help on this PR.
thanks for the update @nk4456542, totally understand what it's like to be at a startup 👍
I have been caught up in work again 😞. But I would really like to work on this 😬, and will update here when I can pick this up again.
Apologies again for not being clear on the timelines
Codecov Report
Attention: Patch coverage is 5.63380% with 67 lines in your changes missing coverage. Please review.
Project coverage is 74.00%. Comparing base (812b2a8) to head (86a34a4). Report is 141 commits behind head on main.
:exclamation: There is a different number of reports uploaded between BASE (812b2a8) and HEAD (86a34a4).
HEAD has 7 fewer uploads than BASE.

| Flag | BASE (812b2a8) | HEAD (86a34a4) |
|------|----------------|----------------|
|      | 48             | 41             |
Additional details and impacted files
@@ Coverage Diff @@
## main #1639 +/- ##
===========================================
- Coverage 94.28% 74.00% -20.28%
===========================================
Files 91 120 +29
Lines 7013 9190 +2177
===========================================
+ Hits 6612 6801 +189
- Misses 401 2389 +1988