pandera Add support for dropping invalid rows for pyspark backend

Solves issue - https://github.com/unionai-oss/pandera/issues/1540

Tasks to be completed as per this comment:

[x] Introduce the PANDERA_FULL_TABLE_VALIDATION configuration. It defaults to None and is resolved per validation backend: True for the pandas check backend, False for the pyspark backend.
[x] Modify all of the pyspark builtin checks to have two execution modes:
  - PANDERA_FULL_TABLE_VALIDATION=False is the current behavior.
  - PANDERA_FULL_TABLE_VALIDATION=True should return a boolean column indicating which elements in the column passed the check.
[ ] Make any additional changes to the pyspark backend to support a boolean column as the output of a check (we can take inspiration from the polars check backend on how to do this).
[ ] Add support for the drop_invalid_rows option
[ ] Add info logging at validation time to let the user know if full table validation is happening or not
[ ] Add documentation discussing the performance implications of turning on full table validation.
[ ] Add unit test cases in the testing pipeline to support the PANDERA_FULL_TABLE_VALIDATION config and drop_invalid_rows option
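
To make the task list concrete, here is a minimal, backend-free sketch of the intended behavior. It is not the code in this PR: the function names and the `BACKEND_DEFAULTS` mapping are hypothetical, plain Python lists stand in for Spark columns, and it assumes the current pyspark checks yield a single pass/fail result per column.

```python
from typing import List, Optional, Union

# Hypothetical per-backend defaults, taken from the first task above:
# pandas validates the full table by default, pyspark does not.
BACKEND_DEFAULTS = {"pandas": True, "pyspark": False}


def resolve_full_table_validation(configured: Optional[bool], backend: str) -> bool:
    """Return the effective setting; None means 'use the backend default'."""
    if configured is None:
        return BACKEND_DEFAULTS[backend]
    return configured


def check_greater_than(
    values: List[int], min_value: int, full_table: bool
) -> Union[bool, List[bool]]:
    """Toy check with the two execution modes, on a list instead of a Spark column."""
    if full_table:
        # Full table validation: one boolean per element, True where it passed.
        return [v > min_value for v in values]
    # Assumed current behavior: a single pass/fail result for the whole column.
    return all(v > min_value for v in values)


def drop_invalid_rows(rows: List[dict], mask: List[bool]) -> List[dict]:
    """Keep only the rows whose mask entry is True."""
    return [row for row, ok in zip(rows, mask) if ok]


rows = [{"x": 1}, {"x": 5}, {"x": 3}]
mask = check_greater_than([r["x"] for r in rows], 2, full_table=True)
valid_rows = drop_invalid_rows(rows, mask)
```

The per-element mask is what makes drop_invalid_rows possible: a single pass/fail result says that some row failed, but not which one.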

PS: New to the repo 😄, so please call out if I am not following the repo guidelines or code style. Appreciate your help!

May 12 '24 20:05 zaheerabbas21

thanks @nk4456542, this is awesome!

Looks like some of the tests are broken

=========================== short test summary info ============================
FAILED tests/core/test_pandas_config.py::TestPandasDataFrameConfig::test_disable_validation - AssertionError: assert {'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>, 'cache_dataframe': False, 'keep_cached_dataframe': False, 'full_table_validation': None} == {'cache_dataframe': False, 'keep_cached_dataframe': False, 'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>}

see https://github.com/unionai-oss/pandera/actions/runs/9054025532/job/24909236981?pr=1639.

You can run these tests locally with pytest tests/pyspark.
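
Reading the assertion message, the two dicts differ only in the new `full_table_validation` key. A minimal reproduction with plain dicts, assuming the fix is to add that key to the expected value in the test (the enum is replaced by its string name here for brevity, and the variable names are illustrative):

```python
# Stand-in for the config dict produced with this PR applied
# (values copied from the failure message above).
actual = {
    "validation_enabled": False,
    "validation_depth": "SCHEMA_AND_DATA",
    "cache_dataframe": False,
    "keep_cached_dataframe": False,
    "full_table_validation": None,
}

# Expected dict as the test currently states it: no full_table_validation key.
expected_before = {
    "cache_dataframe": False,
    "keep_cached_dataframe": False,
    "validation_enabled": False,
    "validation_depth": "SCHEMA_AND_DATA",
}

# Keys present in the actual config but absent from the expectation.
missing_keys = set(actual) - set(expected_before)

# Hypothetical fix: extend the expectation with the new key and its default.
expected_after = {**expected_before, "full_table_validation": None}
```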

May 14 '24 01:05 cosmicBboy

I was a bit busy for the past two weeks; I will continue working on this starting this week.

Jun 02 '24 17:06 zaheerabbas21

hi @nk4456542 friendly ping on progress here, let me know if you need any help!

Jun 28 '24 16:06 cosmicBboy

@cosmicBboy - Apologies for dropping this, will pick this up this week. I work at a startup 😅, so I had my work cut out for one of the feature launches.

I will contact you in the comments if I need help on this PR.

Jun 29 '24 12:06 zaheerabbas21

thanks for the update @nk4456542, totally understand what it's like to be at a startup 👍

Jun 30 '24 17:06 cosmicBboy

I have been caught up in work again 😞, but I would really like to work on this 😬. I will update here when I can pick this up again.

Apologies again for not being clear on the timelines.

Jul 09 '24 08:07 zaheerabbas21

Codecov Report

Attention: Patch coverage is 5.63380% with 67 lines in your changes missing coverage. Please review.

Project coverage is 74.00%. Comparing base (812b2a8) to head (86a34a4). Report is 141 commits behind head on main.

Files with missing lines	Patch %	Lines
pandera/backends/pyspark/builtin_checks.py	0.00%	57 Missing :warning:
pandera/backends/pyspark/utils.py	0.00%	9 Missing :warning:
pandera/config.py	80.00%	1 Missing :warning:

:exclamation: There is a different number of reports uploaded between BASE (812b2a8) and HEAD (86a34a4).

HEAD has 7 uploads less than BASE

Flag	BASE (812b2a8)	HEAD (86a34a4)
	48	41

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1639       +/-   ##
===========================================
- Coverage   94.28%   74.00%   -20.28%     
===========================================
  Files          91      120       +29     
  Lines        7013     9190     +2177     
===========================================
+ Hits         6612     6801      +189     
- Misses        401     2389     +1988

Sep 01 '24 19:09 codecov[bot]
