Add support for dropping invalid rows for pyspark backend

Open zaheerabbas21 opened this issue 1 year ago • 8 comments

Solves issue - https://github.com/unionai-oss/pandera/issues/1540

Tasks to be completed as per this comment:

  • [x] Introduce PANDERA_FULL_TABLE_VALIDATION configuration. By default, it should be None and should be set depending on the validation backend. It should be True for the pandas check backend but False for the pyspark backend.
  • [x] Modify all of the pyspark builtin checks to have two execution modes:
    • PANDERA_FULL_TABLE_VALIDATION=False is the current behavior
    • PANDERA_FULL_TABLE_VALIDATION=True should return a boolean column indicating which elements in the column passed the check.
  • [ ] Make any additional changes to the pyspark backend to support a boolean column as the output of a check (we can take inspiration from the polars check backend on how to do this).
  • [ ] Add support for the drop_invalid_rows option
  • [ ] Add info logging at validation time to let the user know if full table validation is happening or not
  • [ ] Add documentation discussing the performance implications of turning on full table validation.
  • [ ] Add unit test cases in the testing pipeline to support the PANDERA_FULL_TABLE_VALIDATION config and drop_invalid_rows option
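
The behavior described in the task list can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not pandera's actual implementation: `resolve_full_table_validation`, `greater_than`, and `drop_invalid_rows` are hypothetical helpers, the accepted truthy spellings are an assumption, and plain Python lists stand in for Spark columns. Only the `PANDERA_FULL_TABLE_VALIDATION` name, its backend defaults, and the two execution modes come from the list above.

```python
import os
from typing import Any, Dict, List, Mapping, Optional, Union

# Defaults from the task list: on for the pandas backend, off for pyspark.
BACKEND_DEFAULTS = {"pandas": True, "pyspark": False}


def resolve_full_table_validation(
    backend: str, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Hypothetical helper: an unset variable falls back to the backend default."""
    env = os.environ if env is None else env
    raw = env.get("PANDERA_FULL_TABLE_VALIDATION")
    if raw is None:
        return BACKEND_DEFAULTS[backend]
    # Assumption: these spellings count as "enabled".
    return raw.strip().lower() in {"1", "true", "yes"}


def greater_than(
    values: List[Optional[float]], min_value: float, full_table: bool
) -> Union[bool, List[bool]]:
    """Toy check with two execution modes (lists stand in for Spark columns)."""
    mask = [v is not None and v > min_value for v in values]
    if full_table:
        return mask  # one boolean per element, like a boolean column
    return all(mask)  # a single pass/fail result for the whole column


def drop_invalid_rows(
    rows: List[Dict[str, Any]], mask: List[bool]
) -> List[Dict[str, Any]]:
    """Keep only the rows whose check result is True."""
    return [row for row, ok in zip(rows, mask) if ok]


rows = [{"price": 5}, {"price": -1}, {"price": 12}]
mask = greater_than([r["price"] for r in rows], 0, full_table=True)
valid_rows = drop_invalid_rows(rows, mask)
```

The per-element mask is what makes `drop_invalid_rows` possible: a single pass/fail result says that something failed but not which rows to remove.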

PS: New to the repo 😄 , so please call out if I am not following repo guidelines or code style. Appreciate your help!

zaheerabbas21 avatar May 12 '24 20:05 zaheerabbas21

thanks @nk4456542, this is awesome!

Looks like some of the tests are broken

=========================== short test summary info ============================
FAILED tests/core/test_pandas_config.py::TestPandasDataFrameConfig::test_disable_validation - AssertionError: assert {'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>, 'cache_dataframe': False, 'keep_cached_dataframe': False, 'full_table_validation': None} == {'cache_dataframe': False, 'keep_cached_dataframe': False, 'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>}

see https://github.com/unionai-oss/pandera/actions/runs/9054025532/job/24909236981?pr=1639.

You can run these tests locally with pytest tests/pyspark.
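
For context, the assertion above fails because the actual config dict now carries a `full_table_validation` key that the expected dict lacks. A minimal sketch of that effect, using hypothetical dataclasses rather than pandera's real config class:

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ConfigBefore:
    """Hypothetical stand-in for the config before the new field was added."""
    validation_enabled: bool = False
    cache_dataframe: bool = False


@dataclass
class ConfigAfter(ConfigBefore):
    """Same config with the new optional field, defaulting to None."""
    full_table_validation: Optional[bool] = None


expected = {"validation_enabled": False, "cache_dataframe": False}
matches_before = asdict(ConfigBefore()) == expected  # old config matches
matches_after = asdict(ConfigAfter()) == expected  # extra key breaks equality

# Updating the expected dict with the new key restores the match.
expected_updated = {**expected, "full_table_validation": None}
matches_updated = asdict(ConfigAfter()) == expected_updated
```

Under this reading, the likely fix is to add the new key to the expected dicts in the affected tests.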

cosmicBboy avatar May 14 '24 01:05 cosmicBboy

I was a bit busy for the past two weeks; I will continue working on this starting this week.

zaheerabbas21 avatar Jun 02 '24 17:06 zaheerabbas21

hi @nk4456542 friendly ping on progress here, let me know if you need any help!

cosmicBboy avatar Jun 28 '24 16:06 cosmicBboy

@cosmicBboy - Apologies for dropping this, will pick this up this week. I work at a startup 😅, so I had my work cut out for one of the feature launches.

I will contact you in the comments if I need help on this PR.

zaheerabbas21 avatar Jun 29 '24 12:06 zaheerabbas21

thanks for the update @nk4456542, totally understand what it's like to be at a startup 👍

cosmicBboy avatar Jun 30 '24 17:06 cosmicBboy

I have been caught up in work again 😞. I would really like to work on this 😬 and will post an update here when I can pick it up again.

Apologies again for not being clear on the timelines

zaheerabbas21 avatar Jul 09 '24 08:07 zaheerabbas21

Codecov Report

Attention: Patch coverage is 5.63380% with 67 lines in your changes missing coverage. Please review.

Project coverage is 74.00%. Comparing base (812b2a8) to head (86a34a4). Report is 141 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pandera/backends/pyspark/builtin_checks.py | 0.00% | 57 Missing :warning: |
| pandera/backends/pyspark/utils.py | 0.00% | 9 Missing :warning: |
| pandera/config.py | 80.00% | 1 Missing :warning: |

:exclamation: There is a different number of reports uploaded between BASE (812b2a8) and HEAD (86a34a4).

HEAD has 7 uploads less than BASE
| Flag | BASE (812b2a8) | HEAD (86a34a4) |
| --- | --- | --- |
| | 48 | 41 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1639       +/-   ##
===========================================
- Coverage   94.28%   74.00%   -20.28%     
===========================================
  Files          91      120       +29     
  Lines        7013     9190     +2177     
===========================================
+ Hits         6612     6801      +189     
- Misses        401     2389     +1988     


codecov[bot] avatar Sep 01 '24 19:09 codecov[bot]