
Enhanced Validation Reporting for PySpark DataFrames in Pandera

Open zaheerabbas-prodigal opened this issue 11 months ago • 13 comments

Is your feature request related to a problem? Please describe. Hello, I am new to PySpark and data engineering in general. I am looking to validate a PySpark DataFrame against a schema, and I came across pandera, which suits my needs best.

Currently, as I understand it, because of the distributed nature of Spark, all pandera validation errors are compiled into pandera's error report. However, that report does not seem to show which records are invalid.

I am looking for a way to run validation of a PySpark DataFrame through pandera and get the indices of invalid rows, so that I can post-process them or dump them into a separate corrupt-records database, or at least drop the invalid rows.

There is a Drop Invalid Rows feature, but it only supports pandas for now if my understanding is correct. When I add drop_invalid_rows it throws a BackendNotFound exception. I also came across this Stack Overflow response, but that too is only supported for pandas, I believe.

Describe the solution you'd like Get the indices of invalid rows after running pandera's validate() function.
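For context, a minimal sketch of what validation with the pyspark backend looks like today, assuming the pandera.pyspark DataFrameModel API as documented (the schema, column names, and checks here are illustrative): the error report summarizes which checks failed, but does not identify the failing rows.

```python
import pyspark.sql.types as T
from pyspark.sql import SparkSession

import pandera.pyspark as pa


class ProductSchema(pa.DataFrameModel):
    # Illustrative schema: ids must be positive, prices non-negative.
    id: T.IntegerType() = pa.Field(gt=0)
    price: T.DoubleType() = pa.Field(ge=0.0)


spark = SparkSession.builder.getOrCreate()
spark_schema = T.StructType(
    [
        T.StructField("id", T.IntegerType()),
        T.StructField("price", T.DoubleType()),
    ]
)
df = spark.createDataFrame([(1, 10.0), (-2, 5.0)], schema=spark_schema)

# The pyspark backend does not raise; it attaches an error report instead.
validated = ProductSchema.validate(df)
print(validated.pandera.errors)  # lists failing checks, not failing rows
```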

Describe alternatives you've considered Drop the rows in a PySpark DataFrame that do not match the defined schema. This is supported for pandas but not for PySpark DataFrames, as per my experimentation.
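For reference, a sketch of the pandas-only behavior referred to here, based on the documented drop_invalid_rows option (which only takes effect with lazy validation):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {"id": pa.Column(int, pa.Check.gt(0))},
    drop_invalid_rows=True,  # only takes effect with lazy validation
)

df = pd.DataFrame({"id": [1, -2, 3]})
valid_df = schema.validate(df, lazy=True)  # drops the row with id == -2
print(valid_df)
```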

Additional context NA

zaheerabbas-prodigal avatar Mar 26 '24 14:03 zaheerabbas-prodigal

We are also interested in this enhancement

amulgund-kr avatar Mar 28 '24 19:03 amulgund-kr

We are waiting on this enhancement for PySpark DataFrames to filter invalid records after schema validation.

pyarlagadda-kr avatar Mar 28 '24 19:03 pyarlagadda-kr

I am also interested in this enhancement

Niveth7922 avatar Mar 29 '24 02:03 Niveth7922

I am interested as well!!!!

acestes-kr avatar Apr 01 '24 17:04 acestes-kr

This seems like a useful feature! I'd support this effort. @NeerajMalhotra-QB @jaskaransinghsidana @filipeo2-mck any thoughts on this? It would incur significant compute cost, since right now the pyspark check queries are limited to the first invalid value: https://github.com/unionai-oss/pandera/blob/main/pandera/backends/pyspark/builtin_checks.py

A couple of thoughts:

  1. Does this justify an additional value in the PANDERA_VALIDATION_DEPTH configuration, like DATA_ALL, or a new configuration setting like PANDERA_FULL_TABLE_SCAN=True?
  2. We may want to add more comprehensive logging to let users know how validation is being performed.
  3. For very large Spark DataFrames, would a sample or the first n failure cases make sense, or do folks want to see all failure cases?
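To make the cost trade-off concrete, here is a rough sketch in plain PySpark of the two query shapes being discussed (the function names are hypothetical and this is not the actual builtin_checks implementation): the current style looks for at most one offending value, while a full-table variant computes a boolean per row and therefore has to scan every record.

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def first_failure_only(df: DataFrame, column: str, min_value) -> bool:
    # Roughly the current shape: look for at most one offending value and
    # return whether the check passed.
    return df.filter(F.col(column) <= min_value).limit(1).count() == 0


def full_table_mask(df: DataFrame, column: str, min_value) -> DataFrame:
    # Full-table alternative: a boolean per row that downstream code could
    # use to count, report, or drop the invalid rows.
    return df.withColumn(f"{column}_is_valid", F.col(column) > min_value)
```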

cosmicBboy avatar Apr 13 '24 16:04 cosmicBboy

Responding to the three points above:

  1. Yeah, I have also seen some folks asking for this feature, so it makes sense to implement it. Given that it would be an expensive operation, we could add documentation that clearly spells out the performance implications.
  2. I am not sure exactly what you want to capture for this one, but comprehensive logging is always developer friendly.
  3. This one is quite debatable. We debated it internally quite a bit too, not just for pandera but for data quality in general, and honestly the answer varies by team, product maturity, and cost requirements, with no consensus. Some teams want zero bad records no matter the cost and time implications, while others have a percentage threshold they are willing to accept. I would personally rather have it configurable: accept the top N if I want a sample, but also support a value like "-1" meaning full, and let users make the decision that is right for them.
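Purely for illustration (this is not an existing pandera option), such a cap could look roughly like this, with -1 meaning collect everything:

```python
from pyspark.sql import DataFrame


def collect_failure_cases(failing_rows: DataFrame, n_failure_cases: int = -1):
    # -1 means "collect every failing row"; a positive value caps how much
    # data is pulled back to the driver.
    if n_failure_cases == -1:
        return failing_rows.collect()
    return failing_rows.limit(n_failure_cases).collect()
```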

jaskaransinghsidana avatar Apr 14 '24 12:04 jaskaransinghsidana

  1. Yep, I agree it would be useful. Currently we just count the DF to identify failing checks (1 spark action per check). If a new option to filter out invalid rows is set, more spark actions would be triggered per check. It would be a more expensive operation but that can be diminished by caching the PySpark DF before applying the validations (through export PANDERA_CACHE_DATAFRAME=True).
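As a quick illustration of the caching idea (reusing the hypothetical ProductSchema and df from the sketch earlier in this thread): caching the DataFrame once means the repeated per-check Spark actions read from the cache instead of recomputing the full lineage, which is what the PANDERA_CACHE_DATAFRAME setting mentioned above does within pandera.

```python
# Same effect as PANDERA_CACHE_DATAFRAME=True, done by hand: cache once,
# run the (many) per-check actions, then release the cache.
df.cache()
try:
    validated = ProductSchema.validate(df)
finally:
    df.unpersist()
```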

filipeo2-mck avatar Apr 15 '24 10:04 filipeo2-mck

Okay, so high level steps for this issue:

  1. Introduce PANDERA_FULL_TABLE_VALIDATION configuration. By default it should be None and should be set depending on the validation backend. It should be True for the pandas check backend but False for the pyspark backend.
  2. Modify all of the pyspark builtin checks to have two execution modes:
    • PANDERA_FULL_TABLE_VALIDATION=False is the current behavior
    • PANDERA_FULL_TABLE_VALIDATION=True should return a boolean column indicating which elements in the column passed the check (see the sketch below this list)
  3. Make any additional changes to the pyspark backend to support a boolean column as the output of a check (we can take inspiration from the polars check backend on how to do this).
  4. Add support for the drop_invalid_rows option
  5. Add info logging at validation time to let the user know if full table validation is happening or not
  6. Add documentation discussing the performance implications of turning on full table validation.

To avoid further complexity, we shouldn't support randomly sampling data values: we either do full table validation or single value validation (the current pyspark behavior).
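To illustrate steps 2–4, a rough sketch of how a boolean-column check result could feed both a row-level failure report and drop_invalid_rows (the helper name and signature are hypothetical, not the actual pandera backend code):

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def apply_full_table_check(df: DataFrame, check_expr, drop_invalid_rows: bool):
    # check_expr is a boolean Column produced by a check running in
    # full-table mode, e.g. F.col("price") >= 0.
    flagged = df.withColumn("_check_passed", check_expr)
    failure_cases = flagged.filter(~F.col("_check_passed")).drop("_check_passed")
    if drop_invalid_rows:
        valid = flagged.filter(F.col("_check_passed")).drop("_check_passed")
        return valid, failure_cases
    return df, failure_cases
```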

cosmicBboy avatar Apr 15 '24 19:04 cosmicBboy

Happy to help review and take a PR over the finish line if someone wants to take this task on @zaheerabbas-prodigal

cosmicBboy avatar Apr 15 '24 19:04 cosmicBboy

Thanks for laying out the high-level details needed for this feature @cosmicBboy, will be happy to take this up 😄.

zaheerabbas21 avatar Apr 28 '24 21:04 zaheerabbas21


Hey, this feature would be really, really nice. I can easily sell my Databricks team on Pandera if this were to be implemented. Are you still working on this, or should I not get my hopes up?

vovavili avatar Sep 18 '24 01:09 vovavili

@vovavili and everyone on this thread: I have been busy the past few months, but I look forward to picking this up in the next week or so. Will update here soon.

zaheerabbas21 avatar Sep 27 '24 09:09 zaheerabbas21

Awesome, thanks!

cosmicBboy avatar Sep 27 '24 13:09 cosmicBboy