
Performance impact when trying to generate profiling report for more than 200 columns

Open eapframework opened this issue 1 year ago • 2 comments

Encountering performance issues when generating a profiling report for more than 200 columns across 5 million records. I am applying almost all of the available metrics to generate the report: datatype, entropy, minimum, maximum, sum, standard deviation, mean, maxlength, minlength, histogram, completeness, distinctness, uniquevalueratio, uniqueness, countdistinct, and correlation. I am trying to generate a report similar to ydata-profiling (https://github.com/ydataai/ydata-profiling).
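
For reference, a minimal sketch of how such a run is assumed to be set up: one analyzer per metric per column, all added to a single `AnalysisRunner` invocation. The function name and column handling below are illustrative, not my exact code.

```scala
import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.NumericType

// Assumed setup: every metric is registered as a separate analyzer for every
// column, and everything is computed in one AnalysisRunner pass.
def profileAllColumns(df: DataFrame): AnalyzerContext = {
  val numericCols = df.schema.fields
    .collect { case f if f.dataType.isInstanceOf[NumericType] => f.name }
    .toSet

  val withAnalyzers = df.columns.foldLeft(AnalysisRunner.onData(df)) { (builder, col) =>
    val base = builder
      .addAnalyzer(Completeness(col))
      .addAnalyzer(Distinctness(Seq(col)))
      .addAnalyzer(Uniqueness(Seq(col)))
      .addAnalyzer(UniqueValueRatio(Seq(col)))
      .addAnalyzer(CountDistinct(Seq(col)))
      .addAnalyzer(Entropy(col))
      .addAnalyzer(Histogram(col))
      .addAnalyzer(DataType(col))

    // Numeric-only metrics are added only for numeric columns; string-only
    // metrics (MinLength, MaxLength) would be added analogously for string columns.
    if (numericCols.contains(col)) {
      base
        .addAnalyzer(Minimum(col))
        .addAnalyzer(Maximum(col))
        .addAnalyzer(Mean(col))
        .addAnalyzer(Sum(col))
        .addAnalyzer(StandardDeviation(col))
    } else {
      base
    }
  }

  withAnalyzers.run()
}
```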

The job has been running for over 3 hours despite attempts to tune the Spark configuration. The logs show that each metric is calculated sequentially, and this sequential computation is what prolongs the runtime. Is it possible to parallelize this operation for improved efficiency?

eapframework avatar Feb 16 '24 19:02 eapframework

Thanks for the feedback, @eapframework. We will investigate this issue and get back to you with an update.

rdsharma26 avatar Mar 07 '24 22:03 rdsharma26

Hi @rdsharma26, I did some more testing. Looking at the Spark execution tasks, I believe the performance issue is that for metrics such as CountDistinct and Histogram, the calculation is performed column by column in a sequential manner, so the more columns the DataFrame has, the longer the job runs. Parallelizing these calculations would improve efficiency.
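
One client-side workaround I am considering (not a deequ feature, just an idea to test): split the columns into batches and submit one profiling run per batch concurrently, so Spark can schedule the resulting jobs in parallel (especially with `spark.scheduler.mode=FAIR`). The sketch below uses `ColumnProfilerRunner` for simplicity; the same batching idea should apply to an `AnalysisRunner` with explicit analyzers. Function name, batch size, and thread-pool size are placeholders.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import com.amazon.deequ.profiles.{ColumnProfilerRunner, ColumnProfiles}
import org.apache.spark.sql.DataFrame

// Hypothetical workaround: run several smaller profiling jobs concurrently,
// each restricted to a batch of columns, instead of one huge sequential job.
def profileInParallel(df: DataFrame,
                      batchSize: Int = 25,
                      parallelism: Int = 4): Seq[ColumnProfiles] = {
  // Dedicated thread pool for submitting concurrent Spark jobs.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(parallelism))

  // Cache so each batch does not re-read the source data.
  val cachedDf = df.cache()

  val futures = cachedDf.columns.grouped(batchSize).toSeq.map { batch =>
    Future {
      // Each batch only profiles its own columns, keeping the per-job work small.
      ColumnProfilerRunner()
        .onData(cachedDf.select(batch.map(cachedDf.col): _*))
        .run()
    }
  }

  // In production code the thread pool should be shut down after this completes.
  Await.result(Future.sequence(futures), Duration.Inf)
}
```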

eapframework avatar Apr 12 '24 19:04 eapframework