
Improving performance of histogram analyzer on 150 columns

Open eapframework opened this issue 5 years ago • 5 comments

Hi,

I am trying to run the Histogram analyzer on 150 columns by appending analyzers from a list of column names. The code works, but it takes nearly an hour to run.

    val columnName = List("col1", "col2", "col3", "col4", /* ... */ "col149", "col150")
    var analysisResult = AnalysisRunner.onData(dataFrame)
    for (column <- columnName) {
      analysisResult = analysisResult.addAnalyzer(Histogram(column))
    }
    val metricsResult = analysisResult.run()

Can you please help optimize the performance? I am running with 50 executors, 3 cores per executor, and spark.sql.shuffle.partitions = 150.

Is it possible to load each column on a separate executor to improve performance? I suspect the record shuffling among executors is hurting performance.

Thanks!

eapframework avatar Oct 04 '20 06:10 eapframework

The problem is that the histogram analyzer needs to compute the exact count for each bucket, so there is no way to avoid shuffling.

sscdotopen avatar Oct 07 '20 08:10 sscdotopen

I am facing this issue too; as the number of columns goes beyond 100, the performance deteriorates. Can we build a histogram analyzer that uses approximate distinct counts for faster calculation? @sscdotopen

utkarshshukla2912 avatar Oct 08 '20 06:10 utkarshshukla2912
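Editor's note: to illustrate the trade-off behind approximate distinct counting (deequ itself ships an `ApproxCountDistinct` analyzer that delegates to Spark's HyperLogLog++ implementation), here is a minimal, self-contained KMV ("k minimum values") sketch in plain Scala. It is an illustration of the general technique, not deequ's actual implementation: by keeping only the k smallest hash values, the k-th smallest value's position in the hash space estimates how densely distinct values fill it, with error on the order of 1/sqrt(k).

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// KMV sketch: keep the k smallest normalized hash values seen so far.
// If k distinct values occupy the fraction [0, v_k) of the hash space,
// the total number of distinct values is approximately (k - 1) / v_k.
class KmvSketch(k: Int) {
  private val minHashes = mutable.SortedSet.empty[Double]

  // Map a value to a deterministic pseudo-uniform point in [0, 1).
  private def hash01(value: String): Double =
    (MurmurHash3.stringHash(value).toLong - Int.MinValue.toLong).toDouble /
      (1L << 32).toDouble

  def add(value: String): Unit = {
    val h = hash01(value)
    if (minHashes.size < k) {
      minHashes += h
    } else if (h < minHashes.last && !minHashes.contains(h)) {
      minHashes += h          // insert the new small hash ...
      minHashes -= minHashes.last // ... and evict the largest to keep size k
    }
  }

  def estimate: Double =
    if (minHashes.size < k) minHashes.size.toDouble // fewer than k distinct: exact
    else (k - 1).toDouble / minHashes.last
}

val sketch = new KmvSketch(256)
(1 to 10000).foreach(i => sketch.add(s"value-$i"))
println(f"estimated distinct count: ${sketch.estimate}%.0f")
```

The sketch needs only O(k) memory per column and merges cheaply across partitions (union the hash sets, keep the k smallest), which is why approximate analyzers avoid the full shuffle that exact per-bucket counts require.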

That should be possible, would you like to work on that?

sscdotopen avatar Oct 08 '20 07:10 sscdotopen

@sscdotopen Yes, I would like to contribute to this.

utkarshshukla2912 avatar Oct 21 '20 08:10 utkarshshukla2912

What do you think about batching the columns, 100 per batch? That has worked well for us. @sscdotopen @utkarshshukla2912

academy-codex avatar May 26 '21 06:05 academy-codex
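Editor's note: the batching idea above can be sketched as follows. The pure-Scala part (splitting the column list with `grouped`) is shown runnable; the deequ calls are left as comments because they require a live `SparkSession` and the `dataFrame` from the original question, which are assumed here.

```scala
// Split 150 column names into batches of at most 100 and run one
// AnalysisRunner pass per batch instead of a single 150-column pass.
val columnNames: List[String] = (1 to 150).map(i => s"col$i").toList

val batches: List[List[String]] = columnNames.grouped(100).toList

batches.foreach { batch =>
  // One deequ run per batch (requires a SparkSession; sketch only):
  //   var run = AnalysisRunner.onData(dataFrame)
  //   batch.foreach(col => run = run.addAnalyzer(Histogram(col)))
  //   val metrics = run.run()
  println(s"would analyze a batch of ${batch.size} columns")
}
```

Each batch produces its own `AnalyzerContext`, so the per-batch metrics need to be collected and combined by the caller; the gain is that each Spark job shuffles state for at most 100 histogram aggregations at a time.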