
Improving performance of histogram analyzer on 150 columns

Open eapframework opened this issue 5 years ago • 5 comments

Hi,

I am trying to run the Histogram analyzer on 150 columns by appending analyzers from a list of column names. The code works, but it takes nearly an hour to run.

    val columnName = List("col1", "col2", "col3", "col4", /* ... */ "col149", "col150")
    var analysisResult = AnalysisRunner.onData(dataFrame)
    for (column <- columnName) {
      analysisResult = analysisResult.addAnalyzer(Histogram(column))
    }
    val metricsResult = analysisResult.run()

Can you please help optimize the performance? I am running with 50 executors, 3 cores per executor, and spark.sql.shuffle.partitions = 150.

Is it possible to load each column on a separate executor to improve performance? I suspect the record shuffling among executors is hurting performance.

Thanks!

eapframework avatar Oct 04 '20 06:10 eapframework

The problem is that the histogram analyzer needs to compute the exact count for each bucket, so there is no way to avoid shuffling.

sscdotopen avatar Oct 07 '20 08:10 sscdotopen

I am facing this issue too; as the number of columns goes beyond 100, the performance deteriorates. Can we build a histogram analyzer that uses approximate distinct counts for faster calculation? @sscdotopen

utkarshshukla2912 avatar Oct 08 '20 06:10 utkarshshukla2912
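Editor's note: to illustrate the trade-off behind approximate distinct counting (deequ itself ships an `ApproxCountDistinct` analyzer that delegates to Spark's HyperLogLog++ implementation), here is a minimal, self-contained KMV ("k minimum values") sketch in plain Scala. It is an illustration of the general technique, not deequ's actual implementation: by keeping only the k smallest hash values, the k-th smallest value's position in the hash space estimates how densely distinct values fill it, with error on the order of 1/sqrt(k).

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// KMV sketch: keep the k smallest normalized hash values seen so far.
// If k distinct values occupy the fraction [0, v_k) of the hash space,
// the total number of distinct values is approximately (k - 1) / v_k.
class KmvSketch(k: Int) {
  private val minHashes = mutable.SortedSet.empty[Double]

  // Map a value to a deterministic pseudo-uniform point in [0, 1).
  private def hash01(value: String): Double =
    (MurmurHash3.stringHash(value).toLong - Int.MinValue.toLong).toDouble /
      (1L << 32).toDouble

  def add(value: String): Unit = {
    val h = hash01(value)
    if (minHashes.size < k) {
      minHashes += h
    } else if (h < minHashes.last && !minHashes.contains(h)) {
      minHashes += h          // insert the new small hash ...
      minHashes -= minHashes.last // ... and evict the largest to keep size k
    }
  }

  def estimate: Double =
    if (minHashes.size < k) minHashes.size.toDouble // fewer than k distinct: exact
    else (k - 1).toDouble / minHashes.last
}

val sketch = new KmvSketch(256)
(1 to 10000).foreach(i => sketch.add(s"value-$i"))
println(f"estimated distinct count: ${sketch.estimate}%.0f")
```

The sketch needs only O(k) memory per column and merges cheaply across partitions (union the hash sets, keep the k smallest), which is why approximate analyzers avoid the full shuffle that exact per-bucket counts require.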

That should be possible, would you like to work on that?

sscdotopen avatar Oct 08 '20 07:10 sscdotopen

@sscdotopen Yes, I would like to contribute to this.

utkarshshukla2912 avatar Oct 21 '20 08:10 utkarshshukla2912

What do you think about batching the columns, 100 per batch? That has worked well for us. @sscdotopen @utkarshshukla2912

academy-codex avatar May 26 '21 06:05 academy-codex
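Editor's note: the batching idea above can be sketched as follows. The pure-Scala part (splitting the column list with `grouped`) is shown runnable; the deequ calls are left as comments because they require a live `SparkSession` and the `dataFrame` from the original question, which are assumed here.

```scala
// Split 150 column names into batches of at most 100 and run one
// AnalysisRunner pass per batch instead of a single 150-column pass.
val columnNames: List[String] = (1 to 150).map(i => s"col$i").toList

val batches: List[List[String]] = columnNames.grouped(100).toList

batches.foreach { batch =>
  // One deequ run per batch (requires a SparkSession; sketch only):
  //   var run = AnalysisRunner.onData(dataFrame)
  //   batch.foreach(col => run = run.addAnalyzer(Histogram(col)))
  //   val metrics = run.run()
  println(s"would analyze a batch of ${batch.size} columns")
}
```

Each batch produces its own `AnalyzerContext`, so the per-batch metrics need to be collected and combined by the caller; the gain is that each Spark job shuffles state for at most 100 histogram aggregations at a time.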