datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

WIP: Fix performance regression with `stddev` being enabled by default

Open andygrove opened this issue 1 year ago • 1 comments

Which issue does this PR close?

Closes #.

Rationale for this change

Fix a performance regression and simplify configs for enabling operators and expressions

tpcds_allqueries

We now fall back to Spark for stddev_sample aggregates.

image

What changes are included in this PR?

  • stddev is now disabled by default. There was a recent regression related to configs that had enabled this by default, and this operation is much slower in DataFusion than in Spark.
  • Remove spark.comet.exec.all.enabled and enable all operators by default. Each operator can be disabled individually be changing it's enabled config to false. These are all documented.

How are these changes tested?

Existing tests

andygrove avatar Aug 18 '24 13:08 andygrove

Codecov Report

Attention: Patch coverage is 53.01205% with 39 lines in your changes missing coverage. Please review.

Project coverage is 34.04%. Comparing base (c9af3f4) to head (a868355). Report is 1 commits behind head on main.

Files Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 53.06% 8 Missing and 15 partials :warning:
...org/apache/comet/CometSparkSessionExtensions.scala 0.00% 1 Missing and 15 partials :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #840      +/-   ##
============================================
- Coverage     34.05%   34.04%   -0.01%     
+ Complexity      879      861      -18     
============================================
  Files           112      112              
  Lines         42959    42912      -47     
  Branches       9488     9473      -15     
============================================
- Hits          14629    14609      -20     
+ Misses        25330    25312      -18     
+ Partials       3000     2991       -9     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Aug 18 '24 18:08 codecov-commenter

@parthchandra Could I get your opinion on the config changes in this PR? I'm not sure this is the best approach but it was previously not possible to have a default value of false for spark.comet.exec.OPNAME.enabled because it would be enabled anyway if all execs were enabled.

andygrove avatar Aug 19 '24 13:08 andygrove

I'm not sure this is the best approach but it was previously not possible to have a default value of false for spark.comet.exec.OPNAME.enabled because it would be enabled anyway if all execs were enabled.

Would it be better to have have an exclusion list (does not have to be a config, could just be an internal list) which is explicitly checked if EXEC_ALL is enabled? So op is enabled if ( (EXEC_ALL && not in EXEC_ALL_EXCLUSIONS) || op is enabled) ? We can include an op in exclusions if we feel it is not ready and it can be enabled explicitly for development/testing.

parthchandra avatar Aug 19 '24 16:08 parthchandra

There is an upstream PR in DataFusion to improve stddev performance https://github.com/apache/datafusion/pull/12095

andygrove avatar Aug 21 '24 23:08 andygrove

I am closing this issue because we can disable this expression via a config now that https://github.com/apache/datafusion-comet/pull/855 is merged

andygrove avatar Aug 25 '24 15:08 andygrove