WIP: Fix performance regression with `stddev` being enabled by default
Which issue does this PR close?
Closes #.
Rationale for this change
Fix a performance regression and simplify configs for enabling operators and expressions
We now fall back to Spark for stddev_sample aggregates.
What changes are included in this PR?
stddevis now disabled by default. There was a recent regression related to configs that had enabled this by default, and this operation is much slower in DataFusion than in Spark.- Remove
spark.comet.exec.all.enabledand enable all operators by default. Each operator can be disabled individually be changing it's enabled config to false. These are all documented.
How are these changes tested?
Existing tests
Codecov Report
Attention: Patch coverage is 53.01205% with 39 lines in your changes missing coverage. Please review.
Project coverage is 34.04%. Comparing base (
c9af3f4) to head (a868355). Report is 1 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #840 +/- ##
============================================
- Coverage 34.05% 34.04% -0.01%
+ Complexity 879 861 -18
============================================
Files 112 112
Lines 42959 42912 -47
Branches 9488 9473 -15
============================================
- Hits 14629 14609 -20
+ Misses 25330 25312 -18
+ Partials 3000 2991 -9
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@parthchandra Could I get your opinion on the config changes in this PR? I'm not sure this is the best approach but it was previously not possible to have a default value of false for spark.comet.exec.OPNAME.enabled because it would be enabled anyway if all execs were enabled.
I'm not sure this is the best approach but it was previously not possible to have a default value of
falseforspark.comet.exec.OPNAME.enabledbecause it would be enabled anyway if all execs were enabled.
Would it be better to have have an exclusion list (does not have to be a config, could just be an internal list) which is explicitly checked if EXEC_ALL is enabled? So op is enabled if ( (EXEC_ALL && not in EXEC_ALL_EXCLUSIONS) || op is enabled) ? We can include an op in exclusions if we feel it is not ready and it can be enabled explicitly for development/testing.
There is an upstream PR in DataFusion to improve stddev performance https://github.com/apache/datafusion/pull/12095
I am closing this issue because we can disable this expression via a config now that https://github.com/apache/datafusion-comet/pull/855 is merged