spark
spark copied to clipboard
[SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default
What changes were proposed in this pull request?
This PR enable spark.sql.cbo.enabled
by default.
Why are the changes needed?
- Enable CBO to get better performance, we've enabled it over 3 years.
- Benchmark related tests also enabled it:https://github.com/apache/spark/blob/e83f8a872a16d4f049cefb1fc445f91cf84443ad/sql/core/src/test/scala/org/apache/spark/sql/TPCBase.scala#L32
Does this PR introduce any user-facing change?
I don't think this PR could introduce any user-facing changes, we've fixed related bugs before, for example: https://github.com/apache/spark/pull/29894.
How was this patch tested?
Fix existing tests.
cc @cloud-fan
cc @sigmod @maryannxue @viirya
cc @sunchao , @viirya , @huaxingao
@wangyum do you have any numbers on the performance gain from this?
@wangyum do you have any numbers on the performance gain from this?
Will update the benchmark results later.
@viirya Here is an example. AQE can be converted to broadcast join, but the performance is worse than directly planned to broadcast join.
import org.apache.spark.benchmark.Benchmark
val numRows = 1024 * 1024 * 25
spark.sql(s"CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b, id AS c, id AS d, id AS e FROM range(${numRows}L)")
spark.sql(s"CREATE TABLE t2 using parquet AS SELECT id AS a, id AS b, id AS c, id AS d, id AS e FROM range(${numRows}L)")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
val benchmark = new Benchmark(s"Benchmark SPARK-39750", numRows, minNumIters = 5)
Seq(false, true).foreach { enabled =>
val name = s"CBO is ${if (enabled) "Enabled" else "Disabled"}"
benchmark.addCase(name) { _ =>
withSQLConf("spark.sql.cbo.enabled" -> s"$enabled") {
spark.sql(s"SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t2.b < 10000").write.format("noop").mode("Overwrite").save()
}
}
}
benchmark.run()
Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark SPARK-39750: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
CBO is Disabled 5277 5570 195 5.0 201.3 1.0X
CBO is Enabled 1378 1582 166 19.0 52.6 3.8X
Default | Enable CBO |
---|---|
![]() |
![]() |
TPC-DS q47:
Default | Enable CBO |
---|---|
![]() |
![]() |
TPC-DS q84:
Default | Enable CBO |
---|---|
![]() |
![]() |
Could you add some description about some requirements or assumptions. For example, what happens when the statistics are incorrect, @wangyum ?
This is a great move, thanks @wangyum for starting this! I think CBO and AQE could work well together, with CBO handles query without shuffle in a better way, and AQE handles query with shuffle in a better way. With more accurate AQE statistics, CBO optimizer could also do a better job for query planning.
But I shared similar question with @dongjoon-hyun, when the statistics is out of date to become inaccurate, current CBO has no way to check it AFAIK. It would be great if we can add more check / work around for it in 3.4 release to either (1).make statistics up-to-date, or (2).automatically disable CBO if statistics is stale.
+CC @zoelin7, @xkrogen FYI
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!