spark [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default

What changes were proposed in this pull request?

This PR enable spark.sql.cbo.enabled by default.

Why are the changes needed?

Enable CBO to get better performance, we've enabled it over 3 years.
Benchmark related tests also enabled it:https://github.com/apache/spark/blob/e83f8a872a16d4f049cefb1fc445f91cf84443ad/sql/core/src/test/scala/org/apache/spark/sql/TPCBase.scala#L32

Does this PR introduce any user-facing change?

I don't think this PR could introduce any user-facing changes, we've fixed related bugs before, for example: https://github.com/apache/spark/pull/29894.

How was this patch tested?

Fix existing tests.

Jul 12 '22 06:07 wangyum

cc @cloud-fan

Jul 12 '22 06:07 wangyum

cc @sigmod @maryannxue @viirya

Jul 12 '22 08:07 cloud-fan

cc @sunchao , @viirya , @huaxingao

Jul 12 '22 16:07 dongjoon-hyun

@wangyum do you have any numbers on the performance gain from this?

Jul 12 '22 16:07 sunchao

@wangyum do you have any numbers on the performance gain from this?

Will update the benchmark results later.

Jul 13 '22 02:07 wangyum

@viirya Here is an example. AQE can be converted to broadcast join, but the performance is worse than directly planned to broadcast join.

import org.apache.spark.benchmark.Benchmark
val numRows = 1024 * 1024 * 25
spark.sql(s"CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b, id AS c, id AS d, id AS e FROM range(${numRows}L)")
spark.sql(s"CREATE TABLE t2 using parquet AS SELECT id AS a, id AS b, id AS c, id AS d, id AS e FROM range(${numRows}L)")

spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")

val benchmark = new Benchmark(s"Benchmark SPARK-39750", numRows, minNumIters = 5)
Seq(false, true).foreach { enabled =>
  val name = s"CBO is ${if (enabled) "Enabled" else "Disabled"}"
  benchmark.addCase(name) { _ =>
    withSQLConf("spark.sql.cbo.enabled" -> s"$enabled") {
      spark.sql(s"SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t2.b < 10000").write.format("noop").mode("Overwrite").save()
    }
  }
}
benchmark.run()

Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark SPARK-39750:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
CBO is Disabled                                    5277           5570         195          5.0         201.3       1.0X
CBO is Enabled                                     1378           1582         166         19.0          52.6       3.8X

Default	Enable CBO

Jul 13 '22 02:07 wangyum

TPC-DS q47:

Default	Enable CBO

Jul 14 '22 09:07 wangyum

TPC-DS q84:

Default	Enable CBO

Jul 14 '22 09:07 wangyum

Could you add some description about some requirements or assumptions. For example, what happens when the statistics are incorrect, @wangyum ?

Jul 14 '22 20:07 dongjoon-hyun

This is a great move, thanks @wangyum for starting this! I think CBO and AQE could work well together, with CBO handles query without shuffle in a better way, and AQE handles query with shuffle in a better way. With more accurate AQE statistics, CBO optimizer could also do a better job for query planning.

But I shared similar question with @dongjoon-hyun, when the statistics is out of date to become inaccurate, current CBO has no way to check it AFAIK. It would be great if we can add more check / work around for it in 3.4 release to either (1).make statistics up-to-date, or (2).automatically disable CBO if statistics is stale.

Jul 24 '22 09:07 c21

+CC @zoelin7, @xkrogen FYI

Jul 25 '22 07:07 mridulm

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Nov 03 '22 00:11 github-actions[bot]

spark spark copied to clipboard

[SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

spark
spark copied to clipboard