spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default

Open wangyum opened this issue 2 years ago • 11 comments

What changes were proposed in this pull request?

This PR enable spark.sql.cbo.enabled by default.

Why are the changes needed?

  1. Enable CBO to get better performance, we've enabled it over 3 years.
  2. Benchmark related tests also enabled it:https://github.com/apache/spark/blob/e83f8a872a16d4f049cefb1fc445f91cf84443ad/sql/core/src/test/scala/org/apache/spark/sql/TPCBase.scala#L32

Does this PR introduce any user-facing change?

I don't think this PR could introduce any user-facing changes, we've fixed related bugs before, for example: https://github.com/apache/spark/pull/29894.

How was this patch tested?

Fix existing tests.

wangyum avatar Jul 12 '22 06:07 wangyum

cc @cloud-fan

wangyum avatar Jul 12 '22 06:07 wangyum

cc @sigmod @maryannxue @viirya

cloud-fan avatar Jul 12 '22 08:07 cloud-fan

cc @sunchao , @viirya , @huaxingao

dongjoon-hyun avatar Jul 12 '22 16:07 dongjoon-hyun

@wangyum do you have any numbers on the performance gain from this?

sunchao avatar Jul 12 '22 16:07 sunchao

@wangyum do you have any numbers on the performance gain from this?

Will update the benchmark results later.

wangyum avatar Jul 13 '22 02:07 wangyum

@viirya Here is an example. AQE can be converted to broadcast join, but the performance is worse than directly planned to broadcast join.

import org.apache.spark.benchmark.Benchmark
val numRows = 1024 * 1024 * 25
spark.sql(s"CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b, id AS c, id AS d, id AS e FROM range(${numRows}L)")
spark.sql(s"CREATE TABLE t2 using parquet AS SELECT id AS a, id AS b, id AS c, id AS d, id AS e FROM range(${numRows}L)")

spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")

val benchmark = new Benchmark(s"Benchmark SPARK-39750", numRows, minNumIters = 5)
Seq(false, true).foreach { enabled =>
  val name = s"CBO is ${if (enabled) "Enabled" else "Disabled"}"
  benchmark.addCase(name) { _ =>
    withSQLConf("spark.sql.cbo.enabled" -> s"$enabled") {
      spark.sql(s"SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t2.b < 10000").write.format("noop").mode("Overwrite").save()
    }
  }
}
benchmark.run()
Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark SPARK-39750:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
CBO is Disabled                                    5277           5570         195          5.0         201.3       1.0X
CBO is Enabled                                     1378           1582         166         19.0          52.6       3.8X
Default Enable CBO
image image

wangyum avatar Jul 13 '22 02:07 wangyum

TPC-DS q47:

Default Enable CBO
image image

wangyum avatar Jul 14 '22 09:07 wangyum

TPC-DS q84:

Default Enable CBO
image image

wangyum avatar Jul 14 '22 09:07 wangyum

Could you add some description about some requirements or assumptions. For example, what happens when the statistics are incorrect, @wangyum ?

dongjoon-hyun avatar Jul 14 '22 20:07 dongjoon-hyun

This is a great move, thanks @wangyum for starting this! I think CBO and AQE could work well together, with CBO handles query without shuffle in a better way, and AQE handles query with shuffle in a better way. With more accurate AQE statistics, CBO optimizer could also do a better job for query planning.

But I shared similar question with @dongjoon-hyun, when the statistics is out of date to become inaccurate, current CBO has no way to check it AFAIK. It would be great if we can add more check / work around for it in 3.4 release to either (1).make statistics up-to-date, or (2).automatically disable CBO if statistics is stale.

c21 avatar Jul 24 '22 09:07 c21

+CC @zoelin7, @xkrogen FYI

mridulm avatar Jul 25 '22 07:07 mridulm

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Nov 03 '22 00:11 github-actions[bot]