datafusion-comet [EPIC] Improve performance of TPC-H queries

[EPIC] Improve performance of TPC-H queries

Open andygrove opened this issue 9 months ago • 9 comments

This epic is for tracking progress on improving performance of Comet with our benchmarks derived from TPC-H.

Comet is 1.6x faster than Spark
Comet is not as fast as other DataFusion subprojects yet
All of these DataFusion subprojects are performing similar native execution, which indicates that there is room to improve on Comet's current performance

We do not run all queries fully natively yet due to these missing features:

[ ] Scans are sometimes slower due to dictionary encoding or decoding, and it may be better if we can defer this until later in the query, but this is not really possible at the moment because DataFusion requires that all batches with a stream have the same physical type, so we cannot match Utf and Dictionary<Utf8> for example
[ ] CometExchange is sometimes slower than Spark's exchange even though it reads and writes less data.

Most of these queries are already faster with Comet enabled. Here are notes on areas where performance could potentially be improved.

q1
- https://github.com/apache/datafusion-comet/issues/942
q2
- Some scans are slower, partly due to dictionary unpacking cost
- Some exchanges are slower (could this also be due to premature unpacking of dictionaries?)
q3
q4
q5
- #846
q6
q7
- #846
q8
- #846
q9
- 82% of the time is in SortExec + SortMergeJoinExec. Ballista uses a HashJoinExec.
q10
q11
q12
q13
q14
- lineitem scans take 2x longer in Comet, but this is offset by avoiding an expensive C2R. The time for native decoding in Comet is longer than the entire scan in Spark.
- #942
q15
q16
- #457
q17
- #398
q18
q19
- #398
q20
- #846
- #398
q21
- #846
- #398
q22

May 06 '24 21:05 andygrove