[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench
Is your feature request related to a problem or challenge?
I am mostly writing this up to record what I think is an ongoing work with @jayzhan211 @Rachelint @korowa and myself
TLDR, we are working on (and getting pretty close) to having DataFusion be the fastest single node engine for querying parquet files in ClickBench
Background:
https://benchmark.clickhouse.com/ shows the results of ClickBench
ClickBench the benchmark and is described here https://github.com/ClickHouse/ClickBench. I am not personally interested in proprietary file formats that require special loading
Here is the current leaderboard for partitioned parquet reflecting DataFusion 40.0.0:
Describe the solution you'd like
I would like DataFusion to be the fastest
Describe alternatives you've considered
No response
Additional context
This is also inspired by @ozankabak 's call to action on #11442
The scripts to run with datafusion are here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
Last update is here: https://github.com/ClickHouse/ClickBench/pull/210
Changes I think will make these queries significantly faster:
- [x] https://github.com/apache/datafusion/pull/11627 - @korowa
- [x] https://github.com/apache/datafusion/pull/12269 - @jayzhan211 @Rachelint
- [x] https://github.com/apache/datafusion/issues/11682 - team effort (we are close)
- [ ] @Rachelint also has another potential ~10% faster with https://github.com/apache/datafusion/pull/11943
- [ ] https://github.com/apache/datafusion/issues/3463
These optimizations are general purpose, not specific to Clickhouse I don't think
Reuse hash for repartition #12526 and avoid copy in coalesce #7957 could probably also provide some improvement
Nice!
I think one bigger future interesting direction would be further vectorization of core hash aggregate algorithm (i.e. treating matches as candidates and doing e.g. equality checks in a vectorized way to allow for more specialization / more efficient code).
🤔 As reviewing #12697 , seems we can still continue to improve partial skipping? Now we can modify threshold to get performance improvement, but it may be a bit tricky?
And I think maybe we can make clearer about when partial can help, and when partial will even get slower?
And I think maybe we can make clearer about when partial can help, and when partial will even get slower?
In my mind the challenge with tweaking the "switch to partial mode" threshold setting is that some queries will likely get faster and some will likely get slower. If we can justify changing the default setting to some different constant I think it will be fine. However, if we are going to add more complex logic to decide when to switch modes in my opinion it needs to be significantly better than a static threshold (where significantly means "always better" or close to it)
And I think maybe we can make clearer about when partial can help, and when partial will even get slower?
In my mind the challenge with tweaking the "switch to partial mode" threshold setting is that some queries will likely get faster and some will likely get slower. If we can justify changing the default setting to some different constant I think it will be fine. However, if we are going to add more complex logic to decide when to switch modes in my opinion it needs to be significantly better than a static threshold (where significantly means "always better" or close to it)
Got it, @jayzhan211 have tried some other values of skip_partial_aggregation_probe_ratio_threshold and skip_partial_aggregation_probe_rows_threshold, some queries seems improve obviously in #12697
And I have some thoughs like removing the is_locked field?
Now, we take skip_partial_aggregation_probe_rows_threshold as a sample to define if we need to skip, when exceed we will not check this again).
But I found some partial operator can get improvement from skipping, but have no chance to switch to due to is_locked.
https://github.com/apache/datafusion/pull/12697#discussion_r1789659808 Only Q0 slows down, but given it has nothing to do with grouping, I think we can ignore it.
This number is run on another branch that only change the configuration value, so I think another approach is to remove skip_partial_aggregation_probe_rows_threshold and related logic entirely and set skip_partial_aggregation_probe_ratio_threshold to 0.1.
I think one bigger future interesting direction would be further vectorization of core hash aggregate algorithm
Can we use nightly rust that enable std::simd for vectorization? Although in arrow-rs, the simd code is rewritten with auto-vectorization, but when I check the generated asm, I didn't see vector instruction for all the function (some exists, some doesn't). I think it would be nice to have explicitly simd to ensure the code is always vectorized and not disappear because of the code change or the llvm change.
@jayzhan211 Yeah, this sounds like a good idea. We could start stepping into a direction to make the execution engine as performant as Velox. Especially having arrow be the format should allow us to maximize our use of vectorized execution. Should I open an issue for this?
Can we use nightly rust that enable std::simd for vectorization? Although in arrow-rs, the simd code is rewritten with auto-vectorization, but when I check the generated asm, I didn't see vector instruction for all the function (some exists, some doesn't). I think it would be nice to have explicitly simd to ensure the code is always vectorized and not disappear because of the code change or the llvm change.
I think @tustvold found that using manually written simd kernels is quite hard to get faster than the auto vectorized code (aka using the vector instructions) made by LLVM and also harder to maintain
If possible I would suggest we instead focus on improving the code so that LLVM is better able to auto vectorize code. This is some combination of looking at the resulting assembly code, and then making the inner loops simpler (e.g. via #[inline] and removing bounds checks get_unchecked, special cases for not checking Option, etc)
I found that LLVM is relatively good at vectorizing vertical operations provided:
- There are no conditionals within the loop body
- You've been careful to avoid inlining too much, as the vectorizer gives up if the code is too complex
- You aren't doing bitwise horizontal reductions or masking (although FWIW std::simd struggles with this as well)
- You've enabled SIMD instructions in the target ISA
This last point is likely why you aren't seeing anything, the default x86 ISA is over a decade old at this point and doesn't support pretty much any SIMD instructions. See the Performance Tips section at the end of - https://crates.io/crates/arrow
My 2 cents is to get as far as you can without reaching for std::simd, there is a massive maintainance overhead and with care LLVM can produce code that performs better than naively written manual SIMD. We used to have a fair bit of manual SIMD in arrow-rs, and over time we've removed it as the auto-vectorized code was faster.
I'd recommend getting familiar with tools like https://rust.godbolt.org/ (again being sure to set RUSTFLAGS) and only once you've exhausted that avenue think of reaching for SIMD. Generally the hard part is getting the algorithm structured in such a way that it can be vectorised, regardless of what goes and generates those instructions.
Thank you @tustvold -- that content is so good I made a PR to propose putting it in the readme of arrow-rs: https://github.com/apache/arrow-rs/pull/6554
After a few more PRs for StringView I think we are pretty close: https://github.com/apache/datafusion/pull/12092#issuecomment-2410952786
I'll try and run the numbers at some point to compare to duckdb, but DataFusion is certainly quite a bit faster than 40.0.0 now and will be even more so once we complete the StringView work
StringView by default is finally merged into DataFusion: https://github.com/apache/datafusion/pull/13101
@Rachelint has another non trivial group by performance improvement that is very close: https://github.com/apache/datafusion/pull/12996
Update here: the results from @pmcgleenon are looking really nice: https://github.com/apache/datafusion/issues/13099#issuecomment-2478314793
Also, BTW, 43.0.0 doesn't include the work from @Rachelint that will likely improve things a few more percent overall (substantially for some queries):
@Rachelint has another non trivial group by performance improvement that is very close: #12996
Love this 🚀 🚀 🚀
Here is a fun challenge:
- https://github.com/apache/datafusion/issues/13448
While there is definitely more we can do to improve performance, for now I am going to claim we are done here.
The blog is live: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench
🚀