datafusion [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench

Is your feature request related to a problem or challenge?

I am mostly writing this up to record what I think is an ongoing work with @jayzhan211 @Rachelint @korowa and myself

TLDR, we are working on (and getting pretty close) to having DataFusion be the fastest single node engine for querying parquet files in ClickBench

Background:

https://benchmark.clickhouse.com/ shows the results of ClickBench

ClickBench the benchmark and is described here https://github.com/ClickHouse/ClickBench. I am not personally interested in proprietary file formats that require special loading

Here is the current leaderboard for partitioned parquet reflecting DataFusion 40.0.0:

Screenshot 2024-10-08 at 4 45 16 PM

Describe the solution you'd like

I would like DataFusion to be the fastest

Describe alternatives you've considered

No response

Additional context

This is also inspired by @ozankabak 's call to action on #11442

The scripts to run with datafusion are here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

Last update is here: https://github.com/ClickHouse/ClickBench/pull/210

Oct 08 '24 20:10 alamb

Changes I think will make these queries significantly faster:

[x] https://github.com/apache/datafusion/pull/11627 - @korowa
[x] https://github.com/apache/datafusion/pull/12269 - @jayzhan211 @Rachelint
[x] https://github.com/apache/datafusion/issues/11682 - team effort (we are close)
[ ] @Rachelint also has another potential ~10% faster with https://github.com/apache/datafusion/pull/11943
[ ] https://github.com/apache/datafusion/issues/3463

These optimizations are general purpose, not specific to Clickhouse I don't think

Oct 08 '24 20:10 alamb

Reuse hash for repartition #12526 and avoid copy in coalesce #7957 could probably also provide some improvement

Oct 09 '24 07:10 jayzhan211

Nice!

I think one bigger future interesting direction would be further vectorization of core hash aggregate algorithm (i.e. treating matches as candidates and doing e.g. equality checks in a vectorized way to allow for more specialization / more efficient code).

Oct 09 '24 07:10 Dandandan

🤔 As reviewing #12697 , seems we can still continue to improve partial skipping? Now we can modify threshold to get performance improvement, but it may be a bit tricky?

And I think maybe we can make clearer about when partial can help, and when partial will even get slower?

Oct 09 '24 09:10 Rachelint

And I think maybe we can make clearer about when partial can help, and when partial will even get slower?

In my mind the challenge with tweaking the "switch to partial mode" threshold setting is that some queries will likely get faster and some will likely get slower. If we can justify changing the default setting to some different constant I think it will be fine. However, if we are going to add more complex logic to decide when to switch modes in my opinion it needs to be significantly better than a static threshold (where significantly means "always better" or close to it)

Oct 09 '24 21:10 alamb

And I think maybe we can make clearer about when partial can help, and when partial will even get slower?

In my mind the challenge with tweaking the "switch to partial mode" threshold setting is that some queries will likely get faster and some will likely get slower. If we can justify changing the default setting to some different constant I think it will be fine. However, if we are going to add more complex logic to decide when to switch modes in my opinion it needs to be significantly better than a static threshold (where significantly means "always better" or close to it)

Got it, @jayzhan211 have tried some other values of skip_partial_aggregation_probe_ratio_threshold and skip_partial_aggregation_probe_rows_threshold, some queries seems improve obviously in #12697

And I have some thoughs like removing the is_locked field?

Now, we take skip_partial_aggregation_probe_rows_threshold as a sample to define if we need to skip, when exceed we will not check this again). But I found some partial operator can get improvement from skipping, but have no chance to switch to due to is_locked.

Oct 11 '24 02:10 Rachelint

https://github.com/apache/datafusion/pull/12697#discussion_r1789659808 Only Q0 slows down, but given it has nothing to do with grouping, I think we can ignore it.

This number is run on another branch that only change the configuration value, so I think another approach is to remove skip_partial_aggregation_probe_rows_threshold and related logic entirely and set skip_partial_aggregation_probe_ratio_threshold to 0.1.

Oct 11 '24 03:10 jayzhan211

I think one bigger future interesting direction would be further vectorization of core hash aggregate algorithm

Can we use nightly rust that enable std::simd for vectorization? Although in arrow-rs, the simd code is rewritten with auto-vectorization, but when I check the generated asm, I didn't see vector instruction for all the function (some exists, some doesn't). I think it would be nice to have explicitly simd to ensure the code is always vectorized and not disappear because of the code change or the llvm change.

Oct 12 '24 11:10 jayzhan211

@jayzhan211 Yeah, this sounds like a good idea. We could start stepping into a direction to make the execution engine as performant as Velox. Especially having arrow be the format should allow us to maximize our use of vectorized execution. Should I open an issue for this?

Oct 12 '24 17:10 jonathanc-n

Can we use nightly rust that enable std::simd for vectorization? Although in arrow-rs, the simd code is rewritten with auto-vectorization, but when I check the generated asm, I didn't see vector instruction for all the function (some exists, some doesn't). I think it would be nice to have explicitly simd to ensure the code is always vectorized and not disappear because of the code change or the llvm change.

I think @tustvold found that using manually written simd kernels is quite hard to get faster than the auto vectorized code (aka using the vector instructions) made by LLVM and also harder to maintain

If possible I would suggest we instead focus on improving the code so that LLVM is better able to auto vectorize code. This is some combination of looking at the resulting assembly code, and then making the inner loops simpler (e.g. via #[inline] and removing bounds checks get_unchecked, special cases for not checking Option, etc)

Oct 13 '24 12:10 alamb

I found that LLVM is relatively good at vectorizing vertical operations provided:

There are no conditionals within the loop body
You've been careful to avoid inlining too much, as the vectorizer gives up if the code is too complex
You aren't doing bitwise horizontal reductions or masking (although FWIW std::simd struggles with this as well)
You've enabled SIMD instructions in the target ISA

This last point is likely why you aren't seeing anything, the default x86 ISA is over a decade old at this point and doesn't support pretty much any SIMD instructions. See the Performance Tips section at the end of - https://crates.io/crates/arrow

My 2 cents is to get as far as you can without reaching for std::simd, there is a massive maintainance overhead and with care LLVM can produce code that performs better than naively written manual SIMD. We used to have a fair bit of manual SIMD in arrow-rs, and over time we've removed it as the auto-vectorized code was faster.

I'd recommend getting familiar with tools like https://rust.godbolt.org/ (again being sure to set RUSTFLAGS) and only once you've exhausted that avenue think of reaching for SIMD. Generally the hard part is getting the algorithm structured in such a way that it can be vectorised, regardless of what goes and generates those instructions.

Oct 13 '24 12:10 tustvold

Thank you @tustvold -- that content is so good I made a PR to propose putting it in the readme of arrow-rs: https://github.com/apache/arrow-rs/pull/6554

Oct 13 '24 13:10 alamb

After a few more PRs for StringView I think we are pretty close: https://github.com/apache/datafusion/pull/12092#issuecomment-2410952786

I'll try and run the numbers at some point to compare to duckdb, but DataFusion is certainly quite a bit faster than 40.0.0 now and will be even more so once we complete the StringView work

Oct 15 '24 10:10 alamb

StringView by default is finally merged into DataFusion: https://github.com/apache/datafusion/pull/13101

Oct 30 '24 18:10 alamb

@Rachelint has another non trivial group by performance improvement that is very close: https://github.com/apache/datafusion/pull/12996

Nov 04 '24 18:11 alamb

Update here: the results from @pmcgleenon are looking really nice: https://github.com/apache/datafusion/issues/13099#issuecomment-2478314793

386562203-8029e9c7-e6d3-4e7e-8273-725472aeeeb9

Also, BTW, 43.0.0 doesn't include the work from @Rachelint that will likely improve things a few more percent overall (substantially for some queries):

@Rachelint has another non trivial group by performance improvement that is very close: #12996

Nov 15 '24 10:11 alamb

Love this 🚀 🚀 🚀

Nov 15 '24 11:11 ozankabak

Here is a fun challenge:

https://github.com/apache/datafusion/issues/13448

Nov 16 '24 12:11 alamb

While there is definitely more we can do to improve performance, for now I am going to claim we are done here.

The blog is live: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench

🚀

Nov 22 '24 15:11 alamb