datafusion
datafusion copied to clipboard
Range/inequality joins are slow
Describe the bug
Joins where the ON
filter are not equality, but rather inequalities like <
, `> etc. seem slow. Atleast compared to DuckDB which seem like a direct "competitor".
The main difference between the DuckDB and Datafusion plans seem to be that Datafusion uses a NestedLoopJoinExec
, while DuckDB uses a IEJoin
.
Note that the query could be written better with a ASOF-join, but Datafusion does not support that (see issue https://github.com/apache/arrow-datafusion/issues/318).
To Reproduce
Create some test data with this SQL (saved as repro-dataset.sql) in DuckDB:
CREATE
OR REPLACE TABLE pricing AS
SELECT
t,
RANDOM() as v
FROM
range(
'2022-01-01' :: TIMESTAMP,
'2023-01-01' :: TIMESTAMP,
INTERVAL 30 DAY
) ts(t);
COPY pricing to 'pricing.parquet' (format 'parquet');
CREATE
OR REPLACE TABLE timestamps AS
SELECT
t
FROM
range(
'2022-01-01' :: TIMESTAMP,
'2023-01-01' :: TIMESTAMP,
INTERVAL 10 SECOND
) ts(t);
COPY timestamps to 'timestamps.parquet' (format 'parquet');
$ duckdb < repro-dataset.sql
We will compare the performance of the following query in DuckDB and Datafusion. The query is saved as repro-range-query.sql
.
WITH pricing_state AS (
SELECT
t as valid_from,
COALESCE(
LEAD(t, 1) OVER (
ORDER BY
t
),
'9999-12-31'
) as valid_to,
v
FROM
'pricing.parquet'
)
SELECT
t.t,
p.v
FROM
pricing_state p
LEFT JOIN 'timestamps.parquet' t ON t.t BETWEEN p.valid_from
AND p.valid_to;
DuckDB performance:
$ time duckdb < repro-range-query.sql
...
real 0m0.999s
user 0m6.070s
sys 0m3.600s
Datafusion performance:
$ time datafusion-cli -f repro-range-query.sql
...
real 0m8.269s
user 0m6.358s
sys 0m1.907s
Expected behavior
It would be nice if the above query (or something equivalent) would be faster in Datafusion.
If someone knows of a better way to express the query, then that could also be a workaround for me.
Additional context
Machine tested on: CPU:Ryzen 3900x OS: Ubuntu 22.04
Versions used:
$ duckdb --version
v0.9.2 3c695d7ba9
$ datafusion-cli --version
datafusion-cli 33.0.0
I just noticed that what I really want is to actually do a RIGHT join. That is, if there is no matching pricing for a timestamp, it should give null.
Changing the query to that, Datafusion is much faster. I believe it's because with a RIGHT join, pricing becomes the outer table (single partition), while timestamps becomes the inner table (unspecified partitioning), which allows for greater parallelism (see https://github.com/apache/arrow-datafusion/blob/e19c669855baa8b78ff86755803944d2ddf65536/datafusion/physical-plan/src/joins/nested_loop_join.rs#L72-L77C4)
But I think the issue should still be open - the LEFT join is still slower
I think IEJoin
is a form of RangeJoin (https://duckdb.org/2022/05/27/iejoin.html) -- I agree it would be neat to make this fast in DataFusion, but I think it is a pretty major project (it typically requires a specialized operator, as described in the DuckDB blog)
I stared trying to collect a list of various join improvments on https://github.com/apache/arrow-datafusion/issues/8398
I am interested in this ticket. Since it is a pretty major project, I will write a proposal first.
Thank you @my-vegetable-has-exploded -- that is a great idea
cc @korowa / @viirya / @metesynnada who have been involved in Join implementations recently and who may be interested as well
Disregarding IEJoin -- time
output from the issue description seems to show that both DuckDB and DF spend +- same cputime (user + system) and the only difference is parallelism (shown by real time), which, how @simonvandel noticed, depends on left/right input + join type) -- this makes me think that the x8 slowdown is not related to how join performed internally, but more like caused by physical optimizer skipping join reordering for NestedLoopJoin
.
So, if i'm not mistaken, this issue is mostly about covering NLJoin in join_selection.rs.
UPD: in addition, to make join reordering useful, it's also required to modify NLJoin, since currently it chooses build-side based on logical join type.
So, if i'm not mistaken, this issue is mostly about covering NLJoin in join_selection.rs.
I think it is a good idea to improve performance in this scenario. Your pr is also good for me. But I think it is also ok to keep old parallelism strategy. In my opinion, the old paralleism strategy should works, but the check in enforce_distribution.rs
block the reparition of it whick would check the row number. In this query, pricing_state
's row numbers is less than batch_size and the RepartitionExec
also just works for a batch a time.
https://github.com/apache/arrow-datafusion/blob/ad8d552b9f150c3c066b0764e84f72b667a649ff/datafusion/core/src/physical_optimizer/enforce_distribution.rs#L1099-L1106
I think it may another way to write a new enforce_distribution strategy for NestLoopJoin
and CrossJoin
. We can check repartition_beneficial_stats
by the left table size multiply right partition size rather than just right partition size (take RIGHTJOIN for example).
the old paralleism strategy should works, but the check in enforce_distribution.rs block the reparition
I don't think it's proper way to go -- it'll give some benefits in terms of runtime, but it will be suboptimal in terms of memory utilization, and cputime (as we'll need to perform BuildSideRows * NumberOfPartitions filter evaluations instead of BuildSideRows * 1, where 1 is probe side input batches)
I don't think this issue should be closed.
#9676 seems to take care of ordering but I think it doesn't improve range/inequality joins much?
My intention was to fix NLJoin parallelism issue due to fixed build-side choice (since right join instead of left had acceptable performance, as it was claimed above), and in the same time we also have #318 for specialized operator implementation, so, I supposed #9676 to be enough.
Don't mind to keep it open, though.