datafusion Range/inequality joins are slow

Describe the bug

Joins where the ON filter are not equality, but rather inequalities like <, `> etc. seem slow. Atleast compared to DuckDB which seem like a direct "competitor".

The main difference between the DuckDB and Datafusion plans seem to be that Datafusion uses a NestedLoopJoinExec, while DuckDB uses a IEJoin.

Note that the query could be written better with a ASOF-join, but Datafusion does not support that (see issue https://github.com/apache/arrow-datafusion/issues/318).

To Reproduce

Create some test data with this SQL (saved as repro-dataset.sql) in DuckDB:

CREATE
OR REPLACE TABLE pricing AS
SELECT
    t,
    RANDOM() as v
FROM
    range(
        '2022-01-01' :: TIMESTAMP,
        '2023-01-01' :: TIMESTAMP,
        INTERVAL 30 DAY
    ) ts(t);

COPY pricing to 'pricing.parquet' (format 'parquet');

CREATE
OR REPLACE TABLE timestamps AS
SELECT
    t
FROM
    range(
        '2022-01-01' :: TIMESTAMP,
        '2023-01-01' :: TIMESTAMP,
        INTERVAL 10 SECOND
    ) ts(t);

COPY timestamps to 'timestamps.parquet' (format 'parquet');

$ duckdb < repro-dataset.sql

We will compare the performance of the following query in DuckDB and Datafusion. The query is saved as repro-range-query.sql.

WITH pricing_state AS (
    SELECT
        t as valid_from,
        COALESCE(
            LEAD(t, 1) OVER (
                ORDER BY
                    t
            ),
            '9999-12-31'
        ) as valid_to,
        v
    FROM
        'pricing.parquet'
)
SELECT
    t.t,
    p.v
FROM
    pricing_state p
    LEFT JOIN 'timestamps.parquet' t ON t.t BETWEEN p.valid_from
    AND p.valid_to;

DuckDB performance:

$ time duckdb < repro-range-query.sql
...
real    0m0.999s
user    0m6.070s
sys     0m3.600s

Datafusion performance:

$ time datafusion-cli -f repro-range-query.sql
...
real    0m8.269s
user    0m6.358s
sys     0m1.907s

Expected behavior

It would be nice if the above query (or something equivalent) would be faster in Datafusion.

If someone knows of a better way to express the query, then that could also be a workaround for me.

Additional context

Machine tested on: CPU:Ryzen 3900x OS: Ubuntu 22.04

Versions used:

$ duckdb --version
v0.9.2 3c695d7ba9

$ datafusion-cli --version
datafusion-cli 33.0.0

Dec 01 '23 12:12 simonvandel

I just noticed that what I really want is to actually do a RIGHT join. That is, if there is no matching pricing for a timestamp, it should give null.

Changing the query to that, Datafusion is much faster. I believe it's because with a RIGHT join, pricing becomes the outer table (single partition), while timestamps becomes the inner table (unspecified partitioning), which allows for greater parallelism (see https://github.com/apache/arrow-datafusion/blob/e19c669855baa8b78ff86755803944d2ddf65536/datafusion/physical-plan/src/joins/nested_loop_join.rs#L72-L77C4)

But I think the issue should still be open - the LEFT join is still slower

Dec 01 '23 15:12 simonvandel

I think IEJoin is a form of RangeJoin (https://duckdb.org/2022/05/27/iejoin.html) -- I agree it would be neat to make this fast in DataFusion, but I think it is a pretty major project (it typically requires a specialized operator, as described in the DuckDB blog)

Dec 01 '23 20:12 alamb

I stared trying to collect a list of various join improvments on https://github.com/apache/arrow-datafusion/issues/8398

Dec 01 '23 20:12 alamb

I am interested in this ticket. Since it is a pretty major project, I will write a proposal first.

Mar 07 '24 05:03 my-vegetable-has-exploded

Thank you @my-vegetable-has-exploded -- that is a great idea

cc @korowa / @viirya / @metesynnada who have been involved in Join implementations recently and who may be interested as well

Mar 08 '24 20:03 alamb

Disregarding IEJoin -- time output from the issue description seems to show that both DuckDB and DF spend +- same cputime (user + system) and the only difference is parallelism (shown by real time), which, how @simonvandel noticed, depends on left/right input + join type) -- this makes me think that the x8 slowdown is not related to how join performed internally, but more like caused by physical optimizer skipping join reordering for NestedLoopJoin.

So, if i'm not mistaken, this issue is mostly about covering NLJoin in join_selection.rs.

UPD: in addition, to make join reordering useful, it's also required to modify NLJoin, since currently it chooses build-side based on logical join type.

Mar 11 '24 19:03 korowa

So, if i'm not mistaken, this issue is mostly about covering NLJoin in join_selection.rs.

I think it is a good idea to improve performance in this scenario. Your pr is also good for me. But I think it is also ok to keep old parallelism strategy. In my opinion, the old paralleism strategy should works, but the check in enforce_distribution.rs block the reparition of it whick would check the row number. In this query, pricing_state 's row numbers is less than batch_size and the RepartitionExec also just works for a batch a time.

https://github.com/apache/arrow-datafusion/blob/ad8d552b9f150c3c066b0764e84f72b667a649ff/datafusion/core/src/physical_optimizer/enforce_distribution.rs#L1099-L1106

I think it may another way to write a new enforce_distribution strategy for NestLoopJoin and CrossJoin. We can check repartition_beneficial_stats by the left table size multiply right partition size rather than just right partition size (take RIGHTJOIN for example).

Mar 20 '24 02:03 my-vegetable-has-exploded

the old paralleism strategy should works, but the check in enforce_distribution.rs block the reparition

I don't think it's proper way to go -- it'll give some benefits in terms of runtime, but it will be suboptimal in terms of memory utilization, and cputime (as we'll need to perform BuildSideRows * NumberOfPartitions filter evaluations instead of BuildSideRows * 1, where 1 is probe side input batches)

Mar 20 '24 18:03 korowa

I don't think this issue should be closed.

#9676 seems to take care of ordering but I think it doesn't improve range/inequality joins much?

Apr 22 '24 18:04 Dandandan

My intention was to fix NLJoin parallelism issue due to fixed build-side choice (since right join instead of left had acceptable performance, as it was claimed above), and in the same time we also have #318 for specialized operator implementation, so, I supposed #9676 to be enough.

Don't mind to keep it open, though.

Apr 23 '24 17:04 korowa

datafusion datafusion copied to clipboard

Range/inequality joins are slow

Describe the bug

To Reproduce

Expected behavior

Additional context

datafusion
datafusion copied to clipboard