DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

add a keyword to allow specifying target row order in joins

Open bkamins opened this issue 3 years ago • 10 comments

This issue is meant to gather requirements for all join* operations regarding the resulting order of rows after the join operation. After we finish the discussion I will implement it via kwarg to appropriate functions (the default will be the fastest way to produce a result, passing a kwarg will allow user to request a specific order but we need to learn what orders for what joins would be desirable).

bkamins avatar May 07 '21 21:05 bkamins

I recognize that performance concerns are likely what drove the change, but here is what I was expecting:

df1.__id1 = Base.OneTo(nrow(df1))
df2.__id2 = Base.OneTo(nrow(df2))
df3 = ___join(df1,df2,on=keys)

Use left table's row order and then right

left: sort!(df3, ["__id1","__id2"]) inner: sort!(df3, ["__id1","__id2"]) cross: sort!(df3, ["__id1","__id2"]) outer: sort!(df3, ["__id1","__id2"])

Use left table's row order only:

semi: sort!(df3, ["__id1"]) anti: sort!(df3, ["__id1"])

Use right table's row order and then left

right: sort!(df3, ["__id2","__id1"])

Argument in favor of ordering

leftjoin, semijoin, antijoin, and rightjoin all have a clearly implied direction where one table is joined onto the other table (or only one table remains). A naive implementation of these would simply iterate row-wise over the primary table and look for hash matches in the secondary table. That's where their ordering proposals above come from.

innerjoin, outerjoin, and crossjoin are ambiguous because they consider rows for both sides. Without rightjoin, table primacy could be consistently given to the first argument and the intuition for these would be obvious. In fact, the above proposal for the join sort order would actually be the same for all joins if only rightjoin didn't exist. So I guess I went with the leftjoin ordering for inner/outer/cross because I don't use rightjoin, but I admit that it's ambiguous.

Argument against ordering

The ANSI SQL standard does not define a row order resulting from joins. I can only find a free draft from 1992 but section 7.5 doesn't seem to mention it. All join implementation documentation that I have seen is also silent on the issue.

It's also obviously more computationally expensive to do ordering and offers no practical benefit if rows are replicated in both tables (like in a crossjoin).

But what if rows are replicated in only one table? (i.e. the argument for leftjoin!)

Maintaining row order in one table would allow for a copycols=false-like keyword argument to joins where you re-use that table's vectors if possible. This can save a bunch of allocations if one is, as I typically am, joining a very large table to smaller ones with no row replication in the very large table.

Joins in DataFrames.jl copy all vectors, it used to be much more efficient to do something like what is in the leftjoin!() below because you avoid copying columns <y.

using DataFrames # 0.22.7

nr = 1000000

df1 = DataFrame(a = rand(nr), b = rand(nr), c = rand(nr), d = rand(nr),
    e = rand(nr), f = rand(nr), g = rand(nr), h = rand(nr), i = rand(nr),
     j = rand(nr), k = rand(nr), l = rand(nr), m = rand(nr), n = rand(nr),
    o = rand(nr), p = rand(nr), q = rand(nr), r = rand(nr), s = rand(nr),
    t = rand(nr), u = rand(nr), v = rand(nr), w = rand(nr), x = rand(nr),
    y = rand(['a','b','c'], nr));
df2 = DataFrame(y = ['a','b','c'], z = [1,2,3]);

function leftjoin!(df1, df2; on)
    df_tmp = leftjoin(select(df1, on; copycols=false), df2, on=on); # order of df3 matches df1 if the join is unique
    df1.z = df_tmp.z
    df1
end

df0 = @time copy(df1; copycols=true); # 0.204061 seconds (94 allocations: 186.927 MiB) # only necessary for testing in my case
df3 = @time leftjoin!(df0,df2;on="y"); # 0.360305 seconds (246 allocations: 91.449 MiB)
df3 = @time leftjoin(df1, df2, on="y"); # 0.691530 seconds (589 allocations: 274.568 MiB)

Upgrading to DataFrames 1.1.1, the order of df1 is not preserved in the join so the leftjoin! has to get more complicated:

function leftjoin!(df1, df2; on)
    df1.__id = Base.OneTo(UInt32(nrow(df1)))
    df_tmp = leftjoin(select(df1, union([on],["__id"]); copycols=false), df2, on=on); # order of df3 matches df1 if the join is unique
    sort!(df_tmp, "__id")
    df1.z = df_tmp.z
    select!(df1, Not("__id"))
end

df0 = @time copy(df1; copycols=true); # 0.053088 seconds (322 allocations: 186.940 MiB) # only necessary for testing in my case
df3 = @time leftjoin!(df0,df2;on="y"); # 0.088398 seconds (513 allocations: 53.436 MiB)
df3 = @time leftjoin(df1, df2, on="y"); # 0.123257 seconds (854 allocations: 213.671 MiB)

So the order in particular doesn't matter much to me and even the more complicated wrapper is still way more efficient/faster than it was in 0.22.7, but it used to make it a lot easier to avoid allocations.

If you imagine:

df2 = leftjoin(leftjoin(leftjoin(df, model1, on=keys1), model2, on=keys2), model3, on=keys3)

then only the inner-most leftjoin ever needs to copy columns and the outer two could be leftjoin!(), about a 50% decrease in amount of RAM allocated using the above totals.

Summary

In the course of writing this post, I realized that instead of needing the order to be maintained, I really just need to write the leftjoin!() wrapper for myself and import it. It isn't that much more complicated than before and it is still way more efficient.

Byrth avatar May 08 '21 13:05 Byrth

the ! version of join is also on a to-do list, see https://github.com/JuliaData/DataFrames.jl/issues/1334 and https://github.com/JuliaData/DataFrames.jl/issues/2259.

bkamins avatar May 08 '21 15:05 bkamins

@pstorozenko - this is the next thing to think of in joins if you want 😄. There are two layers:

  • thinking of useful API; currently I think of order kwarg
  • designing an efficient implementation

Even starting with the minimal (but most common case), i.e. adding it only to leftjoin with order=:left would probably cover most of the use cases (and if we had this then we could think of extensions which should follow naturally).

bkamins avatar Jun 03 '21 18:06 bkamins

Thanks, but I'd like to understand better how to perform efficient joins in the first place. After looking at h2o benchmarks I want to take a deep dive into polars and data.table way of joining and compare it with DataFrames.jl. This might be very beneficial in the future :)

pstorozenko avatar Jun 03 '21 19:06 pstorozenko

Indeed this also makes a lot of sense. Could you please share a way to contact you directly (my e-mail is [email protected])? I think it would be good to discuss it and could share you my experience.

bkamins avatar Jun 03 '21 19:06 bkamins

BTW: the benchmarks are currently re-calculated and I expect improvements for joins in 5GB case (currently we are bogged down by GC). Hopefully tomorrow we will have an update on H2O website (the solution I proposed now does not resolve all problems but it should improve things).

bkamins avatar Jun 03 '21 19:06 bkamins

TODO: if we add this then we should also add leftjoin! and rightjoin! that would update left/right table respectively.

bkamins avatar Jul 29 '21 20:07 bkamins

Bumping this, I ran into this today.

I think we can add a keyword argument for preserving order without adding leftjoin!.

pdeffebach avatar Jul 21 '22 20:07 pdeffebach

Well this issue is super old. We added leftjoin! in the mean time that preserves order. I will mark it for 1.5 release.

bkamins avatar Jul 21 '22 21:07 bkamins

Ahh I forgot about that. Thanks.

pdeffebach avatar Jul 21 '22 21:07 pdeffebach