DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Add `innerjoin!` to accompany `leftjoin!`?

Open ararslan opened this issue 2 years ago • 8 comments

It can be defined naively as

function innerjoin!(a, b; kwargs...)
    leftjoin!(a, b; kwargs..., indicator=:_row_source)
    subset!(a, :_row_source => ByRow(==("both")))
    select!(a, Not(:_row_source))
    return a
end

That is, simply delete (in-place) the rows not present in both tables after performing an in-place left join.


This was discussed briefly in Slack, re-posting comments here for posterity/context:

@bkamins

We could do it, but I felt it is unsafe, because it drops rows from the left table in-place. The important invariant of leftjoin! is that it does not touch columns of the left table at all.

If you feel innerjoin! would be valuable to add please open an issue, but the point is that there is not much benefit of innerjoin! over innerjoin in terms of performance (as we have to copy all columns anyway). The only potential difference would be row order.

Also note that leftjoin! is not the same as leftjoin but just in-place. The reason is that leftjoin! requires that there are no duplicates in matches, so it is kind-of special (again - the reason is that leftjoin! does not touch the columns of the left table).

@pdeffebach

If we go down that road we may end up in a place where we have a whole api which just does

function foo!(df)
    empy!(df)
    ...
    df
end

@ararslan

I'm not sure what you mean, @pdeffebach. What's being emptied and why?

Thank you for the explanation, @bkamins, that makes sense. I think personally I would disagree though with "I felt it is unsafe, because it drops rows from the left table in-place" since I would expect (by the nature of an inner join) that it could drop rows. I've definitely thought less about these things than you have though :smile:

@bkamins

If you feel such innerjoin! Would be useful please open an issue and we can discuss it. It is a minor issue with safety. What @pdeffebach means is that innerjoin! Would need to empty the left data frame and then fill it from scratch - there would be no performance benefit of innerjoin! Over innerjoin.

@pdeffebach

Yes, it could maybe be faster to modify the arrays in place, but for lots of scenarios it would be faster to allocate new ones.

There was some talk a while ago about !! for modifying the underlying arrays, and ! for modifying the data frame otherwise. But iirc that was deemed a bit too complicated

@nalimilan

In theory we could call deleteat! on left column vectors to avoid allocating new ones I guess? The benefit may not be super large but if you're short on memory it might matter.

@bkamins

Yes, that is why I ask @ararslan to open an issue to discuss as if the function is really useful we could add it.

ararslan avatar Dec 07 '21 21:12 ararslan

@ararslan two questions (that is why I wanted it to be persistently stored):

  1. do you want to keep order of rows in the left table with innerjoin! or you do not care.
  2. do you want to restrict the functionality to cases where are no duplicates in matching rows from the right table.

(currently if you want 1. you need 2. as otherwise rows of left table would get multiplied)

bkamins avatar Dec 07 '21 21:12 bkamins

  1. No preference
  2. Matching whatever leftjoin! does here seems worthwhile but I have no real strong preferences here either

ararslan avatar Dec 08 '21 00:12 ararslan

OK - I then put it on 1.4 release schedule. The release will not be soon, so I will wait a bit for possible comments, especially about expected functionality (since @ararslan does not have a strong preference 😄).

CC @nalimilan

bkamins avatar Dec 08 '21 07:12 bkamins

Related to this, an in-place semijoin could be similarly useful and perhaps more straightforward to implement.

ararslan avatar Dec 27 '21 21:12 ararslan

I would like to use innerjoin! like leftjoin! but to make sure that all values from left occur 'exactly once' in right instead of 'at most once'.

I often use leftjoin! to make sure the other columns are not mutated, so that I am sure everything will also work with memory mapped dataframes (using Arrow.jl), where columns are read-only (normal in-memory columns can still be added). If innerjoin! will be added I think it should also be usable for this use case. This would mean that it has to throw an exception when not every left row has exactly one match (since that is the only possible valid inner join without having to modify other columns). innerjoin! then also has the added benefit that it can use non-optional types for the new columns. This I would really like since I often do disallowmissing! directly after leftjoin!.

stanvanlier avatar Feb 28 '22 19:02 stanvanlier

@stanvanlier - so for innerjoin! you would want to assure that there is a bijection between keys left and right data frame? (which in consequence would mean, in particular, that the columns can disallow missings).

If this is the feature you want then maybe it would make more sense to add a kwarg for leftjoin! that would check for this (I am not sure what would be better - I just want to open a discussion so that we can pick a right design).

bkamins avatar Feb 28 '22 20:02 bkamins

Yes but not necessarily only a bijection since right may contain keys not being in left. (or is that still a bijection). So innerjoin!(x, y, on=:z) could work while innerjoin!(y, x, on=:z) throws a 'not all keys in y are are present in x' exception.

On the other hand, if semijoin! is also added, it probably should be consistent with innerjoin! and a kwarg for leftjoin! would be better.

Then again, if innerjoin! would be the same as x = innerjoin(x, y, on=:z) it will be (more) confusing that leftjoin! is not the same as x = leftjoin(x, y, on=:z)

stanvanlier avatar Mar 01 '22 11:03 stanvanlier

or is that still a bijection

no - it is injection (but not surjection), see https://en.wikipedia.org/wiki/Injective_function. I hope now I understand you correctly.

I will think what is best given your input and comment later. Thank you!

bkamins avatar Mar 01 '22 11:03 bkamins