polars
polars copied to clipboard
Don't coalesce joining column by default in left join?
Description
Background
In #12963 it was decided that, in outer joins, columns wouldn't be coalesced - instead, both should be kept
Shouldn't the same be done for left joins?
If the names don't overlap, this is what pandas does:
In [54]: left = pd.DataFrame({'a': [1,2]})
...: right = pd.DataFrame({'b': [2,4]})
...: left.merge(right, left_on='a', right_on='b', how='left')
Out[54]:
a b
0 1 NaN
1 2 2.0
In [55]: left = pl.DataFrame({'a': [1,2]})
...: right = pl.DataFrame({'b': [2,4]})
...: left.join(right, left_on='a', right_on='b', how='left')
Out[55]:
shape: (2, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
└─────┘
If the names do overlap, pandas only keeps a single column:
In [61]: left = pd.DataFrame({'a': [1,2]})
...: right = pd.DataFrame({'a': [2,4]})
...: left.merge(right, left_on='a', right_on='a', how='left')
Out[61]:
a
0 1
1 2
Request
In light of the linked PR, I think I'd have expected:
In [63]: left = pl.DataFrame({'a': [1,2]})
...: right = pl.DataFrame({'a': [2,4]})
...: left.join(right, left_on='a', right_on='a', how='outer')
Out[63]:
shape: (2, 2)
┌──────┬─────────┐
│ a ┆ a_right │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪═════════╡
│ 2 ┆ 2 │
│ 1 ┆ null │
└──────┴─────────┘
and
In [64]: left = pl.DataFrame({'a': [1,2]})
...: right = pl.DataFrame({'b': [2,4]})
...: left.join(right, left_on='a', right_on='b', how='outer')
Out[64]:
shape: (2, 2)
┌──────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 2 ┆ 2 │
│ 1 ┆ null │
└──────┴──────┘
and for the current behaviour to be given by how='left_coalesce'
This is overlapping with these 3:
https://github.com/pola-rs/polars/issues/13130 https://github.com/pola-rs/polars/issues/13220 https://github.com/pola-rs/polars/issues/6165
I think we should instead have a separate parameter called coalesce
which is a bool.
In joining, the only time any of this matters is a left/right/outer join, in which the key on one side has no corresponding key on the other. This could all be captured by a single boolean column that indicates whether a join was successful or not (or an enum that says 'Left'/'Right'/'Both' to indicate which frame the value came from). But barring that, the only reason to distinguish between whether the value came from the left or right frame is when one is null and the other isn't.
The non-coalesce situation to me feels very redundant. If you have many columns, they're going to look like this (say we have a left join)
Key | A_left | A_right | B_left | B_right | C_left | C_right | comment |
---|---|---|---|---|---|---|---|
key1 | a1 | null | b1 | null | c1 | null | no match |
key2 | a2 | a2 | b2 | b2 | c2 | c2 | match |
key3 | a3 | a3 | b3 | b3 | c3 | c3 | match |
In other words, all of our _right
columns are redundant with a single column declaring whether there was a successful match or not.
@mcrumiller A couple counter examples.
- join_asof it wouldn't make sense to do the comment column.
- It also wouldn't make sense if we're doing
left_on='a', right_on=pl.col('b').expression_that_makes_b_look_like_a()
if we want to retain what b started as (something like truncating a datetime, for example.)
join_asof
only works on a single join column, but yes I agree there that it would be nice to retain the asof
column and that the above example does not apply in that case.
For your second example (I assume we're no longer talking about an asof-join), I think my example still does apply. Column b
would be returned as if it were a non-key column. The expression defining the right join key column would end up just being B_right
above--either exactly equal to a
or null, with no exceptions.
I am opposed to a coalesce
boolean flag as it is not valid for all join types. I do however think we must change the left join and create a "left_coalesce"
join.
Is the plan to do this in some 0.20.x
release or in 1.0
?
In my projects that use polars at work, I'm pinning specific versions instead of just pinning polars < 1.0
, because I'm scared wondering when this change is going to happen.
@s-banach it's currently milestoned for 1.0
We want to pick this up for 1.0. The current blocker is that we do not support left outer non-coalesce join in the streaming engine.
This will be deprecated in 0.20.x, and broken with the 1.0.0 release. I'll pick up the deprecation.