DataFrames.jl

Match missing to all values

Open jariji opened this issue 1 year ago • 3 comments

I propose another matchmissing= setting, where missing on the left side matches every value on the right, not just missing on the right. We don't know which values the missings match, and sometimes I want to match all of them.
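For context, here is a minimal sketch of how two existing matchmissing= settings behave, assuming a recent DataFrames.jl release. The tables l and r and the columns x and y are hypothetical toy data, not taken from this issue.

```julia
using DataFrames

# Hypothetical toy tables; the names l, r, x, y are illustrative only.
l = DataFrame(b = [1, missing], x = ["l1", "l2"])
r = DataFrame(b = [1, 2, missing], y = ["r1", "r2", "r3"])

# With matchmissing = :equal, a missing key matches only another missing key.
eq = innerjoin(l, r; on = :b, matchmissing = :equal)

# With the default (matchmissing = :error), missing keys make the join throw.
threw = try
    innerjoin(l, r; on = :b)
    false
catch
    true
end
```

Under the setting proposed here, the row of l with a missing :b would instead pair with every row of r.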

jariji avatar Jul 15 '22 07:07 jariji

I understand you mean joins. In what cases would you want to do this? This option is disallowed because it would be (almost) a Cartesian join, so it is only feasible for very small tables. I assume you have a practical use case for such a scenario?

bkamins avatar Jul 15 '22 07:07 bkamins

The join is on [:a,:b], and only :b has missings, so the tables can be large since the match-all policy works only within rows for which :a already matches.

The use case is to refine a join by adding the additional column :b to the on= list, but :b is sometimes missing, in which case it should not restrict the output.

To be more precise, I want to set matchmissing to :matchall for :left_b => :right_b but to :error for :a.

The existing way to do this is to join only on :a and then in a second step subset based on :b.

innerjoin(l,r; on=[:a,:b], matchmissing=[:a=>:error, :b => :matchall])

This is close, though it's still ambiguous whether the missings are expected to be on the left or the right or both.
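The existing two-step workaround mentioned above (join on :a only, then subset on :b) could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the tables, the x and y columns, and the "_left"/"_right" suffixes are hypothetical, and missings are assumed to appear only in the left table's :b.

```julia
using DataFrames

# Hypothetical tables: :a is never missing, :b is sometimes missing on the left.
l = DataFrame(a = [1, 1, 2], b = [10, missing, 20], x = ["l1", "l2", "l3"])
r = DataFrame(a = [1, 1, 2, 2], b = [10, 11, 20, 21], y = ["r1", "r2", "r3", "r4"])

# Step 1: join on :a only. renamecols appends a suffix to the non-key columns,
# so the two :b columns become :b_left and :b_right.
joined = innerjoin(l, r; on = :a, renamecols = "_left" => "_right")

# Step 2: keep a row if the left :b is missing (match everything)
# or if both :b values are equal.
result = subset(
    joined,
    [:b_left, :b_right] => ByRow((bl, br) -> ismissing(bl) || isequal(bl, br)),
)
```

Because step 1 only pairs rows whose :a already matches, the intermediate table grows within each :a group rather than across the whole table, which is the point made above about large tables.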

jariji avatar Jul 15 '22 07:07 jariji

I thought about it. We could add such an extension, but maybe it is better to handle this in https://github.com/JuliaData/DataFrames.jl/issues/2738 (where I assume that any condition for matching rows can be passed). Would this be sufficient?

bkamins avatar Dec 27 '22 21:12 bkamins