DataFrames.jl
Match missing to all values
I propose another `matchmissing=` setting, where `missing` on the left side matches every value on the right, not just `missing` on the right. We don't know which values the `missing`s match, and sometimes I want to match all of them.
I understand you mean joins. In what cases would you want to do this? This option is disallowed because it would essentially be a Cartesian join (almost), so it is only feasible for very small tables. I understand you have a practical use case for such a scenario?
The join is on `[:a, :b]`, and only `:b` has missings, so the tables can be large, since the match-all policy applies only within rows for which `:a` already matches.
The use case is to refine a join by adding the additional column `:b` to the `on=` list, but `:b` is sometimes missing, in which case it should not restrict the output.
To be more precise, I want to say `matchmissing` for `:left_b => :right_b` is `:matchall`, but `matchmissing` for `:a` is `:error`.
The existing way to do this is to join only on `:a` and then, in a second step, `subset` based on `:b`.
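For reference, the two-step workaround described above might look roughly like the sketch below. The tables `l` and `r`, their payload columns `:x` and `:y`, and the `"_left"`/`"_right"` suffixes are made up for illustration. The sketch assumes missings occur only in the left table's `:b`.

```julia
using DataFrames

# Hypothetical example tables: only the left :b contains a missing.
l = DataFrame(a = [1, 1, 2], b = [10, missing, 20], x = ["l1", "l2", "l3"])
r = DataFrame(a = [1, 1, 2], b = [10, 11, 21], y = ["r1", "r2", "r3"])

# Step 1: join on :a only. renamecols suffixes the non-key columns,
# so both :b columns survive as :b_left and :b_right.
j = innerjoin(l, r; on = :a, renamecols = "_left" => "_right")

# Step 2: keep a row if the left :b is missing (matches everything)
# or if it is equal to the right :b.
res = subset(j, [:b_left, :b_right] =>
    ByRow((bl, br) -> ismissing(bl) || isequal(bl, br)))
```

The intermediate table `j` holds every pairing of rows that share `:a`, so this only stays cheap when the groups defined by `:a` are small.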
innerjoin(l,r; on=[:a,:b], matchmissing=[:a=>:error, :b => :matchall])
This is close, though it's still ambiguous whether the `missing`s are expected to be on the left, the right, or both.
I thought about it. We could add such an extension, but maybe it is better to handle this in https://github.com/JuliaData/DataFrames.jl/issues/2738 (where I assume that any condition for matching rows can be passed). Would this be sufficient?