DataFrames.jl
DataFrames.jl copied to clipboard
Merge two DataFrames only to missing values
Suppose I have an dataframe called df_missing
:
6×3 DataFrame
│ Row │ id │ val │ other |
│ │ Int64 │ Int64? │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1 │ 5 │ 1 │ 4 │
│ 2 │ 2 │ missing │ 4 │
│ 3 │ 1 │ 3 │ 3 │
│ 4 │ 4 │ 8 │ 4 │
│ 5 │ 6 │ 2 │ 4 │
│ 6 │ 8 │ missing │ 3 │
and I also have another dataset, called df_completion
:
2x2 DataFrame
│ Row │ id │ val |
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 2 │ 5 │
│ 2 │ 8 │ 13 │
So my suggestion is: couldn't there be a more straightforward way to replace the missing values using the common id of the two dataframes, thus creating df_full
?
6×3 DataFrame
│ Row │ id │ val │ other |
│ │ Int64 │ Int64? │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1 │ 5 │ 1 │ 4 │
│ 2 │ 2 │ 5 │ 4 │
│ 3 │ 1 │ 3 │ 3 │
│ 4 │ 4 │ 8 │ 4 │
│ 5 │ 6 │ 2 │ 4 │
│ 6 │ 8 │ 13 │ 3 │
The two current ways that seem to be the best ones are:
julia> df_missing[in(df_completion.id).(df_missing.id), :val] = df_completion.val
or
julia> df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true);
julia> df_full.val = map(df_full.val, df_full.val_1) do a, b
ismissing(a) ? b : a
end;
julia> select!(df_full, Not(:val_1))
besides, of course, creating a loop. Maybe there could be a method for merging
two DataFrames in this way? I think it's a very common problem.
I do not think you can do any better than joining and then combining the results. In particular in general (if if were to become a standard library function) you would have to define what should happen if either of the data frames has a duplicate entry in :id
, what should happen if the right data frame has some entry, and the left data frame also contains it but it is not missing etc.
I general I believe that there are so many possible options that it is hard to define an API for such a function, simply an appropriate join should be used. Note then that the last operation you mention can be written as df_full.val .= coalesce.(df_full.val, df_full.val_1)
which will be fast and update df_full.val
in place.
Stata has a good API for this via update
update
: update missing values of same-named variables in master with values from using
I think Bogumil's idea for this is elegant, and is good for 1.0. But I think following Stata on this could solve the coordination problem for what a good option is.
My proposal would be
leftjoin(df1, df2, on = :id, update = :val)
This would perform the coalesce automatically. update
could be a column identifier or a vector thereof. Such columns would be excempt from makeunique = true
.
A quick google search shows that data.table
has something similar to this but dplyr
does not .
What does Stata do when:
- duplicate entries in
df2
are present? Throw an error? - duplicate entries in
df1
key are present and for one of them the value is non-missing, while for the other the value is missing? Is this also an error? (or maybe theon
column must be unique in general also indf1
in this case)?
- duplicate entries in
df2
are present? Throw an error?
Interestingly, this will add both observations
. use df_completion_dups, clear
. list
+----------+
| id val |
|----------|
1. | 2 220 |
2. | 2 240 |
+----------+
. use df_missing_normal.dta, clear
. qui merge 1:m id using df_completion_dups, update
. list
+--------------------------------+
| id val _merge |
|--------------------------------|
1. | 1 100 master only (1) |
2. | 2 220 missing updated (4) |
3. | 2 240 missing updated (4) |
+--------------------------------+
- duplicate entries in
df1
key are present and for one of them the value is non-missing, while for the other the value is missing? Is this also an error? (or maybe theon
column must be unique in general also indf1
in this case)?
You can specify whether or not the matches need to be unique. merge 1:1
will throw an error if there are any duplicates, merge m:1
(many to 1) will throw an error if the RHS (called using
) has duplicates etc.
The default behavior in this case is also to add new rows. See below.
. use df_completion_dups, clear
. list
+----------+
| id val |
|----------|
1. | 2 220 |
2. | 2 240 |
+----------+
. use df_missing_normal.dta, clear
. qui merge 1:m id using df_completion_dups, update
. list
+--------------------------------+
| id val _merge |
|--------------------------------|
1. | 1 100 master only (1) |
2. | 2 220 missing updated (4) |
3. | 2 240 missing updated (4) |
+--------------------------------+
I will have to think more about the mapping of Stata's 1:m
syntax to our joins and whether or not the behavior I propose would be intuitive.
OK - let us try to find out a consistent design requirements (like we did select
/transform
/combine
combo) and then we can implement it.
(and I will be commenting from a "defensive position" :smile:)
In particular I think that this operation is most intuitive if also allowed to be done "in place" (but we need API for this).
Thanks. I will try and work on a proposal that is similar to stata but matches our current infrastructure. This is yet another opportunity to combine the best of stata and dplyr.
I guess I am not that late to the party. Here would be the approach from SQL
UPDATE A
SET a = B.x, b = B.y
FROM B
WHERE A.id = B.id
Some kind of functionality like this, in general an "update" would be really nice to add. We just need to find a general design that is flexible and future proof (i.e. to have a swiss army knife function like e.g. combine
not just function that does only one specific thing).
As proposed above maybe we can add to leftjoin
an update
kwarg which takes a:
- pair (
:left => :right
) - single value as a shorthand for
:val => :val
- vector of the above (if more updates are required in one operation)
As proposed above maybe we can add to leftjoin an update kwarg which takes a: • pair ( :left => :right ) • single value as a shorthand for :val => :val • vector of the above (if more updates are required in one operation)
This works as long as you're not supporting various options on the update itself. Adding kwargs that only modify one other kwarg seems like a bad idea and cries out for splitting.
This is not a feature I've ever needed to use, so I don't have strong opinions on it, but the fact that it's found in stata and SQL makes a compelling case to add it as a separate function that could have its own options.
(i.e. to have a swiss army knife function like e.g. combine not just function that does only one specific thing).
Not sure if you've ever owned a Swiss army knife, but the downside of being able to do everything is that you don't do any of them super well. Not sure how much to torture the metaphor, but as a kid I always wanted the big fat ones with tons of different tools, because they seemed so cool, but my (Swiss) grandfather's with 2 blades and a bottle opener was almost always a better choice for actually doing anything with.
As discussed on Slack - we have a tension here. We can either leave joins "as is" and add a new method doing updates of existing columns and working in place (something like update!
) or we can extend existing functions with these functionalities. I am not really sure what would be the best approach.
Adding a new function is probably easier to understand. On the other hand it would do almost the same as leftjoin
"on steroids" so I am not really sure what would be best.
I think another function might be best. Since I was also going to bring up that not only does Stata have an update
option, it also has a replace
option.
So adding that would risk bloating the function even further. I would propose update
and overwrite
as added functions.
I would propose
update
andoverwrite
as added functions.
What would be the exact difference between them? I was thinking of an API of the kind:
update!(df1, df2, on=..., args)
and then args would be things that would specify the operations you want (so you could mix different operations you want: just add a column, replace, fill missings, whatever other ideas we have)
One only replaces if missing
, the other overwrites the whole columns. Maybe they could be the same function, though.
I would prefer the same function and just have some nice way to specify the operation you want to do, I guess there are 4 natural operations:
- fill missings
- replace where second data frame has a value
- fully replace (if second data frame does not have a value fill with
missing
) - add a column
Actually, I've just thought of a problem with this approach. Lets say you have df1 and df2 where df2 has many new columns you are interested :age, :marital_status
but also compete information for some columns in df1
. So df1
and df2
both have the varriable :income
but :income
has missing values in df1
When you do
innerjoin(df1, df2, on = :id, makeunique = true)
you get :income
and :income_1
. A separate update
function wouldn't help with this
- No need to use
update
after the join happens: you can clean it up yourself easily - Doing
update
before the join happens will still result in either an error or:income_1
- You can't do just the
update
since then you would lose the demographic variables of interest.
This kind of scenario is an argument for an update
argument in innerjoin
.
I do not think it would be a problem, as adding demographic columns, would be a type 4 of the operation (just add columns).
So in a sense update
would be a kind of select
from df2
into df1
. If you just pass a column - it behaves in the same way as join
(except that by default nothing is added - join
is a kind of transform
in this sense), but you are allowed to do 3 different operations (update existing column in modes: only missing, everywhere where present and just replace old column).
In Stata terms this is the keepusing
keyword argument, where you choose to keep
certain variables from the using
(RHS) data set.
I'm not sure I totally agree with the approach above because generally the number of columns I want to be added and not overridden is an order of magnitude larger than the columns I do want overridden. I would inevitably do update(df1, df2, on = :id, overwrite = [:income], add = :)
because I don't want to specify all of the demographic variables to tack on individually.
An additional problem: We have 5 different join
functions, and all of them, save maybe antijoin
, could use this kind of feature. I'm worried that we will have to re-invent complicated rules for the creation of new columns.
Yeah, as usual - given the Stata experience - is it possible to put some proposal to start with?
The SQL syntax I find it easier (example of in-place),
UPDATE
t1
SET
t1.c1 = COALESCE(t1.c1, t2.c1)
FROM
t1
INNER JOIN t2
ON t1.id = t2.id
;
@Nosferican this is mostly what I have imagined ideally but without metaprogramming it is hard to come up with a concise syntax, i.e. it could be something like:
update!(fun, df1, df2, on=:id)
where fun
would get two data frames as an argument having only matching rows in df1
and df2
and would follow the same rules of processing output as in combine(fun, df)
. Then what it produces would be appended to df1
and if the produced result would have columns that match the existing column then the column would be updated in matching places.
So your example would read:
update!(t1, t2, on=:id) do (t1, t2)
return (c1 = coalesce.(t1.c1, t2.c2),)
end
This would be almost fully flexible (it would not allow full column replacement, but it is easy enough to almost achieve it using leftjoin
).
Still I am not fully happy with this (nor any other syntax I could come up with). But maybe we will find the update!
I propose "good enough"?
OK. I have thought about it and here are some conclusions. Joins in DataFrames.jl work in three steps (I omit implementation details):
- ROW MATCHING: appropriately resize
df1
anddf2
in terms of rows, drop fromdf2
joining columns, create two vectors indicating if given row was originally present indf1
anddf2
- TABLE MERGING: merge resized
df1
anddf2
- POST PROCESSING: optionally add a column which is a transformation of indicator column
What we discuss here is essentially replacing TABLE MERGING step by a custom function. This means (I give na example on leftjoin
but it applies to all other joins) that:
leftjoin(df1, df2, on=cols)
can be seen as calling
leftjoin(default_joiner, df1, df2, on=cols)
where default_joiner
is a function taking 4 arguments (resized df1
, resized df2
, vector of row indicators for df1
, vector of row indicators for df2
) and performs a merge of these two data frames (by default just hcat
-ing them; note that default_joiner
mutates the first argument passed in place).
And actually we could expose this leftjoin(default_joiner, df1, df2, on=cols)
as a public API and allow for users to provide custom joiners (where all things that are asked for can be implemented, most probably using do
block).
Now if we went this way it is also natural how leftjoin!
should be defined. Currently leftjoin
passes to default_joiner
a copy of df1
. Now leftjoin!
would manipulate df1
in place before passing it to default_joiner
and by this the effect would be that leftjoin!
would mutate df1
in place as needed.
The only problem is that leftjoin(custom_joiner, df1, df2, on=cols)
is a very low level API, but maybe it is good enough for practical purposes?
I also found this feature would be much helpful in our work.
the update!
indicates the contents of df1
will be refreshed by the corresponding contents of df2
, which match the leftjoin
, while merge!
means the content of df1
and df2
should be combined, which match the outerjoin
. so I think the join
APIs are appropriate to include this feature.
the problems here are how to handle content conflicts. current implementation will throw error or if makeunique=true
leave conflicts to user.
but for some common use cases, for example, to update/leftjoin
df1
, if conflicts occurs, we usually want to overwrite content of df1
with the corresponding content of df2
. From a user perspective, keywords makeunique=false, overwrite=true
will be clear enough to express the intentions. of course, the content here would be single cell in a spreadsheet, not the whole column. if missing is to filled as much as possible, means if df1
have missing and df2
not, missing is replaced, but if df1
not and df2
have missing, data will not be evacuated by missing, another keywords writemissing=false
should be ok.
for flexibility, a relatively low-level API @bkamins mentioned may be needed, but to the most users, I think join
would be a good target to incorporate this feature.
I do not think you can do any better than joining and then combining the results. In particular in general (if if were to become a standard library function) you would have to define what should happen if either of the data frames has a duplicate entry in
:id
, what should happen if the right data frame has some entry, and the left data frame also contains it but it is not missing etc.I general I believe that there are so many possible options that it is hard to define an API for such a function, simply an appropriate join should be used. Note then that the last operation you mention can be written as
df_full.val .= coalesce.(df_full.val, df_full_val_1)
which will be fast and updatedf_full.val
in place.
I guess df_full_val_1
should be df_full.val_1
?
If it's not rude to ask: May I know what does df_full.val_1
mean? A duplicate of df_full.val
? Also, could I know the mechanism underlying df_full.val .= coalesce.(df_full.val, df_full.val_1)
? I didn't find a clue either in Julia's Doc or in DataFrames's Doc.
Thanks for your tremendous contribution to DataFrames, @bkamins !
I guess
df_full_val_1
should bedf_full.val_1
?
Yes, it was a typo; fixed
May I know what does
df_full.val_1
mean?
It gives you a vector stored as a column val_1
in data frame df_full
. I have quoted it as it was created in the original question in : leftjoin(df_missing, df_completion, on = :id, makeunique = true)
operation. This style is not recommended (because it can be confusing as you witness) but it is supported. The thing is that df_missing
and df_completion
have the same column name, so makeunique=true
de-duplicates them by adding _1
suffix to the column name of the right data frame as it is explained in the docstring of leftjoin
function.
Also, could I know the mechanism underlying
df_full.val .= coalesce.(df_full.val, df_full.val_1)
This operation, mostly, has nothing to do with DataFrames.jl. It is a standard Julia Base broadcasting syntax as explained here in the Julia Manual, and coalesce
function is documented here in the Julia Manual. Additionally, broadcasting rules specific to DataFrames.jl are documented here.
I guess
df_full_val_1
should bedf_full.val_1
?Yes, it was a typo; fixed
May I know what does
df_full.val_1
mean?It gives you a vector stored as a column
val_1
in data framedf_full
. I have quoted it as it was created in the original question in :leftjoin(df_missing, df_completion, on = :id, makeunique = true)
operation. This style is not recommended (because it can be confusing as you witness) but it is supported. The thing is thatdf_missing
anddf_completion
have the same column name, somakeunique=true
de-duplicates them by adding_1
suffix to the column name of the right data frame as it is explained in the docstring ofleftjoin
function.Also, could I know the mechanism underlying
df_full.val .= coalesce.(df_full.val, df_full.val_1)
This operation, mostly, has nothing to do with DataFrames.jl. It is a standard Julia Base broadcasting syntax as explained here in the Julia Manual, and
coalesce
function is documented here in the Julia Manual. Additionally, broadcasting rules specific to DataFrames.jl are documented here.
Thank you so much @bkamins for this quick response! Even though df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true)
operation is not desirable, we have to run it before running df_full.val .= coalesce.(df_full.val, df_full.val_1)
. Otherwise, we won't be able to get df_full
in the first place.
Thank you for pointing me to the documentation for the doc syntax, coalesce
and broadcasting
. It helped me a lot!
OK. I think we are ready to add it in DataFrames.jl 1.5 release.
I was thinking about the signature and came up with (in-place version):
update!(df1, df2, cols...; on, matchmissing=:error, rule=:all)
(also a copying update
version would be added; also maybe the order of df1
and df2
should be reversed? - more on it below)
The update!
function, like leftjoin!
would require that df2
has unique matching rows (so if duplicates in matching rows are present an error is thrown)
The meaning of the arguments is:
-
cols
positional arguments specifying columns to be updated; if multiplecols
arguments are passed they are processed sequentially left to right. Anycols
argument can be:- any column selector (single column, vector,
Cols
,Not
, ...) that will be applied todf2
table; then columns fromdf1
with the same name will be updated (if columns are missing indf1
they are ignored) - a pair
df2_col => df1_col
indicating which column fromdf2
should be used to update which column indf1
(again - maybe the order should be reversed; in this case both selectors must be single column selectors); ifdf1_col
is missing error
- any column selector (single column, vector,
-
on
,matchmissing
: as in other joins -
rule
defining how to handle the update. The options would be:- function - takes two arguments the value from
df2
and the value fromdf1
and returning the desired value (the first passed value will be fromdf2
the second fromdf1
as this is the director of updating - but we might want to reverse this order) -
:all
- update all matching rows indf1
with rows fromdf2
, equivalent to(x, y) -> x
-
:missing
- only update rows indf1
that have amissing
value with a value fromdf2
, equivalent to passingcoalesce
function
- function - takes two arguments the value from
Now the things to discuss are:
- in three places the order of
df1
anddf2
matters; first isupdate!
call (if the updated table should go first or second); then there is a syntax in cols for doing column name matching; finally therule
function also depends on the order of arguments; I was not sure which order to use - are we OK to error if duplicate keys in
df2
are found (I find it natural, but maybe there are cases when this is not desirable) - I proposed to ignore situation when there are columns in
df2
that user wants to use that are missing indf1
, but throw an error if pair notation is used. The alternative behaviors could be to have an extra kwarg with three options: 1) error, 2) ignore, 3) add new column todf1
if it is missing (and assume it had allmissing
values initially)
CC @nalimilan
@jariji - this is another important PR where API design is a key challenge.
There's an operation I perform a lot. I'm adding here because I think it's related. Maybe this is worth considering now that the API is under development. Alternatively, it'd be nice to consider this functionality.
Basically, the type operation is as follows
gdf = groupby(df[condition,:], groups)
result = combine(gdf, col => func => pre_alloc)
leftjoin!(df, result, on=groups)
A MWE
dff = DataFrame(industry = ["beer", "beer", "wine", "wine"],
is_public =[true,false, true,false],
firm_revenue = collect(10:10:40))
4×3 DataFrame
Row │ industry is_public firm_revenue
│ String Bool Int64
─────┼───────────────────────────────────
1 │ beer true 10
2 │ beer false 20
3 │ wine true 30
4 │ wine false 40
gdf = groupby(dff[dff.is_public .== true,:], :industry)
result = combine(gdf, :firm_revenue => sum => :revenue_of_all_public_firms)
2×2 DataFrame
Row │ industry revenue_of_all_public_firms
│ String Int64
─────┼───────────────────────────────────────
1 │ beer 10
2 │ wine 30
and then i "update" the original dataframe
leftjoin!(dff, result, on=:industry)
4×4 DataFrame
Row │ industry is_public firm_revenue revenue_of_all_public_firms
│ String Bool Int64 Union{Missing, Int64}
─────┼────────────────────────────────────────────────────────────────
1 │ beer true 10 10
2 │ beer false 20 10
3 │ wine true 30 30
4 │ wine false 40 30
I guess update
could potentially handle this operation more efficiently than leftjoin
?
I was thinking of updating an empty column with the results of combine
. Notice here, unlike the case updating a dataframe using a separate dataframe, we'd update values with results that come from the parent DataFrame. Usually leftjoin!
is quite expensive in my work, but don't know if it's room for improvement for this particular application.
Thanks!!!
Currently you can almost do what you want efficiently with:
julia> dff = DataFrame(industry = ["beer", "beer", "wine", "wine"],
is_public =[true,false, true,false],
firm_revenue = collect(10:10:40))
4×3 DataFrame
Row │ industry is_public firm_revenue
│ String Bool Int64
─────┼───────────────────────────────────
1 │ beer true 10
2 │ beer false 20
3 │ wine true 30
4 │ wine false 40
julia> gdf = groupby(subset(dff, :is_public, view=true), :industry)
GroupedDataFrame with 2 groups based on key: industry
First Group (1 row): industry = "beer"
Row │ industry is_public firm_revenue
│ String Bool Int64
─────┼───────────────────────────────────
1 │ beer true 10
⋮
Last Group (1 row): industry = "wine"
Row │ industry is_public firm_revenue
│ String Bool Int64
─────┼───────────────────────────────────
1 │ wine true 30
julia> transform!(gdf, :firm_revenue => sum => :revenue_of_all_public_firms)
2×4 SubDataFrame
Row │ industry is_public firm_revenue revenue_of_all_public_firms
│ String Bool Int64 Union{Missing, Int64}
─────┼────────────────────────────────────────────────────────────────
1 │ beer true 10 10
2 │ wine true 30 30
julia> dff
4×4 DataFrame
Row │ industry is_public firm_revenue revenue_of_all_public_firms
│ String Bool Int64 Union{Missing, Int64}
─────┼────────────────────────────────────────────────────────────────
1 │ beer true 10 10
2 │ beer false 20 missing
3 │ wine true 30 30
4 │ wine false 40 missing
almost - because, as you can see, we will have missing
in the rows that do not match the selection (this can be fixed in post-processing currently).
Another approach, but it is less general is:
julia> dff = DataFrame(industry = ["beer", "beer", "wine", "wine"],
is_public =[true,false, true,false],
firm_revenue = collect(10:10:40))
4×3 DataFrame
Row │ industry is_public firm_revenue
│ String Bool Int64
─────┼───────────────────────────────────
1 │ beer true 10
2 │ beer false 20
3 │ wine true 30
4 │ wine false 40
julia> gdf = groupby(dff, :industry)
GroupedDataFrame with 2 groups based on key: industry
First Group (2 rows): industry = "beer"
Row │ industry is_public firm_revenue
│ String Bool Int64
─────┼───────────────────────────────────
1 │ beer true 10
2 │ beer false 20
⋮
Last Group (2 rows): industry = "wine"
Row │ industry is_public firm_revenue
│ String Bool Int64
─────┼───────────────────────────────────
1 │ wine true 30
2 │ wine false 40
julia> transform!(gdf, [:firm_revenue, :is_public] => ((a, b) -> sum(a[b])) => :revenue_of_all_public_firms) # this would be shorter with DataFramesMeta.jl
4×4 DataFrame
Row │ industry is_public firm_revenue revenue_of_all_public_firms
│ String Bool Int64 Int64
─────┼────────────────────────────────────────────────────────────────
1 │ beer true 10 10
2 │ beer false 20 10
3 │ wine true 30 30
4 │ wine false 40 30
Yes, that was my point. I use combine
+ leftjoin!
to avoid having missing values, and because methods like the second example are less general and only applicable to specific cases. My solution is simple in terms of syntax, but really slow. That's why I thought that update
could handle this.
I was thinking along the lines of using:
- create
view
ofdf
with some condition - create
gdf
with groups - use
combine
and getgdf_combine
-
update(parent_df, gdf_combine, on=groups).
It'd be like fillingfill
the parent df for each combination ofgroups
.
These operations are quite common I think. Basically, you put all industry information in columns to compute each firm's result relative to industry measures (e.g., a market share, although computing market shares can obviously be done in a simpler way than all this).
Maybe update
needs to handle more general cases, and this should be done through a separate function. Just wanted to comment this, so you could have it in mind.
Thanks!!!