DataFrames.jl Merge two DataFrames only to missing values

Merge two DataFrames only to missing values

Open chedieck opened this issue 4 years ago • 36 comments

Suppose I have an dataframe called df_missing:

6×3 DataFrame
│ Row │ id    │ val     │ other |
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ missing │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ missing │ 3     │

and I also have another dataset, called df_completion:

2x2 DataFrame
│ Row │ id    │ val   |
│     │ Int64 │ Int64 │ 
├─────┼───────┼───────┤
│ 1   │ 2     │ 5     │ 
│ 2   │ 8     │ 13    │

So my suggestion is: couldn't there be a more straightforward way to replace the missing values using the common id of the two dataframes, thus creating df_full?

6×3 DataFrame
│ Row │ id    │ val     │ other |
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ 5       │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ 13      │ 3     │

The two current ways that seem to be the best ones are: julia> df_missing[in(df_completion.id).(df_missing.id), :val] = df_completion.val or

julia> df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true);

julia> df_full.val = map(df_full.val, df_full.val_1) do a, b
       ismissing(a) ? b : a
       end;

julia> select!(df_full, Not(:val_1))

besides, of course, creating a loop. Maybe there could be a method for merging two DataFrames in this way? I think it's a very common problem.

May 11 '20 10:05 chedieck

I do not think you can do any better than joining and then combining the results. In particular in general (if if were to become a standard library function) you would have to define what should happen if either of the data frames has a duplicate entry in :id, what should happen if the right data frame has some entry, and the left data frame also contains it but it is not missing etc.

I general I believe that there are so many possible options that it is hard to define an API for such a function, simply an appropriate join should be used. Note then that the last operation you mention can be written as df_full.val .= coalesce.(df_full.val, df_full.val_1) which will be fast and update df_full.val in place.

May 11 '20 11:05 bkamins

Stata has a good API for this via update

update: update missing values of same-named variables in master with values from using

I think Bogumil's idea for this is elegant, and is good for 1.0. But I think following Stata on this could solve the coordination problem for what a good option is.

My proposal would be

leftjoin(df1, df2, on = :id, update = :val)

This would perform the coalesce automatically. update could be a column identifier or a vector thereof. Such columns would be excempt from makeunique = true.

A quick google search shows that data.table has something similar to this but dplyr does not .

May 11 '20 11:05 pdeffebach

What does Stata do when:

duplicate entries in df2 are present? Throw an error?
duplicate entries in df1 key are present and for one of them the value is non-missing, while for the other the value is missing? Is this also an error? (or maybe the on column must be unique in general also in df1 in this case)?

May 11 '20 12:05 bkamins

duplicate entries in df2 are present? Throw an error?

Interestingly, this will add both observations

. use df_completion_dups, clear

. list

     +----------+
     | id   val |
     |----------|
  1. |  2   220 |
  2. |  2   240 |
     +----------+

. use df_missing_normal.dta, clear

. qui merge 1:m id using df_completion_dups, update

. list

     +--------------------------------+
     | id   val                _merge |
     |--------------------------------|
  1. |  1   100       master only (1) |
  2. |  2   220   missing updated (4) |
  3. |  2   240   missing updated (4) |
     +--------------------------------+

duplicate entries in df1 key are present and for one of them the value is non-missing, while for the other the value is missing? Is this also an error? (or maybe the on column must be unique in general also in df1 in this case)?

You can specify whether or not the matches need to be unique. merge 1:1 will throw an error if there are any duplicates, merge m:1 (many to 1) will throw an error if the RHS (called using) has duplicates etc.

The default behavior in this case is also to add new rows. See below.

. use df_completion_dups, clear

. list

     +----------+
     | id   val |
     |----------|
  1. |  2   220 |
  2. |  2   240 |
     +----------+

. use df_missing_normal.dta, clear

. qui merge 1:m id using df_completion_dups, update

. list

     +--------------------------------+
     | id   val                _merge |
     |--------------------------------|
  1. |  1   100       master only (1) |
  2. |  2   220   missing updated (4) |
  3. |  2   240   missing updated (4) |
     +--------------------------------+

I will have to think more about the mapping of Stata's 1:m syntax to our joins and whether or not the behavior I propose would be intuitive.

May 11 '20 13:05 pdeffebach

OK - let us try to find out a consistent design requirements (like we did select/transform/combine combo) and then we can implement it.

(and I will be commenting from a "defensive position" :smile:)

In particular I think that this operation is most intuitive if also allowed to be done "in place" (but we need API for this).

May 11 '20 13:05 bkamins

Thanks. I will try and work on a proposal that is similar to stata but matches our current infrastructure. This is yet another opportunity to combine the best of stata and dplyr.

May 11 '20 13:05 pdeffebach

I guess I am not that late to the party. Here would be the approach from SQL

UPDATE A
SET a = B.x, b = B.y
FROM B
WHERE A.id = B.id

May 14 '20 22:05 Nosferican

Some kind of functionality like this, in general an "update" would be really nice to add. We just need to find a general design that is flexible and future proof (i.e. to have a swiss army knife function like e.g. combine not just function that does only one specific thing).

As proposed above maybe we can add to leftjoin an update kwarg which takes a:

pair (:left => :right)
single value as a shorthand for :val => :val
vector of the above (if more updates are required in one operation)

May 14 '20 22:05 bkamins

As proposed above maybe we can add to leftjoin an update kwarg which takes a: • pair ( :left => :right ) • single value as a shorthand for :val => :val • vector of the above (if more updates are required in one operation)

This works as long as you're not supporting various options on the update itself. Adding kwargs that only modify one other kwarg seems like a bad idea and cries out for splitting.

This is not a feature I've ever needed to use, so I don't have strong opinions on it, but the fact that it's found in stata and SQL makes a compelling case to add it as a separate function that could have its own options.

(i.e. to have a swiss army knife function like e.g. combine not just function that does only one specific thing).

Not sure if you've ever owned a Swiss army knife, but the downside of being able to do everything is that you don't do any of them super well. Not sure how much to torture the metaphor, but as a kid I always wanted the big fat ones with tons of different tools, because they seemed so cool, but my (Swiss) grandfather's with 2 blades and a bottle opener was almost always a better choice for actually doing anything with.

May 26 '20 10:05 kescobo

As discussed on Slack - we have a tension here. We can either leave joins "as is" and add a new method doing updates of existing columns and working in place (something like update!) or we can extend existing functions with these functionalities. I am not really sure what would be the best approach.

Adding a new function is probably easier to understand. On the other hand it would do almost the same as leftjoin "on steroids" so I am not really sure what would be best.

May 26 '20 13:05 bkamins

I think another function might be best. Since I was also going to bring up that not only does Stata have an update option, it also has a replace option.

So adding that would risk bloating the function even further. I would propose update and overwrite as added functions.

May 26 '20 13:05 pdeffebach

I would propose update and overwrite as added functions.

What would be the exact difference between them? I was thinking of an API of the kind:

update!(df1, df2, on=..., args) and then args would be things that would specify the operations you want (so you could mix different operations you want: just add a column, replace, fill missings, whatever other ideas we have)

May 26 '20 13:05 bkamins

One only replaces if missing, the other overwrites the whole columns. Maybe they could be the same function, though.

May 26 '20 13:05 pdeffebach

I would prefer the same function and just have some nice way to specify the operation you want to do, I guess there are 4 natural operations:

fill missings
replace where second data frame has a value
fully replace (if second data frame does not have a value fill with missing)
add a column

May 26 '20 14:05 bkamins

Actually, I've just thought of a problem with this approach. Lets say you have df1 and df2 where df2 has many new columns you are interested :age, :marital_status but also compete information for some columns in df1. So df1 and df2 both have the varriable :income but :income has missing values in df1

When you do

innerjoin(df1, df2, on = :id, makeunique = true)

you get :income and :income_1. A separate update function wouldn't help with this

No need to use update after the join happens: you can clean it up yourself easily
Doing update before the join happens will still result in either an error or :income_1
You can't do just the update since then you would lose the demographic variables of interest.

This kind of scenario is an argument for an update argument in innerjoin.

May 26 '20 14:05 pdeffebach

I do not think it would be a problem, as adding demographic columns, would be a type 4 of the operation (just add columns).

So in a sense update would be a kind of select from df2 into df1. If you just pass a column - it behaves in the same way as join (except that by default nothing is added - join is a kind of transform in this sense), but you are allowed to do 3 different operations (update existing column in modes: only missing, everywhere where present and just replace old column).

May 26 '20 14:05 bkamins

In Stata terms this is the keepusing keyword argument, where you choose to keep certain variables from the using (RHS) data set.

I'm not sure I totally agree with the approach above because generally the number of columns I want to be added and not overridden is an order of magnitude larger than the columns I do want overridden. I would inevitably do update(df1, df2, on = :id, overwrite = [:income], add = :) because I don't want to specify all of the demographic variables to tack on individually.

An additional problem: We have 5 different join functions, and all of them, save maybe antijoin, could use this kind of feature. I'm worried that we will have to re-invent complicated rules for the creation of new columns.

May 26 '20 14:05 pdeffebach

Yeah, as usual - given the Stata experience - is it possible to put some proposal to start with?

May 26 '20 15:05 bkamins

The SQL syntax I find it easier (example of in-place),

UPDATE 
    t1
SET 
    t1.c1 = COALESCE(t1.c1, t2.c1)
FROM 
    t1
    INNER JOIN t2
    ON t1.id = t2.id
;

May 26 '20 15:05 Nosferican

@Nosferican this is mostly what I have imagined ideally but without metaprogramming it is hard to come up with a concise syntax, i.e. it could be something like:

update!(fun, df1, df2, on=:id)

where fun would get two data frames as an argument having only matching rows in df1 and df2 and would follow the same rules of processing output as in combine(fun, df). Then what it produces would be appended to df1 and if the produced result would have columns that match the existing column then the column would be updated in matching places.

So your example would read:

update!(t1, t2, on=:id) do (t1, t2)
    return (c1 = coalesce.(t1.c1, t2.c2),)
end

This would be almost fully flexible (it would not allow full column replacement, but it is easy enough to almost achieve it using leftjoin).

Still I am not fully happy with this (nor any other syntax I could come up with). But maybe we will find the update! I propose "good enough"?

May 26 '20 17:05 bkamins

OK. I have thought about it and here are some conclusions. Joins in DataFrames.jl work in three steps (I omit implementation details):

ROW MATCHING: appropriately resize df1 and df2 in terms of rows, drop from df2 joining columns, create two vectors indicating if given row was originally present in df1 and df2
TABLE MERGING: merge resized df1 and df2
POST PROCESSING: optionally add a column which is a transformation of indicator column

What we discuss here is essentially replacing TABLE MERGING step by a custom function. This means (I give na example on leftjoin but it applies to all other joins) that:

leftjoin(df1, df2, on=cols)

can be seen as calling

leftjoin(default_joiner, df1, df2, on=cols)

where default_joiner is a function taking 4 arguments (resized df1, resized df2, vector of row indicators for df1, vector of row indicators for df2) and performs a merge of these two data frames (by default just hcat-ing them; note that default_joiner mutates the first argument passed in place).

And actually we could expose this leftjoin(default_joiner, df1, df2, on=cols) as a public API and allow for users to provide custom joiners (where all things that are asked for can be implemented, most probably using do block).

Now if we went this way it is also natural how leftjoin! should be defined. Currently leftjoin passes to default_joiner a copy of df1. Now leftjoin! would manipulate df1 in place before passing it to default_joiner and by this the effect would be that leftjoin! would mutate df1 in place as needed.

The only problem is that leftjoin(custom_joiner, df1, df2, on=cols) is a very low level API, but maybe it is good enough for practical purposes?

May 26 '20 22:05 bkamins

I also found this feature would be much helpful in our work.

the update! indicates the contents of df1 will be refreshed by the corresponding contents of df2, which match the leftjoin, while merge! means the content of df1 and df2 should be combined, which match the outerjoin. so I think the join APIs are appropriate to include this feature.

the problems here are how to handle content conflicts. current implementation will throw error or if makeunique=true leave conflicts to user.

but for some common use cases, for example, to update/leftjoin df1, if conflicts occurs, we usually want to overwrite content of df1 with the corresponding content of df2. From a user perspective, keywords makeunique=false, overwrite=true will be clear enough to express the intentions. of course, the content here would be single cell in a spreadsheet, not the whole column. if missing is to filled as much as possible, means if df1 have missing and df2 not, missing is replaced, but if df1 not and df2 have missing, data will not be evacuated by missing, another keywords writemissing=false should be ok.

for flexibility, a relatively low-level API @bkamins mentioned may be needed, but to the most users, I think join would be a good target to incorporate this feature.

Jul 26 '20 19:07 babaq

I do not think you can do any better than joining and then combining the results. In particular in general (if if were to become a standard library function) you would have to define what should happen if either of the data frames has a duplicate entry in :id, what should happen if the right data frame has some entry, and the left data frame also contains it but it is not missing etc.

I general I believe that there are so many possible options that it is hard to define an API for such a function, simply an appropriate join should be used. Note then that the last operation you mention can be written as df_full.val .= coalesce.(df_full.val, df_full_val_1) which will be fast and update df_full.val in place.

I guess df_full_val_1 should be df_full.val_1?

If it's not rude to ask: May I know what does df_full.val_1 mean? A duplicate of df_full.val? Also, could I know the mechanism underlying df_full.val .= coalesce.(df_full.val, df_full.val_1) ? I didn't find a clue either in Julia's Doc or in DataFrames's Doc.

Thanks for your tremendous contribution to DataFrames, @bkamins !

Jun 29 '21 21:06 hongtaoh

I guess df_full_val_1 should be df_full.val_1?

Yes, it was a typo; fixed

May I know what does df_full.val_1 mean?

It gives you a vector stored as a column val_1 in data frame df_full. I have quoted it as it was created in the original question in : leftjoin(df_missing, df_completion, on = :id, makeunique = true) operation. This style is not recommended (because it can be confusing as you witness) but it is supported. The thing is that df_missing and df_completion have the same column name, so makeunique=true de-duplicates them by adding _1 suffix to the column name of the right data frame as it is explained in the docstring of leftjoin function.

Also, could I know the mechanism underlying df_full.val .= coalesce.(df_full.val, df_full.val_1)

This operation, mostly, has nothing to do with DataFrames.jl. It is a standard Julia Base broadcasting syntax as explained here in the Julia Manual, and coalesce function is documented here in the Julia Manual. Additionally, broadcasting rules specific to DataFrames.jl are documented here.

Jun 29 '21 22:06 bkamins

I guess df_full_val_1 should be df_full.val_1?

Yes, it was a typo; fixed

May I know what does df_full.val_1 mean?

It gives you a vector stored as a column val_1 in data frame df_full. I have quoted it as it was created in the original question in : leftjoin(df_missing, df_completion, on = :id, makeunique = true) operation. This style is not recommended (because it can be confusing as you witness) but it is supported. The thing is that df_missing and df_completion have the same column name, so makeunique=true de-duplicates them by adding _1 suffix to the column name of the right data frame as it is explained in the docstring of leftjoin function.

Also, could I know the mechanism underlying df_full.val .= coalesce.(df_full.val, df_full.val_1)

This operation, mostly, has nothing to do with DataFrames.jl. It is a standard Julia Base broadcasting syntax as explained here in the Julia Manual, and coalesce function is documented here in the Julia Manual. Additionally, broadcasting rules specific to DataFrames.jl are documented here.

Thank you so much @bkamins for this quick response! Even though df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true) operation is not desirable, we have to run it before running df_full.val .= coalesce.(df_full.val, df_full.val_1). Otherwise, we won't be able to get df_full in the first place.

Thank you for pointing me to the documentation for the doc syntax, coalesce and broadcasting. It helped me a lot!

Jun 30 '21 19:06 hongtaoh

OK. I think we are ready to add it in DataFrames.jl 1.5 release.

I was thinking about the signature and came up with (in-place version):

update!(df1, df2, cols...; on, matchmissing=:error, rule=:all)

(also a copying update version would be added; also maybe the order of df1 and df2 should be reversed? - more on it below)

The update! function, like leftjoin! would require that df2 has unique matching rows (so if duplicates in matching rows are present an error is thrown)

The meaning of the arguments is:

cols positional arguments specifying columns to be updated; if multiple cols arguments are passed they are processed sequentially left to right. Any cols argument can be:
- any column selector (single column, vector, Cols, Not, ...) that will be applied to df2 table; then columns from df1 with the same name will be updated (if columns are missing in df1 they are ignored)
- a pair df2_col => df1_col indicating which column from df2 should be used to update which column in df1 (again - maybe the order should be reversed; in this case both selectors must be single column selectors); if df1_col is missing error
on, matchmissing: as in other joins
rule defining how to handle the update. The options would be:
- function - takes two arguments the value from df2 and the value from df1 and returning the desired value (the first passed value will be from df2 the second from df1 as this is the director of updating - but we might want to reverse this order)
- :all - update all matching rows in df1 with rows from df2, equivalent to (x, y) -> x
- :missing - only update rows in df1 that have a missing value with a value from df2, equivalent to passing coalesce function

Now the things to discuss are:

in three places the order of df1 and df2 matters; first is update! call (if the updated table should go first or second); then there is a syntax in cols for doing column name matching; finally the rule function also depends on the order of arguments; I was not sure which order to use
are we OK to error if duplicate keys in df2 are found (I find it natural, but maybe there are cases when this is not desirable)
I proposed to ignore situation when there are columns in df2 that user wants to use that are missing in df1, but throw an error if pair notation is used. The alternative behaviors could be to have an extra kwarg with three options: 1) error, 2) ignore, 3) add new column to df1 if it is missing (and assume it had all missing values initially)

CC @nalimilan

Dec 11 '22 08:12 bkamins

@jariji - this is another important PR where API design is a key challenge.

Dec 27 '22 21:12 bkamins

There's an operation I perform a lot. I'm adding here because I think it's related. Maybe this is worth considering now that the API is under development. Alternatively, it'd be nice to consider this functionality.

Basically, the type operation is as follows

gdf = groupby(df[condition,:], groups)

result = combine(gdf, col => func => pre_alloc)
leftjoin!(df, result, on=groups)

A MWE

dff = DataFrame(industry = ["beer", "beer", "wine", "wine"], 
                is_public =[true,false, true,false], 
                firm_revenue = collect(10:10:40))

4×3 DataFrame
 Row │ industry  is_public  firm_revenue 
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
   2 │ beer          false            20
   3 │ wine           true            30
   4 │ wine          false            40


gdf = groupby(dff[dff.is_public .== true,:], :industry)
result = combine(gdf, :firm_revenue => sum => :revenue_of_all_public_firms)

2×2 DataFrame
 Row │ industry  revenue_of_all_public_firms 
     │ String    Int64
─────┼───────────────────────────────────────
   1 │ beer                               10
   2 │ wine                               30

and then i "update" the original dataframe

leftjoin!(dff, result, on=:industry)
4×4 DataFrame
 Row │ industry  is_public  firm_revenue  revenue_of_all_public_firms 
     │ String    Bool       Int64         Union{Missing, Int64}
─────┼────────────────────────────────────────────────────────────────
   1 │ beer           true            10                           10
   2 │ beer          false            20                           10
   3 │ wine           true            30                           30
   4 │ wine          false            40                           30

I guess update could potentially handle this operation more efficiently than leftjoin? I was thinking of updating an empty column with the results of combine. Notice here, unlike the case updating a dataframe using a separate dataframe, we'd update values with results that come from the parent DataFrame. Usually leftjoin! is quite expensive in my work, but don't know if it's room for improvement for this particular application.

Thanks!!!

Feb 21 '23 02:02 alfaromartino

Currently you can almost do what you want efficiently with:

julia> dff = DataFrame(industry = ["beer", "beer", "wine", "wine"],
                       is_public =[true,false, true,false],
                       firm_revenue = collect(10:10:40))
4×3 DataFrame
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
   2 │ beer          false            20
   3 │ wine           true            30
   4 │ wine          false            40

julia> gdf = groupby(subset(dff, :is_public, view=true), :industry)
GroupedDataFrame with 2 groups based on key: industry
First Group (1 row): industry = "beer"
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
⋮
Last Group (1 row): industry = "wine"
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ wine           true            30

julia> transform!(gdf, :firm_revenue => sum => :revenue_of_all_public_firms)
2×4 SubDataFrame
 Row │ industry  is_public  firm_revenue  revenue_of_all_public_firms
     │ String    Bool       Int64         Union{Missing, Int64}
─────┼────────────────────────────────────────────────────────────────
   1 │ beer           true            10                           10
   2 │ wine           true            30                           30

julia> dff
4×4 DataFrame
 Row │ industry  is_public  firm_revenue  revenue_of_all_public_firms
     │ String    Bool       Int64         Union{Missing, Int64}
─────┼────────────────────────────────────────────────────────────────
   1 │ beer           true            10                           10
   2 │ beer          false            20                      missing
   3 │ wine           true            30                           30
   4 │ wine          false            40                      missing

almost - because, as you can see, we will have missing in the rows that do not match the selection (this can be fixed in post-processing currently).

Another approach, but it is less general is:

julia> dff = DataFrame(industry = ["beer", "beer", "wine", "wine"],
                       is_public =[true,false, true,false],
                       firm_revenue = collect(10:10:40))
4×3 DataFrame
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
   2 │ beer          false            20
   3 │ wine           true            30
   4 │ wine          false            40

julia> gdf = groupby(dff, :industry)
GroupedDataFrame with 2 groups based on key: industry
First Group (2 rows): industry = "beer"
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
   2 │ beer          false            20
⋮
Last Group (2 rows): industry = "wine"
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ wine           true            30
   2 │ wine          false            40

julia> transform!(gdf, [:firm_revenue, :is_public] => ((a, b) -> sum(a[b])) => :revenue_of_all_public_firms) # this would be shorter with DataFramesMeta.jl
4×4 DataFrame
 Row │ industry  is_public  firm_revenue  revenue_of_all_public_firms
     │ String    Bool       Int64         Int64
─────┼────────────────────────────────────────────────────────────────
   1 │ beer           true            10                           10
   2 │ beer          false            20                           10
   3 │ wine           true            30                           30
   4 │ wine          false            40                           30

Feb 21 '23 07:02 bkamins

Yes, that was my point. I use combine + leftjoin! to avoid having missing values, and because methods like the second example are less general and only applicable to specific cases. My solution is simple in terms of syntax, but really slow. That's why I thought that update could handle this.

I was thinking along the lines of using:

create view of df with some condition
create gdf with groups
use combine and get gdf_combine
update(parent_df, gdf_combine, on=groups).
It'd be like filling fill the parent df for each combination of groups.

These operations are quite common I think. Basically, you put all industry information in columns to compute each firm's result relative to industry measures (e.g., a market share, although computing market shares can obviously be done in a simpler way than all this).

Maybe update needs to handle more general cases, and this should be done through a separate function. Just wanted to comment this, so you could have it in mind.

Thanks!!!

Feb 21 '23 07:02 alfaromartino

DataFrames.jl DataFrames.jl copied to clipboard

Merge two DataFrames only to missing values

DataFrames.jl
DataFrames.jl copied to clipboard