pyjanitor icon indicating copy to clipboard operation
pyjanitor copied to clipboard

[ENH] Multi-dataframe simultaneous merging

Open zbarry opened this issue 6 years ago • 7 comments

See: https://stackoverflow.com/questions/23668427/pandas-three-way-joining-multiple-dataframes-on-columns

Most of the methods proposed are something like a .merge().merge().merge chaining. This seems like a pretty computationally suboptimal way of doing things (as far as I can tell). If you scroll down to the "This is an ideal situation for the join method" solution, that seems like a much better way to go. Maybe there could be a pyjanitor merge_multiple function that emulates the high-level API of merge using join under the hood.

zbarry avatar Jul 11 '19 18:07 zbarry

Found this after going through some old issues ;)

I actually do this often and the most clean solution I’ve found is to have an iterator of DataFrames and use itertools.reduce to reduce using either the pandas join or merge functionality.

szuckerman avatar Jun 17 '20 14:06 szuckerman

@szuckerman do you think you could paste here a strawman sketch of how you've done it? Perhaps it could become just the solution!

ericmjl avatar Jun 18 '20 01:06 ericmjl

The reduce route would be a variant on .merge().merge(), I think. The problem that I didn’t explain very well in hindsight is that this is very inefficient for large lists of dataframes - you create a lot of unnecessary intermediates this way:

merge A, B -> AB
merge AB, C -> ABC
merge ABC, D -> ABCD
...

Imagine A,B,C,D were all large dataframes, and you can see why this would be a slow approach as many temporary dataframes are allocated and the same data is copied over and over again. Instead, with the .join() method, you can join all of them simultaneously.

zbarry avatar Jun 18 '20 22:06 zbarry

True that. MultiIndex joins are possible too, right? I ask because merge allows merging on common columns, whereas I think join assumes that the index must be identical.

Maybe the idea needs to be fleshed out a bit more. I have done this before where:

  • df1 and df2 need to be merged based on columns A, C, D, giving df12,
  • and then df12 needs to be merged with df3 on the basis of columns B, C, D, E.

(Don't ask - the definition of "a sample" was convoluted in that project...)

merge gives a very natural route, though I have yet to try this with joins, so maybe with some cleverness, it'll work too?

ericmjl avatar Jun 18 '20 23:06 ericmjl

Looking through existing issues to see if I can knock them off. This looks appealing and a little mind bending. So, joining multiple dataframes on possibly different column names? I could push a PR on this and see what the team thinks?

samukweku avatar Sep 10 '20 01:09 samukweku

Unassigned myself, so I can get feedback from others, on what the API should look like. Decided not to assume, as the original creators of this issue might have some ideas that I am not privy to.

samukweku avatar Sep 12 '20 23:09 samukweku

Going to nerd snipe @zbarry on this one. I think I'd like to hear what the scope of the function could look like.

ericmjl avatar Sep 13 '20 00:09 ericmjl