
ENH: allow preserving one of the indexes when merging two DataFrames

Open multimeric opened this issue 3 years ago • 4 comments

Is your feature request related to a problem?

I want to be able to merge two DataFrames, but keep the index of the left one in the final result:

>>> import pandas as pd
>>> import string
>>> df1 = pd.DataFrame({"a": range(5), "b": range(10, 15)}, index=list(string.ascii_lowercase[:5]))
>>> df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})
>>> df1
   a   b
a  0  10
b  1  11
c  2  12
d  3  13
e  4  14
>>> df2
   a  c
0  0  A
1  1  B
2  2  C
3  3  D
4  4  E

The current merge behaviour is to just drop the index entirely:

>>> df1.merge(df2, on="a")
   a   b  c
0  0  10  A
1  1  11  B
2  2  12  C
3  3  13  D
4  4  14  E

Describe the solution you'd like

Add a new parameter, preserve_index, to merge, which accepts "left", "right", or None (the default, which keeps the current behaviour):

DataFrame.merge(preserve_index="left")

In my above example, this would work like:

>>> df1.merge(df2, on="a", preserve_index="left")
   a   b  c
a  0  10  A
b  1  11  B
c  2  12  C
d  3  13  D
e  4  14  E

API breaking implications

None. This is a new parameter, and if it is not provided the API is identical.

Describe alternatives you've considered

It is already possible to work around this by resetting the index, merging, and then setting the original index again, as described here, but this is:

  • More verbose
  • Not intuitive or clear to users (hence the StackOverflow question's popularity)
  • Probably less efficient
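
For reference, the workaround can be sketched roughly as follows, reusing df1 and df2 from the example above. This is only an illustration, and it assumes the left index is unnamed, so that reset_index() stores it in a column called "index"; a named index would end up in a column with that name instead.

```python
import string

import pandas as pd

df1 = pd.DataFrame(
    {"a": range(5), "b": range(10, 15)},
    index=list(string.ascii_lowercase[:5]),
)
df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})

merged = (
    df1.reset_index()         # move the left index into a column named "index"
    .merge(df2, on="a")       # merge on the shared key as usual
    .set_index("index")       # restore the left index labels
    .rename_axis(None)        # drop the leftover "index" name
)
print(merged)
```

That is three extra method calls around the merge itself, which is the verbosity referred to in the list above.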

multimeric avatar Apr 27 '22 05:04 multimeric

Isn't it just as easy to use df1.merge(df2, on="a").set_index("a")? Otherwise we risk introducing features that need to be maintained and tested with further developments when these methods already exist.

Edit: Now I see the end of your post. OK, but I'm -1 on this.

attack68 avatar Apr 29 '22 05:04 attack68

You also have to reset the index to ensure it's a column, and I think the three points above show enough merit to make this worthwhile. A chain of 3 methods versus one method and one parameter is a big improvement.
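
To illustrate the difference with the frames from the original example (a rough sketch, not a benchmark): calling set_index("a") after the merge uses the values of the key column as the index, so the original labels "a" to "e" of df1 are not recovered.

```python
import string

import pandas as pd

df1 = pd.DataFrame(
    {"a": range(5), "b": range(10, 15)},
    index=list(string.ascii_lowercase[:5]),
)
df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})

# Two-step version: the index becomes the values of column "a" (0 to 4).
by_key = df1.merge(df2, on="a").set_index("a")

# The left index labels were discarded by merge and cannot be restored
# unless they were first moved into a column with reset_index().
print(by_key.index.tolist())
print(df1.index.tolist())
```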

multimeric avatar Apr 29 '22 05:04 multimeric

take

Mehgarg avatar Jul 15 '22 02:07 Mehgarg

@multimeric it's fair to give a full response on this since you raise sensible points.

The pandas API is large (too large). My general approach is to not add any args / methods that perform functions that can already be performed. In fact I am in favour of selectively removing / reducing args when multiple ways of performing tasks exist. And my PRs reflect this philosophy.

> Probably less efficient

In the long run, keeping the API small makes the code more maintainable for developers, and likely improves performance, since the core methods can be optimised for general tasks rather than for selective, individual cases or specific ways of handling args. This is important for the longevity and future development of pandas.

> More verbose

This is subjective. Personally I strive for an atomised code construction. In software development I prefer using core methods rather than subtle args to avoid the operational risk of arg deprecation. merge and set_index are core methods so are unlikely to be restructured, so I would favour chaining these, especially where merge is such a complex method in terms of combinatorial challenges.

> Not intuitive or clear to users

Fully agree. I think adding use cases like this to the documentation and cookbooks is valuable, and we should work to provide better examples that users can copy, with the confidence that the pandas team considers them the "most efficient" way. This is a development item and something we need to do better.

Sorry I don't support your idea, hope you appreciate my feedback.

attack68 avatar Aug 05 '22 08:08 attack68