ENH: Add parameter `index` to `drop_duplicates` to drop duplicate indices
Feature Type
- [ ] Adding new functionality to pandas
- [X] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
There currently is no elegant pattern to drop duplicate indices.
I think what people usually do is
`df[~df.index.duplicated(keep='first')]`
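For reference, a minimal runnable version of that workaround:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 1])

# Keep only the first row for each duplicated index label.
deduped = df[~df.index.duplicated(keep='first')]
print(deduped)
#    a
# 0  1
# 1  2
```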
Feature Description
Add a new parameter to `drop_duplicates` to specify dropping duplicate indices.
An option could be an `index=True` to do this, similar to when merging on an index.
Alternative Solutions
Allow the `subset` parameter of `drop_duplicates` to accept the name of the index.
Additional Context
No response
Hi, I would like to work on this issue. I'll start implementing the feature and submit a PR soon.
Thanks for the suggestion! I think this feature will be useful so I'm ok with it being added.
If I understood you correctly: if someone passes `index=True`, then duplicate indices will be dropped alongside duplicate rows. But what if someone wants to drop duplicate indices only?
This is how it could look if an `index` parameter is added:
| Description | Parameters |
|---|---|
| Drop duplicate indices only | `subset=None, index=True` |
| Drop duplicate columns only | `subset=None, index=False` or `subset=colnames, index=False` |
| Drop mixed columns and index | `subset=colnames, index=True` |
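To make the table concrete, here is a rough sketch of how each combination could be emulated with today's API (hypothetical, since the `index` parameter does not exist yet; the mixed case is emulated by temporarily materialising the index as a column):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2]}, index=[0, 1, 1])
colnames = ["a"]

# subset=None, index=True  -> duplicate index labels only
only_index = df[~df.index.duplicated(keep='first')]

# subset=None, index=False -> current behaviour over all columns
all_columns = df.drop_duplicates()

# subset=colnames, index=False -> current behaviour over selected columns
some_columns = df.drop_duplicates(subset=colnames)

# subset=colnames, index=True -> index label plus selected columns
# (the restored index ends up named "index"; a real implementation
# would preserve the original index name)
mixed = (
    df.reset_index()
      .drop_duplicates(subset=["index"] + colnames)
      .set_index("index")
)
```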
It turned out a bit more complex than I expected. Basically, `index` is set to `True` any time you want to drop duplicate indices.
A simpler alternative might be to accept the index name in the `subset` parameter:
| Description | Parameters |
|---|---|
| Drop duplicate indices only | `subset=index.name` |
| Drop duplicate columns only | `subset=None` or `subset=colnames` |
| Drop mixed columns and index | `subset=[index.name, *colnames]` |
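One caveat of this alternative is that the index needs a name before it can be passed to `subset`. A minimal emulation with today's API, assuming a named index, could look like:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2]}, index=pd.Index([0, 1, 1], name="id"))
colnames = ["a"]

# Emulate subset=[index.name, *colnames] by temporarily turning the
# named index into an ordinary column.
deduped = (
    df.reset_index()
      .drop_duplicates(subset=[df.index.name, *colnames])
      .set_index(df.index.name)
)
```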
My current implementation drops duplicate indices only. However, my take on an alternative would be adding two parameters:
- `index`, which drops duplicates alongside the other rows
- `index_only` (or any other naming), which would then drop only duplicate indices
So if `subset=colnames` and `index=True`, will that drop rows that have duplicate indices and then drop duplicate rows separately, or will that drop duplicate rows taking their index into consideration?
For example:
- row1: index = 0, values = [1, 2]
- row2: index = 1, values = [1, 2]
- row3: index = 1, values = [0, 1]

There are 2 cases here (sketched in code below):
- Case 1: check duplicate indices --> row3 is removed, AND THEN check duplicate values --> row2 is removed (only row1 is left).
- Case 2: check duplicate values WITH the same index --> no row is removed.
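A small sketch of the two cases, emulated with existing calls on the example rows above:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 0], "y": [2, 2, 1]}, index=[0, 1, 1])
# row1: index 0, values [1, 2]
# row2: index 1, values [1, 2]
# row3: index 1, values [0, 1]

# Case 1: drop duplicate indices first, then drop duplicate values.
case1 = df[~df.index.duplicated(keep='first')].drop_duplicates()
# row3 goes in the index pass, row2 in the value pass -> only row1 is left.

# Case 2: treat the index as just another value of the row.
case2 = df.reset_index().drop_duplicates().set_index("index")
# Every (index, values) combination is unique -> no row is removed.
```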
I'm personally leaning towards option 2, which basically treats the index as a value of the row.
@Yousinator Not a fan of having 2 parameters, that will be confusing with the existing `ignore_index` parameter.
> @Yousinator Not a fan of having 2 parameters, that will be confusing with the existing `ignore_index` parameter.
My final take would be having a string value rather than a bool value for the parameter. It would be a bit confusing, but would be easier and simpler than the `subset` / `index` combination.
If we were to go for the `subset` / `index` combination, I would prefer the second option too.
If going with the second option, we could rearrange the indices at the end if duplicate indices exist with different values.
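For reference, `drop_duplicates` already exposes `ignore_index=True`, which relabels the surviving rows 0..n-1 and is roughly the kind of rearranging described here:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2]}, index=[5, 5, 7])

# ignore_index=True relabels the result 0..n-1 after dropping duplicates.
print(df.drop_duplicates(ignore_index=True))
#    a
# 0  1
# 1  2
```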
> I'm personally leaning towards option 2, which basically treats the index as a value of the row.
Same, I think this is what most people would expect.