
ENH: Add parameter `index` to `drop_duplicates` to drop duplicate indices

Open bingbong-sempai opened this issue 1 year ago • 1 comments

Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

There currently is no elegant pattern to drop duplicate indices.
I think what people usually do is `df[~df.index.duplicated(keep='first')]`.
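For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of it on a small made-up frame (the data and the variable names `df` and `deduped` are illustrative assumptions, not from the issue):

```python
import pandas as pd

# Illustrative data: the index label 1 appears twice.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[0, 1, 1])

# Index.duplicated marks every repeat of a label after its first occurrence;
# negating the mask keeps only the first row for each label.
deduped = df[~df.index.duplicated(keep="first")]
```

This works, but it bypasses `drop_duplicates` entirely, which is the inconvenience the issue describes.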

Feature Description

Add a new parameter to `drop_duplicates` to specify dropping duplicate indices.
One option would be an `index=True` flag, similar to the index flags used when merging on an index.

Alternative Solutions

Allow the subset parameter of drop_duplicates to accept the name of the index.

Additional Context

No response

bingbong-sempai avatar May 09 '24 08:05 bingbong-sempai

Hi, I would like to work on this issue. I'll start implementing the feature and submit a PR soon.

Yousinator avatar Jun 25 '24 02:06 Yousinator

Thanks for the suggestion! I think this feature will be useful so I'm ok with it being added.

If I understood you correctly, if someone passes index=True then duplicate indices will be dropped alongside duplicate rows, but what if someone wants to drop duplicate indices only?

Aloqeely avatar Jul 01 '24 22:07 Aloqeely

This is what it could look like if an index parameter is added:

| Description | Parameters |
| --- | --- |
| Drop duplicate indices only | subset=None, index=True |
| Drop duplicate columns only | subset=None, index=False |
| | subset=colnames, index=False |
| Drop mixed columns and index | subset=colnames, index=True |

It turned out a bit more complex than I expected. Basically index is set to True any time you want to drop duplicate indices.

A simpler alternative might be to accept the index name in the subset parameter:

| Description | Parameters |
| --- | --- |
| Drop duplicate indices only | subset=index.name |
| Drop duplicate columns only | subset=None |
| | subset=colnames |
| Drop mixed columns and index | subset=[index.name, *colnames] |
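As far as I know, `subset` only accepts column labels today, so the second proposal can only be approximated. The sketch below emulates it with existing pandas calls by temporarily turning the index into a column. The helper name `drop_duplicates_with_index`, the index name `"key"`, and the sample data are all hypothetical and only for illustration; this is not how pandas would necessarily implement it.

```python
import pandas as pd

# Illustrative data with a named index; "x" appears twice as an index label.
df = pd.DataFrame(
    {"a": [1, 2, 1]},
    index=pd.Index(["x", "x", "y"], name="key"),
)


def drop_duplicates_with_index(frame, subset):
    """Hypothetical helper: like drop_duplicates, but `subset` may also
    contain the index name. Assumes a single, named index whose name does
    not clash with a column name."""
    flat = frame.reset_index()  # the index becomes an ordinary column
    keep = ~flat.duplicated(subset=subset, keep="first")
    # Apply the positional boolean mask to the original frame so the
    # original index is preserved.
    return frame[keep.to_numpy()]


index_only = drop_duplicates_with_index(df, ["key"])   # subset=index.name
columns_only = drop_duplicates_with_index(df, ["a"])   # subset=colnames
mixed = drop_duplicates_with_index(df, ["key", "a"])   # subset=[index.name, *colnames]
```

With this data, `index_only` keeps one row per index label, `columns_only` keeps one row per value of `a`, and `mixed` keeps all three rows because no (index, `a`) pair repeats.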

bingbong-sempai avatar Jul 02 '24 03:07 bingbong-sempai

My current implementation is to drop indices only. However, my take on an alternative would be adding two parameters:

  1. index which drops duplicates alongside the other rows
  2. index_only - or any other naming - which would then drop only duplicate indices.

Yousinator avatar Jul 02 '24 12:07 Yousinator

So if subset=colnames and index=True, will that drop rows that have duplicate indices and then drop duplicate rows separately, or will that drop duplicate rows taking their index into consideration? For example:

  • row1 has index = 0, values of [1, 2]
  • row2 has index = 1, values of [1, 2]
  • row3 has index = 1, values of [0, 1]

There are 2 cases here:

  1. check duplicate indices --> removed row3 AND THEN check duplicate values --> removed row2 (Only row1 is left)
  2. check duplicate values WITH same index --> no row removed.

I'm personally leaning towards option 2 which basically treats the index as a value of the row
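The two cases can be reproduced with existing pandas calls on the three example rows. This is only a sketch of the semantics under discussion: the column names `a` and `b` and the variable names are assumptions, and the proposed `index` parameter does not exist yet.

```python
import pandas as pd

# row1: index 0, [1, 2]; row2: index 1, [1, 2]; row3: index 1, [0, 1]
df = pd.DataFrame(
    [[1, 2], [1, 2], [0, 1]],
    index=[0, 1, 1],
    columns=["a", "b"],
)

# Option 1: drop duplicate index labels first, then drop duplicate values.
option1 = df[~df.index.duplicated(keep="first")].drop_duplicates()

# Option 2: treat the index as one more value of the row, so a row is a
# duplicate only if both its index label and its values repeat.
option2 = df[~df.reset_index().duplicated().to_numpy()]
```

Here `option1` ends up with only row1, while `option2` keeps all three rows, matching the two outcomes described in the list.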

Aloqeely avatar Jul 02 '24 13:07 Aloqeely

@Yousinator Not a fan of having 2 parameters, that will be confusing with the existing ignore_index parameter.

Aloqeely avatar Jul 02 '24 13:07 Aloqeely

> @Yousinator Not a fan of having 2 parameters, that will be confusing with the existing ignore_index parameter.

My final take would be having a string value rather than a bool value for the parameter. It would be a bit confusing, but would be easier and simpler than the subset / index combination.

If we were to go for the subset / index combination, I would prefer the second option too.

If going with the second option, we could rearrange the indices at the end if duplicate indices exist with different values.

Yousinator avatar Jul 02 '24 13:07 Yousinator

> I'm personally leaning towards option 2 which basically treats the index as a value of the row

Same, I think this is what most people would expect.

bingbong-sempai avatar Jul 03 '24 01:07 bingbong-sempai