ENH: Add parameter `index` to `drop_duplicates` to drop duplicate indices
Feature Type
- [ ] Adding new functionality to pandas
- [X] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
There currently is no elegant pattern to drop duplicate indices.
I think what people usually do is
`df[~df.index.duplicated(keep='first')]`
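For reference, a minimal runnable version of that workaround:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 1])

# Keep only the first row for each duplicated index label.
deduped = df[~df.index.duplicated(keep='first')]
print(deduped)
#    a
# 0  1
# 1  2
```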
Feature Description
Add a new parameter to `drop_duplicates` to specify dropping duplicate indices.
An option could be an `index=True` to do this, similar to when merging on an index.
Alternative Solutions
Allow the `subset` parameter of `drop_duplicates` to accept the name of the index.
Additional Context
No response
Hi, I would like to work on this issue. I'll start implementing the feature and submit a PR soon.
Thanks for the suggestion! I think this feature will be useful so I'm ok with it being added.
If I understood you correctly: if someone passes `index=True`, then duplicate indices will be dropped alongside duplicate rows. But what if someone wants to drop duplicate indices only?
This is how it could look if an `index` parameter is added:
| Description | Parameters |
|---|---|
| Drop duplicate indices only | `subset=None, index=True` |
| Drop duplicate columns only | `subset=None, index=False` or `subset=colnames, index=False` |
| Drop mixed columns and index | `subset=colnames, index=True` |
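To make the table concrete, here is a rough sketch of how each combination could be emulated with today's API (hypothetical, since the `index` parameter does not exist yet; the mixed case is emulated by temporarily materialising the index as a column):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2]}, index=[0, 1, 1])
colnames = ["a"]

# subset=None, index=True  -> duplicate index labels only
only_index = df[~df.index.duplicated(keep='first')]

# subset=None, index=False -> current behaviour over all columns
all_columns = df.drop_duplicates()

# subset=colnames, index=False -> current behaviour over selected columns
some_columns = df.drop_duplicates(subset=colnames)

# subset=colnames, index=True -> index label plus selected columns
# (the restored index ends up named "index"; a real implementation
# would preserve the original index name)
mixed = (
    df.reset_index()
      .drop_duplicates(subset=["index"] + colnames)
      .set_index("index")
)
```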
It turned out a bit more complex than I expected. Basically, `index` is set to `True` any time you want to drop duplicate indices.
A simpler alternative might be to accept the index name in the `subset` parameter:
| Description | Parameters |
|---|---|
| Drop duplicate indices only | `subset=index.name` |
| Drop duplicate columns only | `subset=None` or `subset=colnames` |
| Drop mixed columns and index | `subset=[index.name, *colnames]` |
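One caveat of this alternative is that the index needs a name before it can be passed to `subset`. A minimal emulation with today's API, assuming a named index, could look like:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2]}, index=pd.Index([0, 1, 1], name="id"))
colnames = ["a"]

# Emulate subset=[index.name, *colnames] by temporarily turning the
# named index into an ordinary column.
deduped = (
    df.reset_index()
      .drop_duplicates(subset=[df.index.name, *colnames])
      .set_index(df.index.name)
)
```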
My current implementation drops duplicate indices only. However, my take on an alternative would be adding two parameters:
- `index`, which drops duplicates alongside the other rows
- `index_only` (or any other naming), which would then drop only duplicate indices
So if `subset=colnames` and `index=True`, will that drop rows that have duplicate indices and then drop duplicate rows separately, or will that drop duplicate rows taking their index into consideration?
For example:
- row1: index = 0, values = [1, 2]
- row2: index = 1, values = [1, 2]
- row3: index = 1, values = [0, 1]

There are 2 cases here (sketched in code below):
- Case 1: check duplicate indices --> row3 is removed, AND THEN check duplicate values --> row2 is removed (only row1 is left).
- Case 2: check duplicate values WITH the same index --> no row is removed.
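A small sketch of the two cases, emulated with existing calls on the example rows above:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 0], "y": [2, 2, 1]}, index=[0, 1, 1])
# row1: index 0, values [1, 2]
# row2: index 1, values [1, 2]
# row3: index 1, values [0, 1]

# Case 1: drop duplicate indices first, then drop duplicate values.
case1 = df[~df.index.duplicated(keep='first')].drop_duplicates()
# row3 goes in the index pass, row2 in the value pass -> only row1 is left.

# Case 2: treat the index as just another value of the row.
case2 = df.reset_index().drop_duplicates().set_index("index")
# Every (index, values) combination is unique -> no row is removed.
```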
I'm personally leaning towards option 2, which basically treats the index as a value of the row.
@Yousinator Not a fan of having 2 parameters, that will be confusing with the existing `ignore_index` parameter.
> @Yousinator Not a fan of having 2 parameters, that will be confusing with the existing `ignore_index` parameter.
My final take would be having a string value rather than a bool value for the parameter. It would be a bit confusing, but would be easier and simpler than the `subset` / `index` combination.
If we were to go for the `subset` / `index` combination, I would prefer the second option too.
If going with the second option, we could rearrange the indices at the end if duplicate indices exist with different values.
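For reference, `drop_duplicates` already exposes `ignore_index=True`, which relabels the surviving rows 0..n-1 and is roughly the kind of rearranging described here:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2]}, index=[5, 5, 7])

# ignore_index=True relabels the result 0..n-1 after dropping duplicates.
print(df.drop_duplicates(ignore_index=True))
#    a
# 0  1
# 1  2
```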
> I'm personally leaning towards option 2, which basically treats the index as a value of the row.
Same, I think this is what most people would expect.