[BUG] .drop_duplicate_columns() does not appear to drop multiple column matches
Not fully explored to confirm, but I just had to call the method twice on a DataFrame.
I would like to explore and work on this
After checking, this is working as intended.
The function only drops the specified nth_index. It only drops one column at a time per call.
def drop_duplicate_columns(
df: pd.DataFrame, column_name, nth_index: int = 0
)
If we would like to modify the function so that it drops multiple at once, we can follow the pd.drop_duplicates() and follow after the keep parameter.
https://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html
Another thing to consider is, are we dropping just by name duplicate only? or should we check is the column values are also duplicates?
Thanks for digging in, @sallyhong!
If we would like to modify the function so that it drops multiple at once, we can follow the pd.drop_duplicates() and follow after the keep parameter.
This would definitely be a good idea. Perhaps we can change nth_index to keep, and then accept one of the following:
- an integer (following the behaviour of
nth) - a string (
firstorlast) - None
Another thing to consider is, are we dropping just by name duplicate only? or should we check is the column values are also duplicates?
Is it a common situation to see column values that are exactly duplicate? If so, then yes, this would be useful to have. What would an API look like for this?
A sample scenario I could think of using this function would be if .... the data has
Person ID, Time, Value, Time, Value, Time, Value... etc.
and I would would want the most recent Time, Value (or the first pair)
an integer (following the behaviour of nth) a string (first or last) None
What behavior would you want the function to return if None ?
Right! That's a good example, @sallyhong.
What behavior would you want the function to return if
None?
This was a passing thought that if None, we would simply drop all duplicate columns. In your opinion, would this be too "dangerous" of an option to have? I would probably still default to keeping the first.
Just a thought... And I know I'm coming in a bit late to this, but is the code the expected result we want for drop_duplicate_columns?
I know that the purpose is to find another column with the same name and drop the second instance. However, that column, while using the same name, might have different data altogether (as with the example here). This usually happens when writing a SQL query like the following:
SELECT
sum(something),
sum(something2)
from a_table
The resulting table, depending on the engine, might have both columns named as sum. I wouldn't want that second column dropped from the table if I ran drop_duplicate_columns.
However, if ALL the values in the second column with the same name match all the values in the first, then it makes sense to remove the second.
Basically, I would propose the following:
- Add another parameter such as
renamewhich will rename the other column tocolumn_name_2if the names are the same, but the values don't match. - Maybe even add a parameter for a threshold if most of the data matches? (i.e. if 99% of the data matches, then drop the second column)
- Or even, to make it more clear, have a
byparameter, so people candrop_duplicate_columns(by='names')ordrop_duplicate_columns(by='values')
Hmm, I was using it as a way to do what you were saying in the However paragraph. I can't remember what I was doing, but I think it involved a situation where there was no avoiding 2+ columns generated with different column names but identical data. My desired functionality was simply only keeping one of those columns to remove redundancies.
Or even, to make it more clear, have a by parameter, so people can drop_duplicate_columns(by='names') or drop_duplicate_columns(by='values')
@szuckerman I really like this particular idea! It reads like plain English, and makes unambiguous by what criteria we are dropping duplicate columns.
@sallyhong and @zbarry, what do you two think about this?