pyjanitor icon indicating copy to clipboard operation
pyjanitor copied to clipboard

[BUG] .drop_duplicate_columns() does not appear to drop multiple column matches

Open zbarry opened this issue 6 years ago • 8 comments

Not fully explored to confirm, but I just had to call the method twice on a DataFrame.

zbarry avatar May 29 '19 14:05 zbarry

I would like to explore and work on this

sallyhong avatar Jul 14 '19 23:07 sallyhong

After checking, this is working as intended. The function only drops the specified nth_index. It only drops one column at a time per call.

def drop_duplicate_columns(
    df: pd.DataFrame, column_name, nth_index: int = 0
)

If we would like to modify the function so that it drops multiple at once, we can follow the pd.drop_duplicates() and follow after the keep parameter.

https://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html

Another thing to consider is, are we dropping just by name duplicate only? or should we check is the column values are also duplicates?

sallyhong avatar Jul 14 '19 23:07 sallyhong

Thanks for digging in, @sallyhong!

If we would like to modify the function so that it drops multiple at once, we can follow the pd.drop_duplicates() and follow after the keep parameter.

This would definitely be a good idea. Perhaps we can change nth_index to keep, and then accept one of the following:

  • an integer (following the behaviour of nth)
  • a string (first or last)
  • None

Another thing to consider is, are we dropping just by name duplicate only? or should we check is the column values are also duplicates?

Is it a common situation to see column values that are exactly duplicate? If so, then yes, this would be useful to have. What would an API look like for this?

ericmjl avatar Jul 15 '19 00:07 ericmjl

A sample scenario I could think of using this function would be if .... the data has Person ID, Time, Value, Time, Value, Time, Value... etc. and I would would want the most recent Time, Value (or the first pair)

an integer (following the behaviour of nth) a string (first or last) None

What behavior would you want the function to return if None ?

sallyhong avatar Jul 15 '19 15:07 sallyhong

Right! That's a good example, @sallyhong.

What behavior would you want the function to return if None ?

This was a passing thought that if None, we would simply drop all duplicate columns. In your opinion, would this be too "dangerous" of an option to have? I would probably still default to keeping the first.

ericmjl avatar Jul 15 '19 19:07 ericmjl

Just a thought... And I know I'm coming in a bit late to this, but is the code the expected result we want for drop_duplicate_columns?

I know that the purpose is to find another column with the same name and drop the second instance. However, that column, while using the same name, might have different data altogether (as with the example here). This usually happens when writing a SQL query like the following:

SELECT
sum(something),
sum(something2)
from a_table

The resulting table, depending on the engine, might have both columns named as sum. I wouldn't want that second column dropped from the table if I ran drop_duplicate_columns.

However, if ALL the values in the second column with the same name match all the values in the first, then it makes sense to remove the second.

Basically, I would propose the following:

  • Add another parameter such as rename which will rename the other column to column_name_2 if the names are the same, but the values don't match.
  • Maybe even add a parameter for a threshold if most of the data matches? (i.e. if 99% of the data matches, then drop the second column)
  • Or even, to make it more clear, have a by parameter, so people can drop_duplicate_columns(by='names') or drop_duplicate_columns(by='values')

szuckerman avatar Jul 15 '19 20:07 szuckerman

Hmm, I was using it as a way to do what you were saying in the However paragraph. I can't remember what I was doing, but I think it involved a situation where there was no avoiding 2+ columns generated with different column names but identical data. My desired functionality was simply only keeping one of those columns to remove redundancies.

zbarry avatar Jul 15 '19 20:07 zbarry

Or even, to make it more clear, have a by parameter, so people can drop_duplicate_columns(by='names') or drop_duplicate_columns(by='values')

@szuckerman I really like this particular idea! It reads like plain English, and makes unambiguous by what criteria we are dropping duplicate columns.

@sallyhong and @zbarry, what do you two think about this?

ericmjl avatar Jul 15 '19 23:07 ericmjl