cudf Remove partial support for duplicate MultiIndex names unless they are all None

Currently our MultiIndex class supports duplicate names, while DataFrames do not. The MultiIndex support is buggy, however, and we are frequently finding new edge cases that break it. Since pandas DataFrames do support duplicate names and we explicitly choose not to, I think it makes sense to do the same for MultiIndex. It improves our internal consistency and helps us write much more robust code. Making this change would probably fix a number of currently unknown/hidden bugs.

The major caveat here is that we do need to support MultiIndexes where all the names are None. However, handling this case would potentially be much simpler since we could use a sentinel or another class attribute to track whether names are meaningful or not. Default names could be integers, and any setting of names would require setting all column names to unique values.
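To make the all-None proposal concrete, here is a minimal sketch in plain Python. The helper `resolve_level_names` and its `(names, meaningful)` return value are hypothetical and are not existing cudf API; the sketch also assumes that a partially-None list is only subject to the uniqueness check.

```python
from typing import Hashable, Optional, Sequence, Tuple


def resolve_level_names(
    names: Optional[Sequence[Hashable]], nlevels: int
) -> Tuple[tuple, bool]:
    """Return (internal_names, names_are_meaningful).

    Hypothetical helper, not part of cudf. If every name is None, fall
    back to integer defaults and record that the names carry no meaning.
    Otherwise require the names to be unique.
    """
    if names is None:
        return tuple(range(nlevels)), False
    if len(names) != nlevels:
        raise ValueError(f"expected {nlevels} names, got {len(names)}")
    if all(name is None for name in names):
        # Unnamed MultiIndex: integer placeholders keep the keys unique.
        return tuple(range(nlevels)), False
    if len(set(names)) != len(names):
        raise ValueError("level names must be unique unless they are all None")
    return tuple(names), True


print(resolve_level_names([None, None], 2))
print(resolve_level_names(["a", "b"], 2))
```

With this shape, the boolean flag plays the role of the sentinel: user-facing `names` would report `None` for every level whenever the flag is false, while the integer placeholders stay internal.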

An aside: If we ever did want to support duplicate names properly, it would involve a refactoring at the level of ColumnAccessor, which currently uses a dictionary as the underlying data structure to map names to columns. We would then need to update all of our functions that rely on _from_data to populate a new object that could support duplicate names rather than a dictionary. This is a substantial undertaking and out of scope for this issue.
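As a quick illustration of why a dictionary cannot hold duplicate names, consider this plain-Python sketch. It uses lists as stand-ins for columns and does not touch the real ColumnAccessor.

```python
# Stand-in "columns"; in cudf these would be column objects.
names = ["a", "a", "b"]
columns = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Building a name -> column mapping silently keeps only the last "a".
mapping = dict(zip(names, columns))

print(len(mapping))  # two entries survive out of three columns
print(mapping["a"])  # the first "a" column has been overwritten
```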

Mar 23 '22 19:03 vyasr

CC @shwina @galipremsagar

Mar 23 '22 19:03 vyasr

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Apr 22 '22 20:04 github-actions[bot]

The alternative resolution to this issue would be if we changed the ColumnAccessor to use a list of columns and a list of names instead of maintaining a dictionary of name:column mappings. If we made that change, then both DataFrame and MultiIndex could support duplicate names with only a little bit of additional logic to ensure that accessor methods behave in the expected way when duplicate names exist. However, any such changes to the ColumnAccessor will be motivated by more pressing concerns than supporting duplicate names (overall package performance, stability, and robustness), so we shouldn't rush to any solutions just to solve the duplicate names problem.
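A rough sketch of what a list-backed accessor could look like follows. The class `ListBackedAccessor` and its methods are hypothetical names chosen for illustration, not the actual ColumnAccessor design, and plain lists stand in for columns.

```python
from typing import Any, Hashable, List, Sequence


class ListBackedAccessor:
    """Hypothetical accessor that stores names and columns as parallel lists."""

    def __init__(self, names: Sequence[Hashable], columns: Sequence[Any]):
        if len(names) != len(columns):
            raise ValueError("names and columns must have the same length")
        self.names: List[Hashable] = list(names)
        self.columns: List[Any] = list(columns)

    @property
    def has_duplicates(self) -> bool:
        return len(set(self.names)) != len(self.names)

    def positions(self, name: Hashable) -> List[int]:
        """All positions whose name matches, in order."""
        return [i for i, n in enumerate(self.names) if n == name]

    def select(self, name: Hashable) -> List[Any]:
        """Return every column with this name; raise KeyError if none."""
        found = self.positions(name)
        if not found:
            raise KeyError(name)
        return [self.columns[i] for i in found]


acc = ListBackedAccessor(["a", "a", "b"], [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(acc.has_duplicates)
print(acc.select("a"))
```

The "little bit of additional logic" would then live in methods like `select`, which have to decide what to return when a name matches more than one column. The trade-off in this sketch is that name lookup becomes a linear scan instead of a dictionary lookup.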

Apr 25 '22 14:04 vyasr

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Aug 23 '22 16:08 github-actions[bot]

@mroeschke do you know if there are any plans to change the duplicate names support in pandas? There are a lot of ways in which it's kind of broken to allow this since basic operations stop working if there are duplicate names, so this seems like an API improvement that we could suggest in pandas itself (excepting the MultiIndex all None case; we may still need to support that since a MultiIndex without names is probably the most common case).

May 17 '24 14:05 vyasr

We could propose disallowing duplicate names, but I doubt there would be much appetite to disallow them.

I don't recall seeing many bug reports over the years caused by a MultiIndex having duplicate names, since the names are essentially metadata (carried around as an immutable list) and are not used in any meaningful way in MultiIndex operations.

May 17 '24 16:05 mroeschke

Sorry, I should clarify. I wasn't only thinking about MultiIndex objects, but also DataFrame objects. For example, pandas lets you do this:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

In [3]: df.columns = ["a", "a", "b"]

In [4]: df
Out[4]:
   a  a  b
0  1  4  7
1  2  5  8
2  3  6  9

That certainly has impacts on various downstream operations and leads to odd-looking failures, e.g.

In [5]: df.groupby("a").sum()
...
ValueError: Grouper for 'a' not 1-dimensional
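The failure comes from the fact that selecting a duplicated label returns a two-dimensional DataFrame rather than a one-dimensional Series. A small sketch using only standard pandas calls shows this, along with one common way to detect and drop the duplicates; keeping the first occurrence is just one possible policy.

```python
import pandas as pd

df = pd.DataFrame([[1, 4, 7], [2, 5, 8], [3, 6, 9]], columns=["a", "a", "b"])

# Selecting a duplicated label yields a DataFrame with both "a" columns.
selected = df["a"]
print(type(selected).__name__, selected.shape)

# Detect duplicate labels before running operations that assume uniqueness.
print(df.columns.is_unique)

# Keep only the first occurrence of each label.
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))
```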

May 20 '24 23:05 vyasr

Ah, I see. I think this would be a tough sell too, since a lot of APIs were developed over time to handle duplicate columns. I suspect the main motivation was to "gracefully" support IO use cases (CSVs) with duplicate headers.

There has been an ask to make column labels unique by default, https://github.com/pandas-dev/pandas/issues/53217, but also a larger discussion at one point about making the handling of duplicate columns consistent, https://github.com/pandas-dev/pandas/issues/47718. So I think there's greater appetite for making the behavior of duplicate labels consistent than for disallowing them.

May 21 '24 00:05 mroeschke

OK, got it. That is very helpful context, thanks! If that is the case and there is real interest in this in pandas, then we may have to rethink cudf's plans around duplicate names and issues like #13273.

May 21 '24 00:05 vyasr