gspread-dataframe icon indicating copy to clipboard operation
gspread-dataframe copied to clipboard

Allow option in set_with_dataframe and get_as_dataframe to omit redundant higher-level labels in MultiIndex index columns

Open robin900 opened this issue 4 years ago • 7 comments

From an earlier issue #20 , @sam-s said:

In this case, I would like to have a separate column (with a header!) for each level.

I am torn between

foo one
foo two

and

foo one
    two

I think there should be an option to select one of them, the default being the 1st variant. The same goes for column indexes.

robin900 avatar Apr 02 '20 20:04 robin900

Just for clarity -- the 2nd variant is NOT an empty cell under foo but a double cell with foo.

sam-s avatar Apr 02 '20 21:04 sam-s

OK then, let's re-define this as:

Variant 2 - use empty values to avoid redundant higher-level labels on consecutive rows. Thus "empty cell under foo".

Variant 3 - merge all the cells in a higher-level column with same higher-level label. Thus "double cell with foo". Problem with this is that it involves merging cells, which is a separate sheets API mergeCells request for each group of cells to be merged. They can all be put into a single batchUpdate request, but it's a whole lot of bookkeeping to perform. Can't promise this variant would be implemented anytime soon.

Incidentally, what would you like the new parameter to set_with_dataframe to be called to select this/these new variants?

robin900 avatar Apr 02 '20 22:04 robin900

I suggest something like MultiIndex or handle_MultiIndex with descriptive str values as is common in pandas.

sam-s avatar Apr 02 '20 23:04 sam-s

Here's a sample DataFrame, posted by @sam-s and copied here by me:

Describe the bug

Uploading a DataFrame with MultiIndex columns should merge cells appropriately

Version and Environment Version of package: 0.3.7 Python interpreter:

Python 3.9.1 (default, Dec 10 2020, 10:36:35) 
[Clang 12.0.0 (clang-1200.0.32.27)] on darwin

and Python 3.6.9 on Linux 5.4.67 OS: MacOS, Linux

To Reproduce

df = pd.DataFrame({("a","x"):[1,2,3],("a","y"):[4,5,6],("b","x"):[7,8,9],("b","y"):[1,3,5],"i":[11,11,22],"j":[1,2,2]})
df.set_index(keys=["i","j"],verify_integrity=True, inplace=True)
df.columns = pd.MultiIndex.from_tuples(df.columns)

now, df is

      a     b   
      x  y  x  y
i  j            
11 1  1  4  7  1
   2  2  5  8  3
22 2  3  6  9  5

now upload it to google using set_with_dataframe.

Expected behavior

I expect a and b to appear on the screen once, just like in the printed representation above.

Screenshots

here is what I get:

image

here is what I want:

image

Desktop (please complete the following information):

  • OS: MacOS
  • Browser chrome
  • Version 87

robin900 avatar Dec 18 '20 20:12 robin900

I'm going to write out the specs for this requested feature below:

For both set_with_dataframe and also get_as_dataframe, a new parameter called handle_MultiIndex indicates how columns participating in a MultiIndex are either written or parsed. The allowed parameter values are:

  • 'repeat' indicates that columns for upper levels in a MultiIndex should repeat the level values in every cell. This is the default value.
  • 'blank' indicates that columns for upper levels in a MultiIndex should leave empty all but the first cell showing a particular value for that level.
  • 'merge' indicates that columns for upper levels in a MultiIndex should merge all cells sharing a particular value for the level.

Parsing behavior in get_as_dataframe will interpret worksheet data in the same manner for both 'blank' and 'merge' modes, since when cells are merged the first/leftmost/topmost cell contains the value and the remaining cells in the merge have an empty value. Repeated values in an upper-level column will each be read as distinct values for the MultiIndex level if the parameter value is 'blank' or 'merge'. To achieve safe round-trip travel for values between a Google worksheet and a DataFrame, use the same value for handle_MultiIndex in both the set_with_dataframe and get_as_dataframe function calls.

robin900 avatar Nov 30 '21 16:11 robin900

An update, since I've been spending a little time on this feature. The work that remains is to have get_as_dataframe() correctly build a MultiIndex for both a multiple-column index (indexcols=[...] in pandas reader) and multiple-row column header (header=[...] in pandas reader). When handle_MultiIndex is either 'blank' or 'merge', the parsing logic will need either to repair a MultiIndex riddled with empty level values, or fill in the empty values before passing the data to pandas for parsing. For row indexes, fixing after the fact might work; but for column headers, it seems less likely to work.

robin900 avatar Mar 02 '22 20:03 robin900

Note that the proper representation of MultiIndex columns in #44 is a separate issue that is nonetheless related, because header cells for MultiIndex columns will need to be blanked and cell-merged properly after the adjustments of #44 are implemented.

robin900 avatar Mar 15 '22 16:03 robin900