gspread-dataframe
gspread-dataframe copied to clipboard
Allow option in set_with_dataframe and get_as_dataframe to omit redundant higher-level labels in MultiIndex index columns
From an earlier issue #20 , @sam-s said:
In this case, I would like to have a separate column (with a header!) for each level.
I am torn between
foo one
foo two
and
foo one
two
I think there should be an option to select one of them, the default being the 1st variant. The same goes for column indexes.
Just for clarity -- the 2nd variant is NOT an empty cell under foo
but a double cell with foo
.
OK then, let's re-define this as:
Variant 2 - use empty values to avoid redundant higher-level labels on consecutive rows. Thus "empty cell under foo".
Variant 3 - merge all the cells in a higher-level column with same higher-level label. Thus "double cell with foo". Problem with this is that it involves merging cells, which is a separate sheets API mergeCells
request for each group of cells to be merged. They can all be put into a single batchUpdate
request, but it's a whole lot of bookkeeping to perform. Can't promise this variant would be implemented anytime soon.
Incidentally, what would you like the new parameter to set_with_dataframe
to be called to select this/these new variants?
I suggest something like MultiIndex
or handle_MultiIndex
with descriptive str
values as is common in pandas
.
Here's a sample DataFrame, posted by @sam-s and copied here by me:
Describe the bug
Uploading a DataFrame
with MultiIndex
columns should merge cells appropriately
Version and Environment Version of package: 0.3.7 Python interpreter:
Python 3.9.1 (default, Dec 10 2020, 10:36:35)
[Clang 12.0.0 (clang-1200.0.32.27)] on darwin
and Python 3.6.9
on Linux 5.4.67
OS: MacOS, Linux
To Reproduce
df = pd.DataFrame({("a","x"):[1,2,3],("a","y"):[4,5,6],("b","x"):[7,8,9],("b","y"):[1,3,5],"i":[11,11,22],"j":[1,2,2]})
df.set_index(keys=["i","j"],verify_integrity=True, inplace=True)
df.columns = pd.MultiIndex.from_tuples(df.columns)
now, df
is
a b
x y x y
i j
11 1 1 4 7 1
2 2 5 8 3
22 2 3 6 9 5
now upload it to google using set_with_dataframe
.
Expected behavior
I expect a
and b
to appear on the screen once, just like in the printed representation above.
Screenshots
here is what I get:
here is what I want:
Desktop (please complete the following information):
- OS: MacOS
- Browser chrome
- Version 87
I'm going to write out the specs for this requested feature below:
For both set_with_dataframe
and also get_as_dataframe
, a new parameter called handle_MultiIndex
indicates how columns participating in a MultiIndex are either written or parsed. The allowed parameter values are:
-
'repeat'
indicates that columns for upper levels in a MultiIndex should repeat the level values in every cell. This is the default value. -
'blank'
indicates that columns for upper levels in a MultiIndex should leave empty all but the first cell showing a particular value for that level. -
'merge'
indicates that columns for upper levels in a MultiIndex should merge all cells sharing a particular value for the level.
Parsing behavior in get_as_dataframe
will interpret worksheet data in the same manner for both 'blank'
and 'merge'
modes, since when cells are merged the first/leftmost/topmost cell contains the value and the remaining cells in the merge have an empty value. Repeated values in an upper-level column will each be read as distinct values for the MultiIndex level if the parameter value is 'blank'
or 'merge'
. To achieve safe round-trip travel for values between a Google worksheet and a DataFrame, use the same value for handle_MultiIndex
in both the set_with_dataframe
and get_as_dataframe
function calls.
An update, since I've been spending a little time on this feature. The work that remains is to have get_as_dataframe()
correctly build a MultiIndex for both a multiple-column index (indexcols=[...]
in pandas reader) and multiple-row column header (header=[...]
in pandas reader). When handle_MultiIndex
is either 'blank'
or 'merge'
, the parsing logic will need either to repair a MultiIndex riddled with empty level values, or fill in the empty values before passing the data to pandas for parsing. For row indexes, fixing after the fact might work; but for column headers, it seems less likely to work.
Note that the proper representation of MultiIndex columns in #44 is a separate issue that is nonetheless related, because header cells for MultiIndex columns will need to be blanked and cell-merged properly after the adjustments of #44 are implemented.