
How to use `.loc` in case of multi-indexed dataframes?

Open Eisbrenner opened this issue 4 years ago • 5 comments

  • Python 3.8.2
  • Modin 0.8.2
  • Pandas 1.1.4

Question

Hi,

In the example below, the Modin version raises an IndexingError when indexing a multi-indexed DataFrame, while pandas itself does not.

def example():
    df = pd.DataFrame(
        [["bar", 1, "1"], ["bar", 2, "2"], ["foo", 1, "3"], ["foo", 2, "4"]],
        columns=["first", "second", "data"],
    )
    df = df.set_index(["first", "second"])
    print(df.loc[("bar"), slice(None), :])
import pandas as pd
example()
#                data
# first  second     
# bar    1       1
#        2       2
from modin import pandas as pd
example()
# ...
# IndexingError: Too many indexers

What would be the right way of indexing using Modin?

Full error log

---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
~/path/script.py in <module>
      9 example()
     10 from modin import pandas as pd
---> 11 example()

~/path/script.py in example()
      5     )
      6     df = df.set_index(["first", "second"])
----> 7     print(df.loc[("bar"), slice(None), :])
      8 import pandas as pd
      9 example()

~/path/.venv/lib/python3.8/site-packages/modin/pandas/indexing.py in __getitem__(self, key)
    509         if callable(key):
    510             return self.__getitem__(key(self.df))
--> 511         row_loc, col_loc, ndim, self.row_scaler, self.col_scaler = _parse_tuple(key)
    512         if isinstance(row_loc, slice) and row_loc == slice(None):
    513             # If we're only slicing columns, handle the case with `__getitem__`

~/path/.venv/lib/python3.8/site-packages/modin/pandas/indexing.py in _parse_tuple(tup)
    207             col_loc = tup[1]
    208         if len(tup) > 2:
--> 209             raise IndexingError("Too many indexers")
    210     else:
    211         row_loc = tup

IndexingError: Too many indexers

Eisbrenner, Jan 04 '21 17:01

There is a bug in the current loc implementation: it receives a key with three elements and doesn't know how to handle it.

gshimansky, Jan 05 '21 20:01

As a temporary workaround, you can write the row tuple explicitly so that loc gets two arguments instead of three. This works: df.loc[("bar", slice(None)), :].
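
A minimal sketch of that workaround, using the frame from the original report. It is written against plain pandas so it runs without Modin installed; swapping the import for `from modin import pandas as pd` is the intended use, on the strength of the comment above rather than anything verified here. The `pd.IndexSlice` variant is an extra not mentioned in the thread: in pandas it builds the same row tuple, but whether Modin handles it identically is an assumption.

```python
import pandas as pd

# Same data as in the report, indexed by ("first", "second").
df = pd.DataFrame(
    [["bar", 1, "1"], ["bar", 2, "2"], ["foo", 1, "3"], ["foo", 2, "4"]],
    columns=["first", "second", "data"],
).set_index(["first", "second"])

# Row key written as one explicit tuple, column key as the second argument,
# so loc receives two indexers instead of three.
explicit = df.loc[("bar", slice(None)), :]

# pd.IndexSlice builds the same row tuple using slice syntax.
idx = pd.IndexSlice
with_index_slice = df.loc[idx["bar", :], :]

print(explicit)
```

Both expressions select every row whose first level is "bar" and keep both index levels in the result.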

gshimansky, Jan 05 '21 20:01

The following tuples used in df.loc don't work in Modin either and should be included in loc tests:

  • df.loc[("bar", 1)]
  • df.loc[("bar", 1) :]
  • df.loc[("bar"), 1]
  • df.loc[("bar"), 1, :]
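
One possible way to turn the list above into a parity check is sketched below. The names `make_frame`, `LOC_CASES` and `run_cases` are hypothetical helpers, not part of Modin's test suite. Each indexer is evaluated inside a `try` block, so a backend that raises is recorded by exception name instead of aborting the run. Note that in Python `("bar")` is just the string `"bar"`, so the first and third entries pass the same key to loc.

```python
import pandas

def make_frame(pd_module):
    """Build the frame from the report with a pandas-compatible module."""
    df = pd_module.DataFrame(
        [["bar", 1, "1"], ["bar", 2, "2"], ["foo", 1, "3"], ["foo", 2, "4"]],
        columns=["first", "second", "data"],
    )
    return df.set_index(["first", "second"])

# The indexers listed above, wrapped in lambdas so evaluation is deferred.
LOC_CASES = {
    'df.loc[("bar", 1)]': lambda df: df.loc[("bar", 1)],
    'df.loc[("bar", 1) :]': lambda df: df.loc[("bar", 1):],
    'df.loc[("bar"), 1]': lambda df: df.loc[("bar"), 1],
    'df.loc[("bar"), 1, :]': lambda df: df.loc[("bar"), 1, :],
}

def run_cases(pd_module):
    """Return {case: result, or exception type name if the indexer raised}."""
    df = make_frame(pd_module)
    outcomes = {}
    for name, indexer in LOC_CASES.items():
        try:
            outcomes[name] = indexer(df)
        except Exception as exc:
            outcomes[name] = type(exc).__name__
    return outcomes

# Reference outcomes from plain pandas.
pandas_outcomes = run_cases(pandas)
```

Calling `run_cases` with `modin.pandas` and comparing each entry against `pandas_outcomes` would show which cases diverge; how the two results are compared (for example after converting the Modin result to pandas) is left open here.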

gshimansky, Jan 07 '21 17:01

This issue has been mentioned on Modin Discuss. There might be relevant details there:

https://discuss.modin.org/t/support-for-multi-index/193/1

modin-bot, Mar 09 '21 16:03

I am still able to reproduce this on the latest master.

pyrito, Aug 22 '22 22:08