Calling df.loc with multiple arguments results in KeyError

Open naren-ponder opened this issue 3 years ago • 7 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Version 11.6.4
  • Modin version (modin.__version__): 0.14.0
  • Python version: Python 3.8.11
  • Code we can use to reproduce:
import modin.pandas as pd
import numpy as np

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df.loc['bar', 'one']

Resulting Error:

KeyError                                  Traceback (most recent call last)
<ipython-input-7-5557f8ed36a3> in <module>
----> 1 df.loc['bar', 'one']

~/Desktop/modin/modin/pandas/indexing.py in __getitem__(self, key)
    636             return self._handle_boolean_masking(row_loc, col_loc)
    637 
--> 638         row_lookup, col_lookup = self._compute_lookup(row_loc, col_loc)
    639         result = super(_LocIndexer, self).__getitem__(row_lookup, col_lookup, ndim)
    640         if isinstance(result, Series):

~/Desktop/modin/modin/pandas/indexing.py in _compute_lookup(self, row_loc, col_loc)
    843                         else axis_loc
    844                     )
--> 845                     raise KeyError(missing_labels)
    846 
    847             if isinstance(axis_lookup, pandas.Index) and not is_range_like(axis_lookup):

KeyError: array(['one'], dtype='<U3')

Expected Output (with pandas):

0    0.395674
1   -0.426304
2    0.273483
3   -0.702982
Name: (bar, one), dtype: float64

Describe the problem

Calling df.loc with multiple arguments makes Modin believe that some labels are missing, so it raises a KeyError.

naren-ponder, Mar 30 '22 16:03

@naren-ponder do you find this behavior strange? I would have expected that the tuple must be passed explicitly to work with the multi-index, like df.loc[('bar', 'one')].

If this behavior is wrong in pandas itself, maybe we should not repeat it?

anmyachev, Mar 31 '22 16:03

@anmyachev The "expected output" section I indicated above is what happens when you run that snippet of code with pandas. So given that we want to mirror the pandas behavior, I think this is a bug that should be fixed. Perhaps I am misunderstanding your question?

naren-ponder, Mar 31 '22 16:03

@naren-ponder In general you are right. But I seem to recall a precedent where we issued a warning telling users that Modin's behavior in a particular case does not match the behavior of pandas, because the pandas behavior is erroneous. @modin-project/modin-core do you remember this case? Or am I mixing something up?

anmyachev, Mar 31 '22 16:03

I looked at the docs: the behavior of pandas in this case is not erroneous. So we definitely need to fix this case.
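
For context, the two spellings cannot be told apart at the Python level: obj[a, b] and obj[(a, b)] both hand the same tuple to __getitem__, so an indexer receives ('bar', 'one') either way. A minimal sketch in plain Python (KeyRecorder is a hypothetical helper for illustration, not part of pandas or Modin):

```python
class KeyRecorder:
    """Hypothetical helper: returns whatever key it is indexed with."""

    def __getitem__(self, key):
        return key


rec = KeyRecorder()

# Two comma-separated arguments inside the brackets ...
two_args = rec["bar", "one"]
# ... and an explicitly parenthesized tuple.
explicit_tuple = rec[("bar", "one")]

# Both calls deliver the same tuple object to __getitem__.
print(two_args == explicit_tuple)  # True
print(type(two_args).__name__)     # tuple
```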

However, the previous question is still relevant.

anmyachev, Apr 01 '22 13:04

I got the same error, so I am upvoting this issue.

alvin-chang, Apr 03 '22 05:04

@anmyachev, if Modin's behavior does not match the pandas behavior, we issue a warning like this one: https://github.com/modin-project/modin/blob/f41432c1c746c6a6186c376594c6c7f7dd24cdb5/modin/core/storage_formats/pandas/query_compiler.py#L2038

YarShev, Apr 05 '22 19:04

@alvin-chang An easy workaround for this issue is to split the lookup into separate .loc calls, one per index level. For instance, in the case listed above, you could do df.loc['bar'].loc['one']. This should unblock you while we work towards putting in a fix.
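
A rough sketch of that workaround next to the direct lookup. It runs against plain pandas so the two results can be compared; the assumption is that with Modin you would swap the import for modin.pandas and use only the chained form. The seeded generator is also an addition, used only to make the run reproducible.

```python
import numpy as np
import pandas as pd  # assumption: replace with `import modin.pandas as pd` when applying the workaround

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
rng = np.random.default_rng(0)  # seeded only for reproducibility
df = pd.DataFrame(rng.standard_normal((8, 4)), index=arrays)

# Direct multi-level lookup (the call that raised KeyError in Modin 0.14.0).
direct = df.loc["bar", "one"]

# Workaround: select the outer level first, then the inner level.
chained = df.loc["bar"].loc["one"]

# Both select the same row; only the Series name differs.
print(np.array_equal(direct.to_numpy(), chained.to_numpy()))
print(direct.name, chained.name)
```

Note that the chained result is named after the inner label only, so code that relies on the Series name being the full (outer, inner) tuple may need to set it explicitly.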

naren-ponder, Apr 20 '22 21:04

This works as of commit 80c7891de7b6754a08d886895a110c2512c88e89.

mvashishtha, Jun 29 '23 19:06