cudf
cudf copied to clipboard
[BUG] .at accessor for new column fails in cudf.pandas
Describe the bug
I was helping a user Amith with a question about cudf.pandas and the library pandapower (see RAPIDS GoAI Slack). The pandapower notebook examples trigger a bug in cudf.pandas around the .at accessor. @Jacey0 helped debug this with me during PyCon 2024. Thanks Jacey!
Steps/Code to reproduce bug
Here is a Colab notebook with a reproducer: https://colab.research.google.com/gist/bdice/f5be320dada30671015d5b53642ec19c/solution-test.ipynb
Here is the minimal code snippet needed to reproduce:
%load_ext cudf.pandas
import pandas as pd
bus = pd.DataFrame({'name': []})
entries = {'name': 'Bus 0', 'geo': 'X and Y'}
for col, val in entries.items():
print("Setting", col, "to", val, end="\n\n")
print("Before:")
print(bus, end="\n\n")
bus.at[0, col] = val # pandas adds a new row and 'geo' column' fine
print("After:")
print(bus, end="\n\n")
Error output:
Setting name to Bus 0
Before:
Empty DataFrame
Columns: [name]
Index: []
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/cudf/core/series.py in _loc_to_iloc(self, arg)
381 if (n := len(indices)) == 0:
--> 382 raise KeyError("Label scalar is out of bounds")
383 elif n == 1:
KeyError: 'Label scalar is out of bounds'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
15 frames
KeyError: 'Label scalar is out of bounds'
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
IndexError: Index out of bounds
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in __setitem__(self, key, value)
843 else:
844 key = com.apply_if_callable(key, self.obj)
--> 845 indexer = self._get_setitem_indexer(key)
846 self._has_valid_setitem_indexer(key)
847
AttributeError: '_AtIndexer' object has no attribute '_get_setitem_indexer'
Expected behavior Here is the output from plain pandas:
Setting name to Bus 0
Before:
Empty DataFrame
Columns: [name]
Index: []
After:
name
0 Bus 0
Setting geo to X and Y
Before:
name
0 Bus 0
After:
name geo
0 Bus 0 X and Y
It appears the cudf.pandas bug is that _AtIndexer (and _iAtIndexer) indexer are not defined in _wrappers/pandas.py as proxy objects.
And of course, it appears cudf.DataFrame.loc.__setitem__ that expands is also buggy
More investigation notes:
So cudf.DataFrame.at is just returns cudf.DataFrame.loc, so a pandas.DataFrame.at call will always succeed to return cudf.pandas' intermediate proxy of pandas.DataFrame.loc.
So if pandas.DataFrame.at.__setitem__ fails in cudf.pandas, we rewind to pandas.DataFrame.loc.__setitem__ instead of pandas.DataFrame.at.__setitem__.
I think we need pandas.DataFrame.at to be an IntermediateProxy of an IntermediateProxy as "proper" fix. I'm not sure if there's an straightforward way to do this today.
Alternatively, if we make cudf.DataFrame.at not map to cudf.DataFrame.loc (i.e. return a different object with the same implementation), that would be an easier fix...
Alternatively, if we make
cudf.DataFrame.atnot map tocudf.DataFrame.loc(i.e. return a different object with the same implementation), that would be an easier fix...
That sounds like the right thing to do. cudf's object model should mirror that of pandas. If at returns loc in cudf but not pandas, that's a problem in the mirrored object model.