arkouda
Match `Series` indexing behavior to Pandas
Backing issue: #2898
After 1/8 and 1/10 discussions, implementing core indexing methods for Series
objects to match Pandas behavior. Methods to implement:
- [x] `__getitem__`
- [x] `__setitem__`
- [x] `loc`
- [x] `iloc`
- [x] `at`
- [x] `iat`
Part of the challenge here will be supporting the bracket-based argument lists for the latter four. These "methods" will probably need to be fields of the `Series` class that return a special locator object. The locator object will have the functionality implemented in its `[]` operator.
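A minimal sketch of the locator-object idea, assuming a toy class rather than the real arkouda `Series`. The names `ToySeries` and `_ILocIndexer` are hypothetical and exist only for illustration; the point is that `iloc` is a property returning an object whose `[]` operator does the work.

```python
class _ILocIndexer:
    """Hypothetical locator: keeps a reference to its series and implements []."""

    def __init__(self, series):
        self._series = series

    def __getitem__(self, position):
        # Position-based read: labels are ignored entirely.
        return self._series.values[position]

    def __setitem__(self, position, value):
        # Position-based write into the underlying storage.
        self._series.values[position] = value


class ToySeries:
    """Toy stand-in for a Series; not the arkouda class."""

    def __init__(self, data, index):
        self.values = list(data)
        self.index = list(index)

    @property
    def iloc(self):
        # A property (not a method) so that s.iloc[...] works with brackets.
        return _ILocIndexer(self)


s = ToySeries(data=[10, 20, 30], index=["a", "b", "c"])
first = s.iloc[0]   # read through the locator
s.iloc[2] = 99      # write through the locator
```

Because the locator holds a reference to the series instead of a copy, writes through `s.iloc[...]` are visible on `s` itself.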
Right now, `__getitem__` has only the most basic support. It will need to be expanded to support row label types (strings, ints, floats), lists and arrays of row label types, and lists and arrays of bools.
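The key kinds listed above could be dispatched roughly as follows. This is a sketch on plain Python lists, not arkouda code: the function name `select` is hypothetical, and resolving label lists through a membership mask is an assumption made to keep the example short.

```python
def select(index, data, key):
    """Return (labels, values) selected by key; a toy model only."""
    if isinstance(key, list) and key and all(isinstance(k, bool) for k in key):
        # List of bools: use it directly as a mask over the rows.
        if len(key) != len(index):
            raise IndexError("boolean key must match the series length")
        mask = key
    else:
        # Scalar label or list of labels: normalize to a list of labels.
        labels = key if isinstance(key, list) else [key]
        missing = [k for k in labels if k not in index]
        if missing:
            raise KeyError(missing)
        mask = [label in labels for label in index]
    picked = [(l, v) for l, v, keep in zip(index, data, mask) if keep]
    return [l for l, _ in picked], [v for _, v in picked]


index = ["a", "b", "c"]
data = [1.5, 2.5, 3.5]

scalar_result = select(index, data, "b")
list_result = select(index, data, ["a", "c"])
bool_result = select(index, data, [True, False, True])
```

Here a label list and the equivalent boolean mask select the same rows, which is what makes a mask a convenient common representation.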
I've got first passes at the two methods for the `[]` operator implemented and tested. There's at least one case not yet implemented for `__setitem__`, which is when some keys are already in the Series and others are new:
- [x] Test `__setitem__` for list of keys where some keys are already in the Series and others are new
- [x] Implement `__setitem__` for list of keys where some keys are already in the Series and others are new
Next are `loc` and `at`, which, for Series, I'm pretty sure are just aliases for the `[]` operator. This will need more rigorous confirmation, but I think that's beyond the scope of this issue.
Like `loc` and `at`, because Series are one-dimensional, `iloc` and `iat` should implement the same behavior as each other. This behavior is not the same as the `[]` operator, as these use position-based indexing rather than label-based indexing.
Just found another mismatch, for `__setitem__`. When the rhs is a list of length one, and the key is a repeated label, the single value in the list is assigned to all entries with the given key.
- [x] rhs list, repeated label key
Found a mismatch with likely performance implications. Consider the following Series and access to it.
s = Series(data=[0,1,2,3,4], index=['a','b','c','d','c'])
access = s[['c','d']]
In the implementation I wrote, the values of `access` are `[2,3,4]` and the labels are `['c','d','c']`. The arguments to `[]` are used to calculate a mask.
In Pandas, the values of `access` are `[2,4,3]` and the labels are `['c','c','d']`. Here, it seems to concatenate the results of the scalar accesses for each of the index arguments.
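The two orderings can be reproduced on plain Python lists. This is a sketch of the two strategies as described above, not the arkouda or Pandas implementation, and both function names are hypothetical.

```python
def select_by_mask(index, data, keys):
    """Mask strategy: keep rows whose label is in keys, in original row order."""
    wanted = set(keys)
    mask = [label in wanted for label in index]
    labels = [l for l, keep in zip(index, mask) if keep]
    values = [v for v, keep in zip(data, mask) if keep]
    return labels, values


def select_per_key(index, data, keys):
    """Per-key strategy: concatenate the matches for each key, in key order."""
    labels, values = [], []
    for key in keys:
        for label, value in zip(index, data):
            if label == key:
                labels.append(label)
                values.append(value)
    return labels, values


index = ["a", "b", "c", "d", "c"]
data = [0, 1, 2, 3, 4]

mask_result = select_by_mask(index, data, ["c", "d"])
# (['c', 'd', 'c'], [2, 3, 4])

per_key_result = select_per_key(index, data, ["c", "d"])
# (['c', 'c', 'd'], [2, 4, 3])
```

In this sketch the mask strategy makes one pass over the rows, while the per-key strategy makes one pass per key. A real implementation may well avoid that cost, but the difference hints at where the performance implications could come from.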