
Match `Series` indexing behavior to Pandas

Open brandon-neth opened this issue 1 year ago • 3 comments

Backing issue: #2898

Following the 1/8 and 1/10 discussions, this issue tracks implementing the core indexing methods for Series objects so that they match Pandas behavior. Methods to implement:

  • [x] __getitem__
  • [x] __setitem__
  • [x] loc
  • [x] iloc
  • [x] at
  • [x] iat

Part of the challenge here will be supporting the bracket-based argument lists for the latter four. These "methods" will probably need to be fields of the Series class that return a special locator object. The locator object will then implement the actual functionality in its [] operators.
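As a rough illustration of the locator-object idea, here is a minimal pure-Python sketch. Everything in it (`ToySeries`, `_LabelLocator`, `_PositionLocator`, `_get_by_label`, `_set_by_label`) is a hypothetical name made up for this example, not Arkouda or Pandas code; it only shows how a property can return an object whose [] operators do the work.

```python
class _LabelLocator:
    """Hypothetical helper: label-based access through []."""

    def __init__(self, series):
        self._series = series

    def __getitem__(self, key):
        return self._series._get_by_label(key)

    def __setitem__(self, key, value):
        self._series._set_by_label(key, value)


class _PositionLocator:
    """Hypothetical helper: position-based access through []."""

    def __init__(self, series):
        self._series = series

    def __getitem__(self, pos):
        return self._series.values[pos]

    def __setitem__(self, pos, value):
        self._series.values[pos] = value


class ToySeries:
    """Toy stand-in for a Series, backed by plain Python lists."""

    def __init__(self, data, index):
        self.values = list(data)
        self.index = list(index)

    def _get_by_label(self, key):
        matches = [v for lbl, v in zip(self.index, self.values) if lbl == key]
        if not matches:
            raise KeyError(key)
        # A unique label yields a scalar, a repeated label yields a list.
        return matches[0] if len(matches) == 1 else matches

    def _set_by_label(self, key, value):
        hits = [i for i, lbl in enumerate(self.index) if lbl == key]
        if not hits:
            raise KeyError(key)
        for i in hits:
            self.values[i] = value

    @property
    def loc(self):
        # Accessing s.loc returns a locator; s.loc[key] then calls its [].
        return _LabelLocator(self)

    @property
    def iloc(self):
        return _PositionLocator(self)


s = ToySeries([10, 20, 30], index=["a", "b", "c"])
print(s.loc["b"])   # 20
print(s.iloc[0])    # 10
s.loc["b"] = 99
print(s.values)     # [10, 99, 30]
```

Using properties keeps the `s.loc[...]` call syntax while letting each accessor carry its own indexing rules.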

Right now, __getitem__ has only the most basic support. It will need to be expanded to handle row labels (strings, ints, floats), lists and arrays of row labels, and lists and arrays of bools.
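One way to picture the key kinds listed above is a small dispatch helper. This is only a sketch in plain Python: `classify_key` is a hypothetical function, and it treats lists and tuples as the "list or array" case rather than handling real array types.

```python
def classify_key(key):
    """Hypothetical helper: name the kind of key passed to []."""
    if isinstance(key, (list, tuple)):
        # A non-empty sequence made up only of bools is treated as a mask.
        if len(key) > 0 and all(isinstance(k, bool) for k in key):
            return "boolean mask"
        return "list of labels"
    # Note: bool is a subclass of int in Python, so a lone True/False
    # lands here as a scalar label in this sketch.
    if isinstance(key, (str, int, float)):
        return "scalar label"
    raise TypeError(f"unsupported key type: {type(key).__name__}")


print(classify_key("a"))            # scalar label
print(classify_key(["a", "b"]))     # list of labels
print(classify_key([True, False]))  # boolean mask
```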

brandon-neth avatar Jan 10 '24 20:01 brandon-neth

I've got first passes at the two methods for the [] operator tested and implemented. There's at least one case not yet implemented for __setitem__, which is when some keys are already in the Series and others are new:

  • [x] Test __setitem__ for list of keys where some keys are already in the Series and others are new
  • [x] Implement __setitem__ for list of keys where some keys are already in the Series and others are new

Next are loc and at, which, for Series, I'm pretty sure are just aliases for the [] operator. This will need more rigorous confirmation, but I think that's beyond the scope of this issue.

Similarly, because Series are one-dimensional, iloc and iat should behave the same as each other. Their behavior is not the same as the [] operator, though, since they index by position rather than by label.
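The label-versus-position distinction is easiest to see when the labels are themselves integers. The sketch below uses plain Python lists rather than Pandas or Arkouda, and both helper names are made up for the example.

```python
def get_by_label(index, values, label):
    """Label-based lookup for a unique label (the loc/at style)."""
    return values[index.index(label)]


def get_by_position(values, pos):
    """Position-based lookup (the iloc/iat style)."""
    return values[pos]


index = [2, 1, 0]       # integer labels, deliberately out of order
values = [10, 20, 30]

print(get_by_label(index, values, 0))  # 30, the entry labelled 0
print(get_by_position(values, 0))      # 10, the first entry
```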

brandon-neth avatar Jan 16 '24 23:01 brandon-neth

Just found another mismatch, for __setitem__. When the rhs is a list of length one, and the key is a repeated label, the single value in the list is assigned to all entries with the given key.

  • [x] rhs list, repeated label key
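The assignment described above, where a length-one list is written to every entry sharing a label, can be sketched in plain Python. `set_by_label` is a hypothetical helper, not the Arkouda implementation, and it leaves out the case where the label is new to the Series.

```python
def set_by_label(index, values, label, rhs):
    """Assign rhs to every position whose label equals `label`, in place."""
    hits = [i for i, lbl in enumerate(index) if lbl == label]
    if not hits:
        # Adding a new label is out of scope for this sketch.
        raise KeyError(label)
    if not isinstance(rhs, (list, tuple)):
        rhs = [rhs]
    if len(rhs) == 1:
        # A single value is repeated across all matching entries.
        rhs = list(rhs) * len(hits)
    if len(rhs) != len(hits):
        raise ValueError(
            f"cannot assign {len(rhs)} values to {len(hits)} entries"
        )
    for i, v in zip(hits, rhs):
        values[i] = v


index = ["a", "b", "a"]
values = [0, 1, 2]
set_by_label(index, values, "a", [9])
print(values)  # [9, 1, 9]
```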

brandon-neth avatar Jan 17 '24 00:01 brandon-neth

Found a mismatch with likely performance implications. Consider the following Series and access to it.

s = Series(data=[0,1,2,3,4], index=['a','b','c','d','c'])
access = s[['c','d']]

In the implementation I wrote, the values of access are [2,3,4] and the labels are ['c','d','c']. The arguments to [] are used to calculate a mask, so the result keeps the order in which the entries appear in the Series.

In Pandas, the values of access are [2,4,3] and the labels are ['c','c','d']. Here, it seems to concatenate the results of the scalar accesses for each of the index arguments.
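The two orderings can be reproduced with plain Python lists. This is only a sketch of the two strategies as described above, with hypothetical function names; it is not how either library is actually implemented.

```python
def select_by_mask(index, values, keys):
    """Keep entries whose label is in keys, in the Series' own order."""
    wanted = set(keys)
    mask = [lbl in wanted for lbl in index]
    labels = [lbl for lbl, keep in zip(index, mask) if keep]
    vals = [v for v, keep in zip(values, mask) if keep]
    return labels, vals


def select_by_key_order(index, values, keys):
    """Concatenate the matches for each key, in the order keys are given."""
    labels, vals = [], []
    for key in keys:
        for lbl, v in zip(index, values):
            if lbl == key:
                labels.append(lbl)
                vals.append(v)
    return labels, vals


index = ["a", "b", "c", "d", "c"]
values = [0, 1, 2, 3, 4]

print(select_by_mask(index, values, ["c", "d"]))
# (['c', 'd', 'c'], [2, 3, 4])
print(select_by_key_order(index, values, ["c", "d"]))
# (['c', 'c', 'd'], [2, 4, 3])
```

The mask version makes one pass over the index. The naive key-order version above scans the index once per key, which is where the performance concern comes from.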

brandon-neth avatar Jan 18 '24 21:01 brandon-neth