arkouda icon indicating copy to clipboard operation
arkouda copied to clipboard

Categorical.sort bug

Open ajpotts opened this issue 9 months ago • 1 comments

Categorical.sort has a bug and no unit tests:

In [67]:         c1 = Categorical(ak.array(["a", "a", "b"]))
    ...:         c2 = Categorical(ak.array(["a", "b", "a"]))

In [68]: c2.sort()
Out[68]: array(['a', 'b', 'a'])


ajpotts avatar May 03 '24 17:05 ajpotts

I couldn't help but dig into this a bit. It seems bill wrote this over four years ago, so it's kinda crazy it hasn't popped up until now (maybe one of the underlying functions has changed since then)

https://github.com/Bears-R-Us/arkouda/blob/bc641635c9e01d1a66abd8a7a199e1087645c622/arkouda/categorical.py#L810-L816

EDIT: The below approach doesn't work and just happened to work on the example. This is pretty obvious when you think about the respective lengths of the categories and codes

Incorrect approach

I messed around with this just a tiny bit and I don't quite understand why we need the inverse and why we can't just use the argsort to reorder the codes. I tried this out on master

def sort(self):
    # __doc__ = sort.__doc__
    idxperm = argsort(self.categories)
    return Categorical.from_codes(self.codes[idxperm], self.categories)

and it does seem to do what we want at least for this case (with comm=none):

In [3]: s = ak.array(["a", "b", "a"])

In [4]: c = ak.Categorical(s)

In [5]: c.sort()
Out[5]: array(['a', 'a', 'b'])

I'd need to think about it more to see if there's any issues with this approach

stress-tess avatar May 08 '24 19:05 stress-tess