arkouda icon indicating copy to clipboard operation
arkouda copied to clipboard

Strings groupby performance

Open stress-tess opened this issue 1 year ago • 1 comments

Overarching issue tracking string groupby performance. The way groupby is implemented relies heavily on unique and LSDRadixSort

  • [x] #2873
  • [ ] Investigate faster ways of handling radix sort >8 digits

Note: @ronawho is assigned only for his awareness and so he can weigh in on my wild ideas and provide suggestions if he wants to. I don't expect him to write any code for this

stress-tess avatar Oct 30 '23 22:10 stress-tess

Things to consider:

  • how does a known fixed short length change things?
  • can we pre-sort based on length and use that info somehow? Like if there's only one long string then that should be last and the remaining strings can take advantage of the short str fast path... this pretty easily scales up to ignoring any number of long strings provided they all have unique lengths
    • i.e. my thinking is something along the lines of coargsort([strings.get_lengths(), strings.to_uint_array()])
    • would something like a segmentedSort make sense where we first group on string length and then sort within the groups? If so, we might want to consider a different sorting algorithm since we'd likely be sorting on smaller chunks of data than radix sort was built/optimized for

stress-tess avatar Dec 05 '23 13:12 stress-tess