arkouda
arkouda copied to clipboard
Strings groupby performance
Overarching issue tracking string groupby performance. The way groupby is implemented relies heavily on unique
and LSDRadixSort
- [x] #2873
- [ ] Investigate faster ways of handling radix sort >8 digits
Note: @ronawho is assigned only for his awareness and so he can weigh in on my wild ideas and provide suggestions if he wants to. I don't expect him to write any code for this
Things to consider:
- how does a known fixed short length change things?
- can we pre-sort based on length and use that info somehow? Like if there's only one long string then that should be last and the remaining strings can take advantage of the short str fast path... this pretty easily scales up to ignoring any number of long strings provided they all have unique lengths
- i.e. my thinking is something along the lines of
coargsort([strings.get_lengths(), strings.to_uint_array()])
- would something like a
segmentedSort
make sense where we first group on string length and then sort within the groups? If so, we might want to consider a different sorting algorithm since we'd likely be sorting on smaller chunks of data than radix sort was built/optimized for
- i.e. my thinking is something along the lines of