[PERF] Improve "isin" performance by only sorting once
Is your feature request related to a problem? Please describe.
While benchmarking cuDF-python, I noticed that bench_isin has low end-to-end data throughput (<10GB/s). A closer look at the profiles showed that the data is being sorted twice, first with .sort_values() and then as part of drop_duplicates(). The following profile is for a test dataframe with one column and 100K rows, and uses the isin argument range(1000) in the bench_isin benchmark.
When calling isin with a dataframe or dict argument, the profile shows two calls to Frame.argsort.
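A minimal NumPy sketch (hypothetical data, not the actual cuDF code path) of the redundant work described above: the column is sorted once explicitly, and then the deduplication step sorts it again internally instead of exploiting the existing order:

```python
import numpy as np

# Stand-in column data (hypothetical; not the benchmark dataset).
col = np.array([5, 1, 3, 1, 5, 2])

# First sort: analogous to .sort_values().
order = np.argsort(col, kind="stable")
sorted_col = col[order]

# Second sort, hidden inside deduplication: np.unique sorts its
# input again even though sorted_col is already ordered.
deduped = np.unique(sorted_col)
print(deduped.tolist())  # [1, 2, 3, 5]
```

The second sort is pure overhead here, which is consistent with the two Frame.argsort calls showing up in the profile.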
Describe the solution you'd like
For a performance improvement, I'd like to refactor isin to only sort the data once. We should prefer the libcudf unique function to drop_duplicates for pre-sorted data.
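As a rough illustration of why this helps, deduplicating already-sorted data only needs a single linear pass comparing neighbors, the same idea behind libcudf's unique. A NumPy sketch (hypothetical helper name, not the cuDF implementation):

```python
import numpy as np

def unique_sorted(arr):
    """Deduplicate an array that is already sorted, in O(n),
    instead of re-sorting it as a drop_duplicates-style call would."""
    if arr.size == 0:
        return arr
    # Keep each element that differs from its predecessor.
    keep = np.empty(arr.size, dtype=bool)
    keep[0] = True
    np.not_equal(arr[1:], arr[:-1], out=keep[1:])
    return arr[keep]

values = np.array([5, 1, 3, 1, 5, 2])
sorted_vals = np.sort(values)          # sort once
deduped = unique_sorted(sorted_vals)   # reuse the sort: no second sort
print(deduped.tolist())  # [1, 2, 3, 5]
```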
Describe alternatives you've considered n/a
Additional context n/a
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
I see an extraneous sort_values call in Column._obtain_isin_result that can be removed, but I don't see any performance impact locally when I make that change. Perhaps you can try it locally and see if it improves things for you. However, I suspect part of the problem is that we are explicitly sorting and then unsorting the data: on the previous line we perform a join requesting a sorted result, only to unsort it afterwards. I'd have to dig a bit deeper to understand exactly why. I assume there's some pandas compatibility at play here; perhaps @galipremsagar or @shwina knows offhand.
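To make the sort-then-unsort pattern concrete, here is a hypothetical NumPy sketch (isin_sorted_then_unsorted is an illustrative name, not a cuDF function): the membership result is computed on a sorted copy and then scattered back to the original row order, so both the argsort and the final scatter are extra work relative to an unordered membership test:

```python
import numpy as np

def isin_sorted_then_unsorted(column, values):
    """Membership test that sorts the column, tests the sorted copy,
    then scatters the result back to the original row order."""
    order = np.argsort(column, kind="stable")   # the explicit sort
    sorted_col = column[order]
    hits_sorted = np.isin(sorted_col, values)   # membership, sorted order
    # "Unsort": scatter results back to the original row positions.
    result = np.empty_like(hits_sorted)
    result[order] = hits_sorted
    return result

column = np.array([3, 1, 2, 1])
mask = isin_sorted_then_unsorted(column, np.array([1, 3]))
print(mask.tolist())  # [True, True, False, True]
```

The output matches a direct np.isin(column, values), which is what suggests the sort/unsort round trip may be avoidable unless it exists for pandas-compatibility reasons.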