cudf
cudf copied to clipboard
[FEA] Refactor various conditional join implementations for simplicity and API consistency
Is your feature request related to a problem? Please describe. #9917 and #10037 added new join functions that use a mixture of hash lookups and AST expression evaluation. In the interest of time, a number of important performance improvements, API design questions, and general internal refactorings were overlooked. This issue aims to track those potential improvements for future work.
Describe the solution you'd like
- [x] Implement benchmarks for mixed joins. This is a critical first step to evaluate the importance of other changes.
- [ ] Implement object-oriented APIs for mixed joins.
- [ ] Remove size APIs for semi/anti joins (both mixed and pure conditional) in favor of a single-kernel approach. The size APIs aren't really necessary since the primary use case for those is knowing when the results will be large enough to spill in multi-GPU cases, and for semi/anti joins the size is bounded by just the number of rows a table N (rather than N^2). It would be simpler and more efficient to use an approach like the hash semi/anti joins, which essentially just generate a gather mask.
- [x] #15250 for mixed semi/anti-joins
- [ ] https://github.com/rapidsai/cudf/pull/14646#discussion_r1568113565 for conditional semi/anti-joins
- [x] Rework conditional join internals for semi/anti joins to use kernels that don't allocate the second output vector, which is wasteful.
- [x] https://github.com/rapidsai/cudf/pull/14646
- [ ] Find cleaner solutions for the expression evaluator shared memory handling, see https://github.com/rapidsai/cudf/pull/9917#discussion_r779021242.
- [ ] Reduce compile time if possible (https://github.com/rapidsai/cudf/pull/9917#discussion_r771559204).
- [ ] Explore more code sharing (perhaps via templating) between mixed and pure conditional joins, as well as more sharing with the existing hash join infrastructure (should be facilitated by implementing the object-oriented API).
- [ ] Revisit naming of different join APIs. In the long term the joins probably shouldn't include names based on whether they are equality, conditional, or mixed joins, they should just be named by the type of join (inner, left, etc) and rely on the signature to differentiate the rest.
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
This issue has been labeled inactive-90d
due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Hi, I'd like to try helping out with some of this work. I've a few questions to start off with though. For example, for the first unchecked task "Implement OOP APIs for mixed joins" is there an associated issue with this? A simple search through the available issues does not show anything.
Is there something wrong with the current API that makes it hard for downstream users to use? Is there a general direction for refactoring the API?
Also another example, I'm still new to cudf(I also admit to having never used it before either). If I am understanding correctly, if you remove the size API from the join
s this could be a breaking change correct?
edit: Ok, I will take another glance when I have time and report back here if i run into any problems. I think the size API one should be reasonable for me to approach.
Hi @ZelboK sorry for the slow response, let me try and answer your questions now:
For example, for the first unchecked task "Implement OOP APIs for mixed joins" is there an associated issue with this? A simple search through the available issues does not show anything.
There is not a separate issue for this, no. What this is referring to is that we have a hash_join
class that may be used to perform multiple joins where one of the tables is always the same. By building the hash table once and then probing it repeatedly, we can amortize the cost of building the hash table. That object supports hash joins only right now, but mixed joins would also benefit from the same performance gains since they are part hash join.
Is there something wrong with the current API that makes it hard for downstream users to use? Is there a general direction for refactoring the API?
More generally here I think we'd like to migrate towards API names that are less confusing. It's not obvious what a "mixed" or "conditional" join is. Ideally we'd have the function signature tell us everything we need about the nature of the join (equality vs arbitrary AST-based condition).
If I am understanding correctly, if you remove the size API from the joins this could be a breaking change correct?
Yes, this would be a breaking change. In general we are OK with making breaking changes, though, so that's not a problem. We just need to communicate that effectively. In this case I don't expect the breaking change to affect many users because the current approach doesn't add any meaningful functionality.
All good. I haven't had time anyway. I will have some time this week though.
With that being said, are the devs active on any specific slack channel that isn't #general? There's #oss-development but not many people are in it. I personally would like to try these tasks out independently without guidance(so I can learn more), but I occasionally have some questions. Are these issues the best way to communicate?
Are these issues the best way to communicate?
Issues (or comments on pull requests) are good for communication about specific topics in the code or coordinating work. Slack is fine for general questions.
So for the removing size APIs, are you referring to these functions?
https://docs.rapids.ai/api/libcudf/nightly/group__column__join.html#ga00b702723fd8953d5de802bc37965525 https://docs.rapids.ai/api/libcudf/nightly/group__column__join.html#gaecfa4e8182521bb5630adf1bb0b609c2
I imagine this just means to remove them completely and their uses anywhere? Or is there more to this issue that I am missing? In the original paragraph I see the emphasis on wanting to transition over to a "single-kernel approach". IIRC mixed_join_semi
calls several kernels so perhaps it is desired to refactor this to reduce the overhead of the additional kernels?
Hi,
I actually have some code set up. I intend on pushing a PR up hopefully this week should I have enough time. That way I can get feedback on whether or not my design is appropriate or not too. Is it possible to assign some of these issues to myself?
This is a meta-issue with several independent tasks. We can cross-link the PR in the checklist and assign the PR to you once it’s open. Thanks!