modin
modin copied to clipboard
FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators
Apply approaches from PR-7136 for TreeReduce and GroupByReduce operators
What do these changes do?
- [x] first commit message and PR title follow format outlined here
NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
- [x] passes
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py - [x] passes
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py - [x] signed commit with
git commit -s - [x] Resolves #?
- [x] tests added and passing
- [x] module layout described at
docs/development/architecture.rstis up-to-date
@Retribution98 do you have any performance numbers?
@anmyachev This case is similar to the previous PR, so we can expect the same performance. Using 112 CPU:
| df.count | partitions shape | main | this PR |
|---|---|---|---|
| (Using 112 CPU) | (112, 1) | 0.202289 | 0.19788 |
| (12544, 1) | 13.67759 | 10.99517 | |
| (112, 112) | 4.544378 | 1.760422 |
It's also a good idea to add tests for the new operators, which now work a little differently.
Since the logic is at a lower level, I modified the test to test this and it covered all cases where map_partitions is used.
@Retribution98, could you also check performance for dtypes, which is part of https://github.com/modin-project/modin/issues/2751?