modin icon indicating copy to clipboard operation
modin copied to clipboard

FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators

Open Retribution98 opened this issue 1 year ago • 2 comments

Apply approaches from PR-7136 for TreeReduce and GroupByReduce operators

What do these changes do?

  • [x] first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • [x] passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • [x] passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • [x] signed commit with git commit -s
  • [x] Resolves #?
  • [x] tests added and passing
  • [x] module layout described at docs/development/architecture.rst is up-to-date

Retribution98 avatar May 08 '24 13:05 Retribution98

@Retribution98 do you have any performance numbers?

@anmyachev This case is similar to the previous PR, so we can expect the same performance. Using 112 CPU:

df.count partitions shape main this PR
(Using 112 CPU) (112, 1) 0.202289 0.19788
  (12544, 1) 13.67759 10.99517
  (112, 112) 4.544378 1.760422

Retribution98 avatar May 08 '24 16:05 Retribution98

It's also a good idea to add tests for the new operators, which now work a little differently.

Since the logic is at a lower level, I modified the test to test this and it covered all cases where map_partitions is used.

Retribution98 avatar May 08 '24 16:05 Retribution98

@Retribution98, could you also check performance for dtypes, which is part of https://github.com/modin-project/modin/issues/2751?

YarShev avatar May 13 '24 13:05 YarShev