modin
PERF-#4494: Get partition widths/lengths in parallel instead of serially
What do these changes do?
Computes widths and lengths of block partitions in parallel, as batched calls to `ray.get`/`DaskWrapper.materialize`, rather than serially.
This adds the `try_build_[length|width]_cache` and `try_set_[length|width]_cache` methods to block partitions: the former returns a promise/future that computes the partition's length, and the latter is called by the partition manager to pass the computed value back to the block partition. It also adds the `_update_partition_dimension_caches` method to the `PartitionManager` class, which resolves the length/width futures returned by its constituent partitions.
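To make the protocol concrete, here is a minimal sketch of the build/set-cache handshake described above. This is not Modin's actual implementation: it uses `concurrent.futures` threads as a stand-in for Ray/Dask futures, and the toy `BlockPartition`/`PartitionManager` classes only borrow the method names from this PR.

```python
# Illustrative sketch only -- not Modin's real code. Threads stand in for
# Ray/Dask futures; method names follow the protocol described in the PR.
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

class BlockPartition:
    def __init__(self, data):
        self._data = data          # stand-in for the remote partition data
        self._length_cache = None  # cached row count, if already known

    def try_build_length_cache(self):
        # Return the cached value if present, else a future computing it.
        if self._length_cache is not None:
            return self._length_cache
        return _pool.submit(lambda: len(self._data))

    def try_set_length_cache(self, value):
        # Called by the partition manager with the materialized result.
        self._length_cache = value

class PartitionManager:
    @staticmethod
    def _update_partition_dimension_caches(partitions):
        # Kick off every length computation first, then materialize them
        # together -- the analogue of one batched ray.get on a list.
        futures = [p.try_build_length_cache() for p in partitions]
        for p, f in zip(partitions, futures):
            p.try_set_length_cache(f if isinstance(f, int) else f.result())

parts = [BlockPartition(list(range(n))) for n in (3, 5, 2)]
PartitionManager._update_partition_dimension_caches(parts)
```

After the call, every partition's `_length_cache` is populated, and the waiting for all futures happened in one batch rather than one partition at a time.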
- [x] commit message follows format outlined here
- [x] passes `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- [x] passes `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- [x] signed commit with `git commit -s`
- [x] Resolves #4494
- [ ] tests added and passing
- [x] module layout described at `docs/development/architecture.rst` is up-to-date
- [ ] added (Issue Number: PR title (PR Number)) and github username to release notes for next major release
Codecov Report
Merging #4683 (490778c) into master (8e1190c) will decrease coverage by 13.12%. The diff coverage is 67.93%.
```diff
@@            Coverage Diff            @@
##           master    #4683      +/-  ##
===========================================
- Coverage   85.28%   72.15%   -13.13%
===========================================
  Files         259      259
  Lines       19378    19496     +118
===========================================
- Hits        16527    14068    -2459
- Misses       2851     5428    +2577
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...s/pandas_on_dask/partitioning/virtual_partition.py | 62.99% <0.00%> (-23.74%) | :arrow_down: |
| ...ns/pandas_on_ray/partitioning/virtual_partition.py | 71.66% <6.66%> (-16.07%) | :arrow_down: |
| ...lementations/pandas_on_dask/dataframe/dataframe.py | 80.76% <25.00%> (-15.07%) | :arrow_down: |
| ...dataframe/pandas/partitioning/partition_manager.py | 75.67% <75.00%> (-10.79%) | :arrow_down: |
| ...entations/pandas_on_dask/partitioning/partition.py | 79.77% <81.81%> (-9.25%) | :arrow_down: |
| ...plementations/pandas_on_ray/dataframe/dataframe.py | 84.44% <82.50%> (-15.56%) | :arrow_down: |
| modin/core/dataframe/pandas/dataframe/dataframe.py | 71.44% <100.00%> (-22.89%) | :arrow_down: |
| ...in/core/dataframe/pandas/partitioning/partition.py | 100.00% <100.00%> (ø) | |
| ...mentations/pandas_on_ray/partitioning/partition.py | 91.66% <100.00%> (+0.51%) | :arrow_up: |
| ...ns/pandas_on_ray/partitioning/partition_manager.py | 83.50% <100.00%> (+2.68%) | :arrow_up: |
| ... and 84 more | | |
Haven't taken a closer look at the implementation details, but do you have any benchmarks or performance measurements to compare with master?
Sadly no, and I’d appreciate some suggestions on what code to run. Rehan suggested manually invalidating the ._row_lengths_cache and .length_cache fields on a dataframe and its partitions, then ensuring they’re recomputed properly. It succeeds for simple examples, but I had trouble producing a Ray timeline, and I’m not sure how else to benchmark it (most API-level dataframe manipulations would probably hit the cached length/width).
@noloerino @pyrito
> Sadly no, and I’d appreciate some suggestions on what code to run.
I spent a while today trying to get a script that showcases the performance here without breaking anything in Modin, but I failed. Getting a reproducer is hard for a few reasons.
For one thing, this optimization only helps in unusual cases like #4493, where the partitions' call queues contain costly operations. When there is no call queue, the partitions execute all dataframe functions eagerly, computing shapes at the same time. Call queues are generally meant to carry cheap operations like transpose and reindexing, but the reproducer in that issue has a frame that is very expensive to serialize, so even the transpose was expensive. There the slow code was in `_copartition`, which unnecessarily recalculated the widths of the base frame. #4495 fixed that unnecessary recalculation, so that script no longer demonstrates the problem. Also, every `PandasDataFrame` computes all of its partition lengths as soon as it is constructed, when it filters out empty subframes here, so any Modin dataframe at rest already knows its partition shapes.
Looking at all the serial shape computations I listed here, most are internal length computations. One is in `_copartition`; I spent a while trying to get around the cache fix in #4495 with a pair of frames that genuinely needed copartitioning, but in that case the `map_axis_partitions` call inside `_copartition` already triggers parallel computation. The last kind of length computation is in `apply_func_to_indices_both_axis`, which as far as I can tell is only used in `melt`. We could try engineering an example that bypasses the cache for `melt`, but I don't think it's worth the time...
I think it's good practice to fetch multiple Ray objects in parallel (see also this note about a similar improvement in `_to_pandas`). Also, if our caches are invalidated for any reason later on, we get faster length computation as a fallback.
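The latency argument for batching can be sketched without Ray at all. The following hypothetical timing experiment uses `concurrent.futures` threads as a stand-in for remote tasks: launching all "remote" length computations up front and collecting them together (the analogue of one `ray.get` over a list of object refs) overlaps their latencies, while fetching one result at a time pays the full latency per partition.

```python
# Hypothetical sketch, not Modin/Ray code: threads simulate remote tasks
# whose latency dominates, to show why one batched collection beats
# serial submit-and-wait.
import time
from concurrent.futures import ThreadPoolExecutor

def remote_len(delay=0.05):
    time.sleep(delay)  # stand-in for remote work / transfer latency
    return 10

with ThreadPoolExecutor(max_workers=8) as pool:
    # Batched: submit everything first, then collect -- latencies overlap.
    t0 = time.perf_counter()
    futs = [pool.submit(remote_len) for _ in range(8)]
    batched = [f.result() for f in futs]
    t_batched = time.perf_counter() - t0

    # Serial: submit and wait one at a time -- latencies add up.
    t0 = time.perf_counter()
    serial = [pool.submit(remote_len).result() for _ in range(8)]
    t_serial = time.perf_counter() - t0
```

With 8 tasks of ~50 ms each, the batched path takes roughly one task's latency while the serial path takes roughly the sum, which is the same shape of win this PR targets for partition widths/lengths.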
This adds a certain amount of complexity (judging by the number of lines changed; I haven't looked at the diff yet), and I haven't seen any performance proof for it yet. I would like to see some measurements before growing our (already huge) codebase...
@vnlitvinov that makes sense, I'll look into coming up with concrete benchmarks.
@pyrito please have a look at https://github.com/vnlitvinov/modin/tree/speedup-masking and https://github.com/modin-project/modin/pull/4726, it might be doing somewhat the same in terms of getting the sizes in parallel
Related discussion on handling metadata (index and columns) in https://github.com/modin-project/modin/issues/3673.