modin
PERF-#4494: Get partition widths/lengths in parallel instead of serially
What do these changes do?
Computes widths and lengths of block partitions in parallel, as batched calls to `ray.get`/`DaskWrapper.materialize`, rather than serially.
This adds the `try_build_[length|width]_cache` and `try_set_[length|width]_cache` methods to block partitions: the former returns a promise/future that computes the partition's length, and the latter is called by the partition manager to pass the computed value back to the block partition. It also adds the `_update_partition_dimension_caches` method to the `PartitionManager` class, which resolves the length/width futures returned by its constituent partitions.
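To make the protocol concrete, here is a minimal sketch of the build/set-cache handshake described above. This is not Modin's actual implementation: it uses `concurrent.futures` threads as a stand-in for Ray/Dask futures, and the toy `BlockPartition`/`PartitionManager` classes only borrow the method names from this PR.

```python
# Illustrative sketch only -- not Modin's real code. Threads stand in for
# Ray/Dask futures; method names follow the protocol described in the PR.
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

class BlockPartition:
    def __init__(self, data):
        self._data = data          # stand-in for the remote partition data
        self._length_cache = None  # cached row count, if already known

    def try_build_length_cache(self):
        # Return the cached value if present, else a future computing it.
        if self._length_cache is not None:
            return self._length_cache
        return _pool.submit(lambda: len(self._data))

    def try_set_length_cache(self, value):
        # Called by the partition manager with the materialized result.
        self._length_cache = value

class PartitionManager:
    @staticmethod
    def _update_partition_dimension_caches(partitions):
        # Kick off every length computation first, then materialize them
        # together -- the analogue of one batched ray.get on a list.
        futures = [p.try_build_length_cache() for p in partitions]
        for p, f in zip(partitions, futures):
            p.try_set_length_cache(f if isinstance(f, int) else f.result())

parts = [BlockPartition(list(range(n))) for n in (3, 5, 2)]
PartitionManager._update_partition_dimension_caches(parts)
```

After the call, every partition's `_length_cache` is populated, and the waiting for all futures happened in one batch rather than one partition at a time.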
- [x] commit message follows format outlined here
- [x] passes `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- [x] passes `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- [x] signed commit with `git commit -s`
- [x] Resolves #4494
- [ ] tests added and passing
- [x] module layout described at `docs/development/architecture.rst` is up-to-date
- [ ] added (Issue Number: PR title (PR Number)) and github username to release notes for next major release
Codecov Report
Merging #4683 (490778c) into master (8e1190c) will decrease coverage by 13.12%. The diff coverage is 67.93%.
```diff
@@            Coverage Diff            @@
##           master    #4683      +/-  ##
===========================================
- Coverage   85.28%   72.15%   -13.13%
===========================================
  Files         259      259
  Lines       19378    19496     +118
===========================================
- Hits        16527    14068    -2459
- Misses       2851     5428    +2577
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...s/pandas_on_dask/partitioning/virtual_partition.py | 62.99% <0.00%> (-23.74%) | :arrow_down: |
| ...ns/pandas_on_ray/partitioning/virtual_partition.py | 71.66% <6.66%> (-16.07%) | :arrow_down: |
| ...lementations/pandas_on_dask/dataframe/dataframe.py | 80.76% <25.00%> (-15.07%) | :arrow_down: |
| ...dataframe/pandas/partitioning/partition_manager.py | 75.67% <75.00%> (-10.79%) | :arrow_down: |
| ...entations/pandas_on_dask/partitioning/partition.py | 79.77% <81.81%> (-9.25%) | :arrow_down: |
| ...plementations/pandas_on_ray/dataframe/dataframe.py | 84.44% <82.50%> (-15.56%) | :arrow_down: |
| modin/core/dataframe/pandas/dataframe/dataframe.py | 71.44% <100.00%> (-22.89%) | :arrow_down: |
| ...in/core/dataframe/pandas/partitioning/partition.py | 100.00% <100.00%> (ø) | |
| ...mentations/pandas_on_ray/partitioning/partition.py | 91.66% <100.00%> (+0.51%) | :arrow_up: |
| ...ns/pandas_on_ray/partitioning/partition_manager.py | 83.50% <100.00%> (+2.68%) | :arrow_up: |
| ... and 84 more | | |
Haven't taken a closer look at the implementation details, but do you have any benchmarks or performance measurements to compare with master?
Sadly no, and I’d appreciate some suggestions on what code to run. Rehan suggested manually invalidating the ._row_lengths_cache and .length_cache fields on a dataframe and its partitions, then ensuring they’re recomputed properly. It succeeds for simple examples, but I had trouble producing a Ray timeline, and I’m not sure how else to benchmark it (most API-level dataframe manipulations would probably hit the cached length/width).
@noloerino @pyrito
> Sadly no, and I’d appreciate some suggestions on what code to run.
I spent a while today trying to get a script that showcases the performance here without breaking anything in Modin, but I failed. Getting a reproducer is hard for a few reasons.
For one thing, this optimization only helps in unusual cases like #4493, where the partitions' call queues contain costly operations. When there is no call queue, the partitions execute all dataframe functions eagerly, computing shapes at the same time. Call queues are generally meant to carry cheap operations like transpose and reindexing, but the reproducer in that issue has a frame that is very expensive to serialize, so even the transpose was expensive. There the slow code was in `_copartition`, which unnecessarily recalculated the widths of the base frame. #4495 fixed that unnecessary recalculation, so that script no longer demonstrates the problem. Also, every `PandasDataFrame` computes all of its partition lengths as soon as it is constructed, when it filters out empty subframes here, so any Modin dataframe at rest already knows its partition shapes.
Looking at all the serial shape computations I listed here, most are internal length computations. One is in `_copartition`; I spent a while trying to get around the cache fix in #4495 with a pair of frames that genuinely needed copartitioning, but in that case the `map_axis_partitions` call inside `_copartition` already triggers parallel computation. The last kind of length computation is in `apply_func_to_indices_both_axis`, which as far as I can tell is only used in `melt`. We could try engineering an example that bypasses the cache for `melt`, but I don't think it's worth the time...
I think it's good practice to fetch multiple Ray objects in parallel (see also this note about a similar improvement in `_to_pandas`). Also, if our caches are invalidated for any reason later on, we get faster length computation as a fallback.
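The latency argument for batching can be sketched without Ray at all. The following hypothetical timing experiment uses `concurrent.futures` threads as a stand-in for remote tasks: launching all "remote" length computations up front and collecting them together (the analogue of one `ray.get` over a list of object refs) overlaps their latencies, while fetching one result at a time pays the full latency per partition.

```python
# Hypothetical sketch, not Modin/Ray code: threads simulate remote tasks
# whose latency dominates, to show why one batched collection beats
# serial submit-and-wait.
import time
from concurrent.futures import ThreadPoolExecutor

def remote_len(delay=0.05):
    time.sleep(delay)  # stand-in for remote work / transfer latency
    return 10

with ThreadPoolExecutor(max_workers=8) as pool:
    # Batched: submit everything first, then collect -- latencies overlap.
    t0 = time.perf_counter()
    futs = [pool.submit(remote_len) for _ in range(8)]
    batched = [f.result() for f in futs]
    t_batched = time.perf_counter() - t0

    # Serial: submit and wait one at a time -- latencies add up.
    t0 = time.perf_counter()
    serial = [pool.submit(remote_len).result() for _ in range(8)]
    t_serial = time.perf_counter() - t0
```

With 8 tasks of ~50 ms each, the batched path takes roughly one task's latency while the serial path takes roughly the sum, which is the same shape of win this PR targets for partition widths/lengths.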
This adds a certain amount of complexity (judging by the number of lines changed; I haven't looked at the diff yet), and I haven't seen any performance proof for it yet. I would like to see some measurements before growing our (already huge) codebase...
@vnlitvinov that makes sense, I'll look into coming up with concrete benchmarks.
@pyrito please have a look at https://github.com/vnlitvinov/modin/tree/speedup-masking and https://github.com/modin-project/modin/pull/4726, it might be doing somewhat the same in terms of getting the sizes in parallel
Related discussion on handling metadata (index and columns) in https://github.com/modin-project/modin/issues/3673.