Reduce the amount of remote calls when reshuffling partitions
Currently, the splitting stage of reshuffling generates up to `NCORES ^ 2` partitions. There are `NCORES` `row_partitions`, each submitting a splitting kernel that in turn generates `NCORES` split partitions:
https://github.com/modin-project/modin/blob/cdedd710f2ea092f72316600335b9e8e15995d28/modin/core/dataframe/pandas/partitioning/partition_manager.py#L1591-L1599
Operating on this many partitions causes Ray to perform badly (we already hit this in #5394).
I propose reducing the number of splitting kernels by combining row parts into column parts as follows:

    num_existing_row_partitions = len(row_partitions)
    # row_partitions.shape == (X, 1), which would result in (X, X) split partitions
    ideal_num_new_row_partitions = int(num_existing_row_partitions ** 0.5)
    chunk_size = compute_chunksize(
        num_existing_row_partitions, ideal_num_new_row_partitions, min_block_size=1
    )
    new_row_partitions = np.array(
        [
            cls._column_partitions_class(row_partitions[i : i + chunk_size])
            for i in range(0, num_existing_row_partitions, chunk_size)
        ]
    )
    # new_row_partitions.shape == (X ** 0.5, 1), which results in (X ** 0.5, X) split partitions

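To make the effect on partition counts concrete, here is a minimal standalone sketch of the arithmetic only. It does not call Modin: `count_split_work` is a hypothetical helper, and the ceiling-division chunk size is an assumed stand-in for what `compute_chunksize` returns with `min_block_size=1`.

```python
import math
from typing import Tuple


def count_split_work(num_row_partitions: int, group: bool) -> Tuple[int, int]:
    """Return (num_splitting_kernels, num_split_partitions).

    Hypothetical helper for illustration; it only models the counts.
    Each splitting kernel is assumed to emit `num_row_partitions` splits.
    """
    if not group:
        # Current behaviour: one splitting kernel per row partition.
        num_kernels = num_row_partitions
    else:
        # Proposed behaviour: group row partitions into ~sqrt(X) chunks.
        ideal_groups = max(1, int(num_row_partitions ** 0.5))
        # Assumption: ceiling division approximates compute_chunksize.
        chunk_size = max(1, math.ceil(num_row_partitions / ideal_groups))
        num_kernels = math.ceil(num_row_partitions / chunk_size)
    return num_kernels, num_kernels * num_row_partitions


if __name__ == "__main__":
    for grouped in (False, True):
        kernels, splits = count_split_work(112, group=grouped)
        print(f"grouped={grouped}: {kernels} kernels, {splits} split partitions")
```

With 112 row partitions (one per thread on the machine below), the sketch gives 112 kernels and 12544 split partitions without grouping, versus 10 kernels and 1120 split partitions with grouping.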
Applying the changes gives the following results on my single-node machine:
CPU: Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz, 112 threads
case | modin + #5780 | modin + #5780 + draft impl for this tracker |
---|---|---|
repro from #5533 | 3.32s | 1.74s |
show-case from https://github.com/modin-project/modin/pull/4601#issuecomment-1331544461 with 100 cols | 3.75s | 4.44s |
show-case from https://github.com/modin-project/modin/pull/4601#issuecomment-1331544461 with 32 cols | 2.46s | 1.84s |
As you can see, the results vary from a ~1.9x speedup to an ~18% slowdown. I'm still looking for a correlation that could serve as a heuristic for dispatching between these two implementations.
I also understand that performance in a cluster environment may react differently to this change, since constructing these column partitions will cause data transfers between nodes, which may be expensive. Maybe we can also dispatch between the old and the new logic depending on whether the `IsRayCluster` environment variable is set?
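A rough sketch of what such a dispatch could look like, under stated assumptions: `pick_split_strategy` and the strategy names are hypothetical, and the cluster flag is passed in as a plain boolean rather than read from Modin's config, so nothing here reflects an existing Modin API.

```python
def pick_split_strategy(is_ray_cluster: bool) -> str:
    """Choose how the splitting stage builds its kernels.

    Hypothetical helper: 'per_row_partition' stands for the current logic,
    'grouped' for the proposed column-partition grouping.
    """
    if is_ray_cluster:
        # Grouping may move row partitions between nodes, so keep the
        # current logic on a cluster until that cost is measured.
        return "per_row_partition"
    # On a single node the grouping involves no network transfer.
    return "grouped"


if __name__ == "__main__":
    print(pick_split_strategy(is_ray_cluster=False))
    print(pick_split_strategy(is_ray_cluster=True))
```

A heuristic based on the benchmark correlation, once found, could become an extra argument to the same function.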
cc @modin-project/modin-core @RehanSD
@dchigarev, could you revisit this issue?