PERF: MinPartitionSize default should be set to 1 for performance
https://github.com/modin-project/modin/blob/9782a027568d9ad16bf2c3dea434646cec5e4898/modin/config/envvars.py#L415
What's the rationale for this?
As it currently stands, the minimum number of columns in any partition is 32, which limits parallelism. Datasets often have fewer than 32 columns, so we need to make sure we perform well on those too.
I've tried out a few column-wise reductions and aggregations with a min partition size of 1. The results for Ray are mixed on the following dataframe of about 1 million rows and 16 columns:
import modin.pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(100, size=(2 ** 20, 16)))
With min partition size 32, df._query_compiler._modin_frame._partitions.shape was (16, 1). With min partition size 1, it was (16, 16).
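For reference, a minimal sketch of how those partition shapes can be checked (it uses the internal _partitions attribute mentioned above, so it is only suitable for debugging; the exact shape also depends on the number of cores on the machine):

import modin.pandas as pd
import numpy as np
from modin.config import MinPartitionSize

# Default min partition size (32): the 16 columns end up in a single column partition.
df = pd.DataFrame(np.random.randint(100, size=(2 ** 20, 16)))
print(MinPartitionSize.get(), df._query_compiler._modin_frame._partitions.shape)

# Allow single-column partitions and rebuild the frame.
MinPartitionSize.put(1)
df = pd.DataFrame(np.random.randint(100, size=(2 ** 20, 16)))
print(MinPartitionSize.get(), df._query_compiler._modin_frame._partitions.shape)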
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| df.count() | 15 | 120 |
| df.apply(sum, axis=0) | 1150 | 220 |
| df.apply("cumsum", axis=0) | 3 | 40 |
System information:
- Operating System: macOS Big Sur 11.5.2
- Memory: 16 GB 2667 MHz DDR4
- Modin version: 0.12.0+77.ged2a7a4a
- Python version: Python 3.9.9
- Ray version: 1.9.2
Is there a better way to test performance for this change?
@mvashishtha A couple of questions:
- What kind of performance do you get as the number of rows increases?
- Does anything change if you start the benchmark from a file?
@devin-petersohn I think it might be better to call repr to make sure the async operations complete.
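For illustration, a minimal sketch of that timing pattern (assuming df is the dataframe from the snippet above); without repr the timed call can return before the Ray tasks finish, so the measurement understates the real runtime:

import time

start = time.time()
df.count()                    # may return before the distributed work has finished
print('without repr:', time.time() - start)

start = time.time()
repr(df.count())              # repr materializes the result, so this waits for the work
print('with repr:', time.time() - start)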
Here are results for a few sizes, running the benchmark in IPython:
On the original dataframe with dimensions (2 ** 20, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 35 | 220 |
| repr(df.apply(sum, axis=0)) | 1090 | 260 |
| repr(df.apply("cumsum", axis=0)) | 1550 | 10740 |
On a dataframe of dimensions (2 ** 22, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 49 | 230 |
| repr(df.apply(sum, axis=0)) | 4370 | 715 |
| repr(df.apply("cumsum", axis=0)) | 6060 | 44000 |
On a dataframe of dimensions (2 ** 23, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 108 | 243 |
| repr(df.apply(sum, axis=0)) | 8747 | 1400 |
| repr(df.apply("cumsum", axis=0)) | 2430 | 87000 |
Here are the results running the benchmark from a file:
On a dataframe of dimensions (2 ** 20, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 53 | 207 |
| repr(df.apply(sum, axis=0)) | 1098 | 258 |
| repr(df.apply("cumsum", axis=0)) | 255 | 180 |
On a dataframe of dimensions (2 ** 22, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 217 | 525 |
| repr(df.apply(sum, axis=0)) | 728 | 724 |
| repr(df.apply("cumsum", axis=0)) | 408 | 366 |
On a dataframe of dimensions (2 ** 23, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 223 | 227 |
| repr(df.apply(sum, axis=0)) | 1409 | 1407 |
| repr(df.apply("cumsum", axis=0)) | 1941 | 3718 |
Here is the benchmark code:
import time
import warnings

import modin.pandas as pd
import numpy as np
from modin.config import envvars

warnings.filterwarnings('ignore', module='modin')


def print_runtimes(f, title):
    # Run f four times and report all timings plus the average of the warm runs.
    times = []
    for _ in range(4):
        start = time.time()
        f()
        end = time.time()
        times.append(end - start)
    print(f'{title}: {times} with all except first averaging '
          f'{sum(times[1:]) / (len(times) - 1)}')


for size in ((2 ** 20, 16), (2 ** 22, 16), (2 ** 23, 16)):
    # repr forces materialization, so the timings include the distributed work.
    counter = lambda: repr(df.count())
    summer = lambda: repr(df.apply(sum, axis=0))
    cumsummer = lambda: repr(df.apply("cumsum", axis=0))

    # Restore the default (32) so every size starts from the same baseline.
    envvars.MinPartitionSize.put(32)
    df = pd.DataFrame(np.random.randint(100, size=size))
    print_runtimes(counter, f'partition size 32 and shape {size}: count')
    print_runtimes(summer, f'partition size 32 and shape {size}: sum')
    print_runtimes(cumsummer, f'partition size 32 and shape {size}: cumsum')
    print('--------------\n')

    envvars.MinPartitionSize.put(1)
    df = pd.DataFrame(np.random.randint(100, size=size))
    print_runtimes(counter, f'partition size 1 and shape {size}: count')
    print_runtimes(summer, f'partition size 1 and shape {size}: sum')
    print_runtimes(cumsummer, f'partition size 1 and shape {size}: cumsum')
    print('---------------\n' * 5)
> Does anything change if you start the benchmark from a file?

In IPython, at all sizes there's a large win for applying sum, a huge loss for applying cumsum, and a loss for count(). When running from a file, there appear to be some wins for the apply functions on the two smaller datasets, but not on the largest dataset.
> @devin-petersohn I think it might be better to call repr to make sure the async operations complete.

I'm not sure what overhead comes from repr. It would probably be better to enable BenchmarkMode:
from modin.config import BenchmarkMode
BenchmarkMode.put(True)
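As a rough sketch (assuming the same dataframe as in the benchmark above), with BenchmarkMode enabled every Modin call blocks until its distributed work finishes, so plain wall-clock timing works and no repr is needed:

import time
import modin.pandas as pd
import numpy as np
from modin.config import BenchmarkMode

BenchmarkMode.put(True)   # calls now wait for execution to finish before returning

df = pd.DataFrame(np.random.randint(100, size=(2 ** 20, 16)))
start = time.time()
df.count()                # the measured time includes the actual distributed work
print('count:', time.time() - start)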
@devin-petersohn, what do you think about having separate configuration settings MinPartitionLen and MinPartitionWidth?
@YarShev it makes sense. We don't want single-row partitions, but single-column partitions might be okay.
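For illustration only, a rough sketch of what the two settings could look like, modeled on the existing MinPartitionSize entry in modin/config/envvars.py; the names, environment variable strings, and defaults below are hypothetical, not an agreed design:

# Hypothetical config entries -- not part of Modin; names and defaults are illustrative.
from modin.config.envvars import EnvironmentVariable


class MinPartitionLen(EnvironmentVariable, type=int):
    """Minimum number of rows in a partition along the row axis (hypothetical)."""

    varname = "MODIN_MIN_PARTITION_LEN"
    default = 32  # keep today's behavior for rows: avoid single-row partitions


class MinPartitionWidth(EnvironmentVariable, type=int):
    """Minimum number of columns in a partition along the column axis (hypothetical)."""

    varname = "MODIN_MIN_PARTITION_WIDTH"
    default = 1  # allow single-column partitions for narrow dataframes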
@YarShev
> I'm not sure what overhead comes from repr. It would probably be better to enable BenchmarkMode.

When I enable BenchmarkMode, I don't see any significant difference for the benchmark I linked, with Ray on my Mac. I don't see one with either Ray or Dask on a Google Compute Engine Linux VM with these specs:
- Operating system: Ubuntu 20.04.3 LTS
- 8 vCPU
- 32 GB RAM
I think we should try another benchmark.
@mvashishtha, could you play with MinPartitionLen and MinPartitionWidth settings?