PERF: MinPartitionSize default should be set to 1 for performance
https://github.com/modin-project/modin/blob/9782a027568d9ad16bf2c3dea434646cec5e4898/modin/config/envvars.py#L415
What's the rationale for this?
As it currently stands, the minimum number of columns in any partition is 32, which limits parallelism. Datasets often have fewer than 32 columns, so we need to make sure we perform well on those too.
I've tried out a few column-wise reductions and aggregations with a min partition size of 1. The results for Ray are mixed on the following dataframe of about 1 million rows and 16 columns:
import modin.pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(100, size=(2 ** 20, 16)))
With min partition size 32, df._query_compiler._modin_frame._partitions.shape was (16, 1). With min partition size 1, it was (16, 16).
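For reference, a minimal sketch of how those partition shapes can be checked (it uses the internal _partitions attribute mentioned above, so it is only suitable for debugging; the exact shape also depends on the number of cores on the machine):

import modin.pandas as pd
import numpy as np
from modin.config import MinPartitionSize

# Default min partition size (32): the 16 columns end up in a single column partition.
df = pd.DataFrame(np.random.randint(100, size=(2 ** 20, 16)))
print(MinPartitionSize.get(), df._query_compiler._modin_frame._partitions.shape)

# Allow single-column partitions and rebuild the frame.
MinPartitionSize.put(1)
df = pd.DataFrame(np.random.randint(100, size=(2 ** 20, 16)))
print(MinPartitionSize.get(), df._query_compiler._modin_frame._partitions.shape)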
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| df.count() | 15 | 120 |
| df.apply(sum, axis=0) | 1150 | 220 |
| df.apply("cumsum", axis=0) | 3 | 40 |
System information:
- Operating System: macOS Big Sur 11.5.2
- Memory: 16 GB 2667 MHz DDR4
- Modin version: 0.12.0+77.ged2a7a4a
- Python version: Python 3.9.9
- Ray version: 1.9.2
Is there a better way to test performance for this change?
@mvashishtha A couple of questions:
- What kind of performance do you get as the number of rows increases?
- Does anything change if you start the benchmark from a file?
@devin-petersohn I think it might be better to call repr to make sure the async operations complete.
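For illustration, a minimal sketch of that timing pattern (assuming df is the dataframe from the snippet above); without repr the timed call can return before the Ray tasks finish, so the measurement understates the real runtime:

import time

start = time.time()
df.count()                    # may return before the distributed work has finished
print('without repr:', time.time() - start)

start = time.time()
repr(df.count())              # repr materializes the result, so this waits for the work
print('with repr:', time.time() - start)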
Here are results for a few sizes, running the benchmark in IPython:
On the original dataframe with dimensions (2 ** 20, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 35 | 220 |
| repr(df.apply(sum, axis=0)) | 1090 | 260 |
| repr(df.apply("cumsum", axis=0)) | 1550 | 10740 |
On a dataframe of dimensions (2 ** 22, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 49 | 230 |
| repr(df.apply(sum, axis=0)) | 4370 | 715 |
| repr(df.apply("cumsum", axis=0)) | 6060 | 44000 |
On a dataframe of dimensions (2 ** 23, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 108 | 243 |
| repr(df.apply(sum, axis=0)) | 8747 | 1400 |
| repr(df.apply("cumsum", axis=0)) | 2430 | 87000 |
Here are the results running the benchmark from a file:
On a dataframe of dimensions (2 ** 20, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 53 | 207 |
| repr(df.apply(sum, axis=0)) | 1098 | 258 |
| repr(df.apply("cumsum", axis=0)) | 255 | 180 |
On a dataframe of dimensions (2 ** 22, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 217 | 525 |
| repr(df.apply(sum, axis=0)) | 728 | 724 |
| repr(df.apply("cumsum", axis=0)) | 408 | 366 |
On a dataframe of dimensions (2 ** 23, 16) I get:
| Operation | time (ms) for min size 32 | time (ms) for min size 1 |
|---|---|---|
| repr(df.count()) | 223 | 227 |
| repr(df.apply(sum, axis=0)) | 1409 | 1407 |
| repr(df.apply("cumsum", axis=0)) | 1941 | 3718 |
Here is the benchmark code:
import time
import warnings

import modin.pandas as pd
import numpy as np
from modin.config import envvars

warnings.filterwarnings('ignore', module='modin')


def print_runtimes(f, title):
    # Run f four times and report all timings plus the average of the warm runs.
    times = []
    for _ in range(4):
        start = time.time()
        f()
        end = time.time()
        times.append(end - start)
    print(f'{title}: {times} with all except first averaging '
          f'{sum(times[1:]) / (len(times) - 1)}')


for size in ((2 ** 20, 16), (2 ** 22, 16), (2 ** 23, 16)):
    # repr forces materialization, so the timings include the distributed work.
    counter = lambda: repr(df.count())
    summer = lambda: repr(df.apply(sum, axis=0))
    cumsummer = lambda: repr(df.apply("cumsum", axis=0))

    # Restore the default (32) so every size starts from the same baseline.
    envvars.MinPartitionSize.put(32)
    df = pd.DataFrame(np.random.randint(100, size=size))
    print_runtimes(counter, f'partition size 32 and shape {size}: count')
    print_runtimes(summer, f'partition size 32 and shape {size}: sum')
    print_runtimes(cumsummer, f'partition size 32 and shape {size}: cumsum')
    print('--------------\n')

    envvars.MinPartitionSize.put(1)
    df = pd.DataFrame(np.random.randint(100, size=size))
    print_runtimes(counter, f'partition size 1 and shape {size}: count')
    print_runtimes(summer, f'partition size 1 and shape {size}: sum')
    print_runtimes(cumsummer, f'partition size 1 and shape {size}: cumsum')
    print('---------------\n' * 5)
> Does anything change if you start the benchmark from a file?

In IPython, at all sizes there's a large win for applying sum, a huge loss for applying cumsum, and a loss for count(). When running from a file, there appear to be some wins for the apply functions on the two smaller datasets, but not on the largest dataset.
> @devin-petersohn I think it might be better to call repr to make sure the async operations complete.

I'm not sure what overhead comes from repr. It would probably be better to enable BenchmarkMode:
from modin.config import BenchmarkMode
BenchmarkMode.put(True)
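As a rough sketch (assuming the same dataframe as in the benchmark above), with BenchmarkMode enabled every Modin call blocks until its distributed work finishes, so plain wall-clock timing works and no repr is needed:

import time
import modin.pandas as pd
import numpy as np
from modin.config import BenchmarkMode

BenchmarkMode.put(True)   # calls now wait for execution to finish before returning

df = pd.DataFrame(np.random.randint(100, size=(2 ** 20, 16)))
start = time.time()
df.count()                # the measured time includes the actual distributed work
print('count:', time.time() - start)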
@devin-petersohn, what do you think about having separate configuration settings MinPartitionLen and MinPartitionWidth?
@YarShev it makes sense. We don't want single-row partitions, but single-column partitions might be okay.
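For illustration only, a rough sketch of what the two settings could look like, modeled on the existing MinPartitionSize entry in modin/config/envvars.py; the names, environment variable strings, and defaults below are hypothetical, not an agreed design:

# Hypothetical config entries -- not part of Modin; names and defaults are illustrative.
from modin.config.envvars import EnvironmentVariable


class MinPartitionLen(EnvironmentVariable, type=int):
    """Minimum number of rows in a partition along the row axis (hypothetical)."""

    varname = "MODIN_MIN_PARTITION_LEN"
    default = 32  # keep today's behavior for rows: avoid single-row partitions


class MinPartitionWidth(EnvironmentVariable, type=int):
    """Minimum number of columns in a partition along the column axis (hypothetical)."""

    varname = "MODIN_MIN_PARTITION_WIDTH"
    default = 1  # allow single-column partitions for narrow dataframes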
@YarShev
> I'm not sure what overhead comes from repr. It would probably be better to enable BenchmarkMode.

When I enable BenchmarkMode, I don't see any significant difference for the benchmark I linked, with Ray on my Mac. I don't see one with either Ray or Dask on a Google Compute Engine Linux VM with these specs:
- Operating system: Ubuntu 20.04.3 LTS
- 8 vCPU
- 32 GB RAM
I think we should try another benchmark.
@mvashishtha, could you play with MinPartitionLen and MinPartitionWidth settings?