
Arithmetic operations painfully slower compared to pandas

danpetruk opened this issue 3 years ago • 14 comments

System information

  • Windows 10, i7-7700HQ, 16 GB RAM

Describe the problem

As you can see below, Modin performs the addition of two series of 100 million numbers each about 5 times slower than vanilla pandas. Is this expected behavior?

Source code

import modin.pandas as mpd
import pandas as pd
import numpy as np
from time import time
import modin
import sys
if __name__ == '__main__':
    # Start a Dask distributed client so Modin runs on the Dask engine
    from distributed.client import Client
    client = Client()
    n = 100_000_000
    a = np.random.random(n)
    b = np.random.random(n)
    mdf = mpd.DataFrame({
        "a": a,
        "b": b,
    })
    df = pd.DataFrame({
        "a": a,
        "b": b,
    })

    print(sys.version)

    t = time()
    mdf["a"] + mdf["b"]
    print("modin",modin.__version__,time()-t)


    t = time()
    df["a"] + df["b"]
    print("pandas",time()-t)

Output

3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)]
modin 0.9.1 2.210862398147583
pandas 0.40199947357177734

danpetruk avatar May 29 '21 23:05 danpetruk

Thanks @danpetruk for the report! This is not expected, but I can reproduce it.

When I run your code I get a similar difference. I know this is not expected because when I run mdf + mdf I get a 280ms runtime, whereas I was getting ~1.1s for the mdf["a"] + mdf["b"]. mdf + mdf should be more expensive (at least slightly), so I suspect this is an issue with metadata for series binary operations. We will look into this, thanks for reporting!

devin-petersohn avatar May 31 '21 17:05 devin-petersohn

To perform a binary operation, partitions of the second operand are broadcast to the partitions of the first operand. For broadcasting to be possible, the partitioning of both frames has to be identical. The PandasFrame._copartition method is responsible for aligning the partitioning; it is called on the operands in PandasFrame.binary_op before the actual binary operation.

It turns out that the bottleneck in the case above is the _copartition method, and specifically this line: https://github.com/modin-project/modin/blob/a3ddf2f01163a312416d2a8bc456ba9582ae9b4d/modin/engines/base/frame/data.py#L1911-L1912 The get_axis_lengths function retrieves the partition shapes of the passed frame in order to check whether the partitioning of both operands is identical.
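To illustrate, here is a rough sketch (not Modin's actual implementation; the Partition class and function names are hypothetical) of the shape check that _copartition performs: broadcasting partition-to-partition is only valid when both operands are split identically along the axis.

```python
# Toy partition with a known shape; stands in for a Modin remote partition.
class Partition:
    def __init__(self, nrows, ncols):
        self._nrows, self._ncols = nrows, ncols

    def length(self):
        return self._nrows

    def width(self):
        return self._ncols


def get_axis_lengths(partitions, axis):
    """Per-partition sizes along `axis` for a 2D grid of partitions."""
    if axis == 0:
        return [row[0].length() for row in partitions]  # row heights
    return [part.width() for part in partitions[0]]     # column widths


def need_repartition(left, right, axis):
    # If per-partition lengths differ, the frames must be realigned
    # before their partitions can be broadcast pairwise.
    return get_axis_lengths(left, axis) != get_axis_lengths(right, axis)


# Two frames with 100 rows each, but split at different boundaries:
left = [[Partition(50, 2)], [Partition(50, 2)]]
right = [[Partition(60, 2)], [Partition(40, 2)]]
```

In real Modin, `length()`/`width()` may require a blocking call into the execution engine when the shape is not cached, which is exactly what makes this check expensive in the slow case below.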

Running the code above and measuring the evaluation time of get_axis_lengths gave the following results:

get_axis_lengths() took: 0.00013439892791211605
df + df operation took: 1.118077146122232

get_axis_lengths() took: 1.9074202890042216
df['a'] + df['b'] operation took: 2.905869127018377

The reason retrieving shapes takes so much longer in the second case is that the cached shape values are missing:

df + df operation:
-> base_lengths = get_axis_lengths(reindexed_base, axis)
(Pdb) p reindexed_base[0][0]._length_cache
892858

df['a'] + df['b'] operation:
-> base_lengths = get_axis_lengths(reindexed_base, axis)
(Pdb) p reindexed_base[0][0]._length_cache
None

The missing cache does not correlate with the type of binary operation (frame + frame vs. series + series); it is caused by preprocessing operations that don't handle the cache accurately.

For now, I've found two of them in the above flow and created separate issues:

  1. Masking the original frame: df["a"] -> QueryCompiler.getitem_column_array(["a"]) -> PandasFrame.mask(col=["a"]) -> PandasOnRayFramePartition.mask(row=slice(None), col=["a"]) -> #3110
  2. Setting new axis labels: s1 + s2 -> Series.add(s1, s2) -> Series._prepare_inter_op(s1, s2) -> Series._set_name("__reduced__") -> PandasFrame._set_columns() -> #3111
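The caching pattern at play can be sketched as follows. This is a hypothetical toy, not Modin's code: each partition lazily computes its length once and caches it, so a preprocessing step (such as mask) that fails to propagate the cache forces an expensive recomputation, which in Modin means a blocking remote call, the next time shapes are needed.

```python
class Partition:
    def __init__(self, data, length_cache=None):
        self._data = data
        self._length_cache = length_cache

    def length(self):
        if self._length_cache is None:
            # In Modin this would be a blocking ray.get / Dask compute call.
            self._length_cache = len(self._data)
        return self._length_cache

    def mask(self, row_indexer):
        if row_indexer == slice(None):
            # A full slice keeps every row, so the cache can be carried over.
            return Partition(self._data, self._length_cache)
        # Generic indexer: the new length is unknown until computed.
        return Partition(self._data[row_indexer])


p = Partition(list(range(100)))
p.length()               # computes and caches 100
view = p.mask(slice(None))   # cache propagated, no recomputation needed
```

The fixes referenced above amount to making steps like `mask(row=slice(None), ...)` and `_set_columns` preserve the cache instead of discarding it.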

dchigarev avatar Jun 01 '21 12:06 dchigarev

I also found that binary operations in the partition manager are performed as full-axis functions, which can also be a slow-down factor. @devin-petersohn, are there any reasons why we perform even element-wise binary operations as full-axis? Should they be map-like functions?
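An illustrative sketch of the distinction (the function names are hypothetical, not Modin's API): an element-wise operation can be applied independently per partition pair ("map"), while a full-axis operation first stitches each axis back together, serializing work and copying data.

```python
def map_binary(left_parts, right_parts, op):
    # Each partition pair is processed independently -> fully parallel,
    # no concatenation needed.
    return [op(l, r) for l, r in zip(left_parts, right_parts)]


def full_axis_binary(left_parts, right_parts, op):
    # Rebuild the whole axis, then apply the op once over it: correct,
    # but forfeits per-partition parallelism and copies data.
    left = [x for part in left_parts for x in part]
    right = [x for part in right_parts for x in part]
    return op(left, right)


add = lambda l, r: [a + b for a, b in zip(l, r)]

partitioned = map_binary([[1, 2], [3]], [[10, 20], [30]], add)
flattened = full_axis_binary([[1, 2], [3]], [[10, 20], [30]], add)
```

For a purely element-wise op like `+`, both produce the same values, but only the map form preserves the partitioning and avoids the gather step.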

dchigarev avatar Jun 01 '21 13:06 dchigarev

Need to check after #4391

prutskov avatar Jun 30 '22 14:06 prutskov

I see a ~3x speed-up compared to pandas on this test for current Modin master, so I think this can be closed

Garra1980 avatar Jul 28 '22 14:07 Garra1980

Yes, this should be closed.

YarShev avatar Jul 28 '22 14:07 YarShev

@Garra1980 what system did you test on? Did you enable benchmark mode? If you don't enable benchmark mode the Modin binary operation should happen async.

mvashishtha avatar Jul 28 '22 14:07 mvashishtha

I would prefer not to close this issue now, because the reproducer from the description uses a Series + Series operation (this type of binary operation wasn't touched by #4391). We probably need to find the PR that fixed this issue; it was likely one of the PRs related to mask improvements.

prutskov avatar Jul 28 '22 14:07 prutskov

If someone feels the issue should be kept open, do not hesitate to reopen it.

YarShev avatar Jul 28 '22 14:07 YarShev

@Garra1980 what system did you test on? Did you enable benchmark mode? If you don't enable benchmark mode the Modin binary operation should happen async.

This is a regular Python script, what's benchmark mode?

Garra1980 avatar Jul 28 '22 15:07 Garra1980

@Garra1980

This is a regular Python script, what's benchmark mode?

We should turn on Modin's benchmark mode setting for most performance comparisons. Otherwise, most operations, including some binary operations, happen asynchronously, and we'll underestimate the time they take.
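The timing pitfall can be demonstrated without Modin at all. In this toy sketch, a thread pool stands in for the Ray/Dask engine: submitting work returns a future immediately, so naive wall-clock timing misses almost all of the actual computation unless you block on the result (which is what benchmark mode does).

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def slow_square(xs):
    time.sleep(0.2)  # stand-in for real computation on a worker
    return [x * x for x in xs]

xs = list(range(10))

# Async-style timing: submit() returns a future immediately, so the
# measured time misses almost all of the work.
t = time.time()
future = executor.submit(slow_square, xs)
async_elapsed = time.time() - t
future.result()  # drain the queue before the next measurement

# Benchmark-mode-style timing: block until the result materializes.
t = time.time()
result = executor.submit(slow_square, xs).result()
sync_elapsed = time.time() - t
```

Here `async_elapsed` is near zero while `sync_elapsed` reflects the full 0.2 s of work, which is why unsynchronized timings of a lazy engine look misleadingly fast.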

The binary operation in the following script takes (.887, .855) sec using Modin on Dask on my Mac when I turn benchmark mode on. When I keep benchmark mode off, I get (.331, .356) sec. For most operations the difference is actually much larger. I expected this script to be almost all async, but something about it is not.

import modin.pandas as pd
import numpy as np
from time import time
from modin.config import BenchmarkMode

random_state = np.random.RandomState(seed=42)
array = random_state.rand(2**22, 35)
BenchmarkMode.put(True)  # block until each operation finishes so timings are accurate


df1 = pd.DataFrame(array)

start = time()
df1 = df1 - 1
end = time()
print(f"subtraction time: {end-start}")

My system:

  • MacBook Pro (16-inch, 2019)
  • macOS Monterey 12.4
  • 2.3 GHz 8-core intel core i9
  • Memory: 16 GB 2667 MHz DDR4

mvashishtha avatar Jul 28 '22 15:07 mvashishtha

I got your point; I was just under the impression that Modin was slow, so there was no async execution in the script

Anyway, we still seem to have sped up here, since I see 1.35s on master compared to 2.14s on 0.14 in benchmark mode

Garra1980 avatar Jul 28 '22 16:07 Garra1980

@Garra1980 how does Modin compare to pandas for your setup, though?

mvashishtha avatar Jul 28 '22 16:07 mvashishtha

pandas is ~0.5s

Garra1980 avatar Jul 28 '22 17:07 Garra1980

#4689 improved performance for Series + Series (a couple of days before the last bunch of comments)

jbrockmendel avatar Oct 18 '22 16:10 jbrockmendel

When running the example from the issue description I see the following timings.

modin 0.27.0+8.g4704751c4 0.04345369338989258
pandas 0.05070161819458008

I think we can close this as resolved, but if you see a slowdown on your side, feel free to reopen the issue or open a new one with a new description.

YarShev avatar Feb 20 '24 13:02 YarShev