
Arithmetic operations painfully slower compared to pandas

danpetruk opened this issue 3 years ago • 14 comments

System information

  • Windows 10, i7-7700HQ, 16 GB RAM

Describe the problem

As you can see below, Modin performs the addition of two series of 100 million numbers each about 5 times slower than vanilla pandas. Is this expected behavior?

Source code

import modin.pandas as mpd
import pandas as pd
import numpy as np
from time import time
import modin
import sys
if __name__ == '__main__':
    # Start a Dask distributed client so Modin runs on the Dask engine
    from distributed.client import Client
    client = Client()
    n = 100_000_000
    a = np.random.random(n)
    b = np.random.random(n)
    mdf = mpd.DataFrame({
        "a": a,
        "b": b,
    })
    df = pd.DataFrame({
        "a": a,
        "b": b,
    })

    print(sys.version)

    t = time()
    mdf["a"] + mdf["b"]
    print("modin",modin.__version__,time()-t)


    t = time()
    df["a"] + df["b"]
    print("pandas",time()-t)

Output

3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)]
modin 0.9.1 2.210862398147583
pandas 0.40199947357177734

danpetruk avatar May 29 '21 23:05 danpetruk

Thanks @danpetruk for the report! This is not expected, but I can reproduce it.

When I run your code I get a similar difference. I know this is not expected because when I run mdf + mdf I get a 280ms runtime, whereas I was getting ~1.1s for the mdf["a"] + mdf["b"]. mdf + mdf should be more expensive (at least slightly), so I suspect this is an issue with metadata for series binary operations. We will look into this, thanks for reporting!

devin-petersohn avatar May 31 '21 17:05 devin-petersohn

To perform a binary operation, partitions of the second operand are broadcast to the partitions of the first operand. For broadcasting to be possible, the partitioning of both frames has to be identical. The PandasFrame._copartition method is responsible for aligning the partitioning; it is called on the operands in PandasFrame.binary_op before the actual binary operation.

It turns out that the bottleneck in the case above is the _copartition method, and specifically this line: https://github.com/modin-project/modin/blob/a3ddf2f01163a312416d2a8bc456ba9582ae9b4d/modin/engines/base/frame/data.py#L1911-L1912 The get_axis_lengths function retrieves the partition shapes of the passed frame in order to check whether the partitioning of both operands is identical.
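To illustrate, here is a rough sketch (not Modin's actual implementation; the Partition class and function names are hypothetical) of the shape check that _copartition performs: broadcasting partition-to-partition is only valid when both operands are split identically along the axis.

```python
# Toy partition with a known shape; stands in for a Modin remote partition.
class Partition:
    def __init__(self, nrows, ncols):
        self._nrows, self._ncols = nrows, ncols

    def length(self):
        return self._nrows

    def width(self):
        return self._ncols


def get_axis_lengths(partitions, axis):
    """Per-partition sizes along `axis` for a 2D grid of partitions."""
    if axis == 0:
        return [row[0].length() for row in partitions]  # row heights
    return [part.width() for part in partitions[0]]     # column widths


def need_repartition(left, right, axis):
    # If per-partition lengths differ, the frames must be realigned
    # before their partitions can be broadcast pairwise.
    return get_axis_lengths(left, axis) != get_axis_lengths(right, axis)


# Two frames with 100 rows each, but split at different boundaries:
left = [[Partition(50, 2)], [Partition(50, 2)]]
right = [[Partition(60, 2)], [Partition(40, 2)]]
```

In real Modin, `length()`/`width()` may require a blocking call into the execution engine when the shape is not cached, which is exactly what makes this check expensive in the slow case below.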

Running the code above and measuring the evaluation time of get_axis_lengths gave the following results:

get_axis_lengths() took: 0.00013439892791211605
df + df operation took: 1.118077146122232

get_axis_lengths() took: 1.9074202890042216
df['a'] + df['b'] operation took: 2.905869127018377

The reason retrieving shapes takes so much longer in the second case is that the cached shape values are missing:

df + df operation:
-> base_lengths = get_axis_lengths(reindexed_base, axis)
(Pdb) p reindexed_base[0][0]._length_cache
892858

df['a'] + df['b'] operation:
-> base_lengths = get_axis_lengths(reindexed_base, axis)
(Pdb) p reindexed_base[0][0]._length_cache
None

The missing cache does not correlate with the type of binary operation (frame + frame vs. series + series); it is caused by preprocessing operations that don't handle the cache accurately.

For now, I've found two of them in the above flow and created separate issues:

  1. Masking the original frame: df["a"] -> QueryCompiler.getitem_column_array(["a"]) -> PandasFrame.mask(col=["a"]) -> PandasOnRayFramePartition.mask(row=slice(None), col=["a"]) -> #3110
  2. Setting new axis labels: s1 + s2 -> Series.add(s1, s2) -> Series._prepare_inter_op(s1, s2) -> Series._set_name("__reduced__") -> PandasFrame._set_columns() -> #3111
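The caching pattern at play can be sketched as follows. This is a hypothetical toy, not Modin's code: each partition lazily computes its length once and caches it, so a preprocessing step (such as mask) that fails to propagate the cache forces an expensive recomputation, which in Modin means a blocking remote call, the next time shapes are needed.

```python
class Partition:
    def __init__(self, data, length_cache=None):
        self._data = data
        self._length_cache = length_cache

    def length(self):
        if self._length_cache is None:
            # In Modin this would be a blocking ray.get / Dask compute call.
            self._length_cache = len(self._data)
        return self._length_cache

    def mask(self, row_indexer):
        if row_indexer == slice(None):
            # A full slice keeps every row, so the cache can be carried over.
            return Partition(self._data, self._length_cache)
        # Generic indexer: the new length is unknown until computed.
        return Partition(self._data[row_indexer])


p = Partition(list(range(100)))
p.length()               # computes and caches 100
view = p.mask(slice(None))   # cache propagated, no recomputation needed
```

The fixes referenced above amount to making steps like `mask(row=slice(None), ...)` and `_set_columns` preserve the cache instead of discarding it.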

dchigarev avatar Jun 01 '21 12:06 dchigarev

I also found that binary operations in the partition manager are performed as full-axis functions, which can also be a slow-down factor. @devin-petersohn, are there any reasons why we perform even element-wise binary operations as full-axis? Should they be map-like functions?
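An illustrative sketch of the distinction (the function names are hypothetical, not Modin's API): an element-wise operation can be applied independently per partition pair ("map"), while a full-axis operation first stitches each axis back together, serializing work and copying data.

```python
def map_binary(left_parts, right_parts, op):
    # Each partition pair is processed independently -> fully parallel,
    # no concatenation needed.
    return [op(l, r) for l, r in zip(left_parts, right_parts)]


def full_axis_binary(left_parts, right_parts, op):
    # Rebuild the whole axis, then apply the op once over it: correct,
    # but forfeits per-partition parallelism and copies data.
    left = [x for part in left_parts for x in part]
    right = [x for part in right_parts for x in part]
    return op(left, right)


add = lambda l, r: [a + b for a, b in zip(l, r)]

partitioned = map_binary([[1, 2], [3]], [[10, 20], [30]], add)
flattened = full_axis_binary([[1, 2], [3]], [[10, 20], [30]], add)
```

For a purely element-wise op like `+`, both produce the same values, but only the map form preserves the partitioning and avoids the gather step.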

dchigarev avatar Jun 01 '21 13:06 dchigarev

Need to check after #4391

prutskov avatar Jun 30 '22 14:06 prutskov

I see a ~3x speed-up compared to pandas on this test for current Modin master, so I think this can be closed

Garra1980 avatar Jul 28 '22 14:07 Garra1980

Yes, this should be closed.

YarShev avatar Jul 28 '22 14:07 YarShev

@Garra1980 what system did you test on? Did you enable benchmark mode? If you don't enable benchmark mode the Modin binary operation should happen async.

mvashishtha avatar Jul 28 '22 14:07 mvashishtha

I would prefer not to close this issue now, because the reproducer from the description uses a Series + Series operation (this type of binary operation wasn't touched by #4391). We probably need to find the PR that fixed this issue; it was likely one of the PRs related to mask improvements.

prutskov avatar Jul 28 '22 14:07 prutskov

If someone feels the issue should be kept open, do not hesitate to reopen it.

YarShev avatar Jul 28 '22 14:07 YarShev

@Garra1980 what system did you test on? Did you enable benchmark mode? If you don't enable benchmark mode the Modin binary operation should happen async.

This is a regular Python script, what's benchmark mode?

Garra1980 avatar Jul 28 '22 15:07 Garra1980

@Garra1980

This is a regular Python script, what's benchmark mode?

We should turn on Modin's benchmark mode setting for most performance comparisons. Otherwise, most operations, including some binary operations, happen asynchronously, and we'll underestimate the time they take.
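The timing pitfall can be demonstrated without Modin at all. In this toy sketch, a thread pool stands in for the Ray/Dask engine: submitting work returns a future immediately, so naive wall-clock timing misses almost all of the actual computation unless you block on the result (which is what benchmark mode does).

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def slow_square(xs):
    time.sleep(0.2)  # stand-in for real computation on a worker
    return [x * x for x in xs]

xs = list(range(10))

# Async-style timing: submit() returns a future immediately, so the
# measured time misses almost all of the work.
t = time.time()
future = executor.submit(slow_square, xs)
async_elapsed = time.time() - t
future.result()  # drain the queue before the next measurement

# Benchmark-mode-style timing: block until the result materializes.
t = time.time()
result = executor.submit(slow_square, xs).result()
sync_elapsed = time.time() - t
```

Here `async_elapsed` is near zero while `sync_elapsed` reflects the full 0.2 s of work, which is why unsynchronized timings of a lazy engine look misleadingly fast.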

The binary operation in the following script takes (.887, .855) sec using Modin on Dask on my Mac when I turn benchmark mode on. When I keep benchmark mode off, I get (.331, .356) sec. For most operations the difference is actually much larger. I expected this script to be almost all async, but something about it is not.

import modin.pandas as pd
import numpy as np
from time import time
from modin.config import BenchmarkMode

random_state = np.random.RandomState(seed=42)
array = random_state.rand(2**22, 35)
BenchmarkMode.put(True)  # block until each operation finishes so timings are accurate


df1 = pd.DataFrame(array)

start = time()
df1 = df1 - 1
end = time()
print(f"subtraction time: {end-start}")

My system:

  • MacBook Pro (16-inch, 2019)
  • macOS Monterey 12.4
  • 2.3 GHz 8-core intel core i9
  • Memory: 16 GB 2667 MHz DDR4

mvashishtha avatar Jul 28 '22 15:07 mvashishtha

I got your point; I was just under the impression that Modin was slow, so there was no async execution in the script

Anyway, we still seem to have sped up here, since I see 1.35s on master compared to 2.14s on 0.14 in benchmark mode

Garra1980 avatar Jul 28 '22 16:07 Garra1980

@Garra1980 how does Modin compare to pandas for your setup, though?

mvashishtha avatar Jul 28 '22 16:07 mvashishtha

pandas is ~0.5s

Garra1980 avatar Jul 28 '22 17:07 Garra1980

#4689 improved performance for Series + Series (a couple of days before the last bunch of comments)

jbrockmendel avatar Oct 18 '22 16:10 jbrockmendel

When running the example from the issue description I see the following timings.

modin 0.27.0+8.g4704751c4 0.04345369338989258
pandas 0.05070161819458008

I think we can close this as resolved, but if you see a slowdown on your side, feel free to reopen the issue or open a new one with a new description.

YarShev avatar Feb 20 '24 13:02 YarShev