Arithmetic operations painfully slower compared to pandas
System information
- Windows 10, i7-7700HQ, 16 GB RAM
Describe the problem
As you can see below, Modin performs the addition of two series of 100 million numbers each about 5 times slower than vanilla pandas. Is this expected behavior?
Source code
import modin.pandas as mpd
import pandas as pd
import numpy as np
from time import time
import modin
import sys
if __name__ == '__main__':
    from distributed.client import Client
    client = Client()

    n = 100_000_000
    a = np.random.random(n)
    b = np.random.random(n)

    mdf = mpd.DataFrame({"a": a, "b": b})
    df = pd.DataFrame({"a": a, "b": b})

    print(sys.version)

    t = time()
    mdf["a"] + mdf["b"]
    print("modin", modin.__version__, time() - t)

    t = time()
    df["a"] + df["b"]
    print("pandas", time() - t)
Output
3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)]
modin 0.9.1 2.210862398147583
pandas 0.40199947357177734
Thanks @danpetruk for the report! This is not expected, but I can reproduce it.
When I run your code I get a similar difference. I know this is not expected because when I run mdf + mdf I get a 280 ms runtime, whereas I was getting ~1.1 s for mdf["a"] + mdf["b"]. mdf + mdf should be more expensive (at least slightly), so I suspect this is an issue with metadata for series binary operations. We will look into this, thanks for reporting!
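For anyone who wants to reproduce that comparison, here is a minimal self-contained sketch (with a smaller n than the original reproducer; note that, as discussed further down in this thread, such wall-clock timings can underestimate the real cost unless benchmark mode is enabled, since Modin may execute asynchronously):

import modin.pandas as mpd
import numpy as np
from time import time

n = 10_000_000  # smaller than the original 100M, enough to see the gap
mdf = mpd.DataFrame({"a": np.random.random(n), "b": np.random.random(n)})

t = time()
mdf + mdf                # whole-frame binary operation
print("mdf + mdf:", time() - t)

t = time()
mdf["a"] + mdf["b"]      # Series + Series binary operation
print('mdf["a"] + mdf["b"]:', time() - t)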
To perform a binary operation, partitions of the second operand are broadcast to partitions of the first operand; to be able to do that broadcasting, the partitioning of both frames has to be identical. The PandasFrame._copartition method is responsible for aligning the partitioning; it is called on the operands in PandasFrame.binary_op before the actual binary operation.
It turned out that the bottleneck in the case above is the _copartition method, and specifically this particular line:
https://github.com/modin-project/modin/blob/a3ddf2f01163a312416d2a8bc456ba9582ae9b4d/modin/engines/base/frame/data.py#L1911-L1912
The get_axis_lengths function retrieves the partition shapes of the passed frame in order to check whether the partitioning of both operands is identical.
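In other words, _copartition needs the per-partition row counts of both operands before it can broadcast. A minimal sketch of that check (plain Python lists of NumPy blocks standing in for Modin's partitions; the function names mirror the ones above but this is not Modin's actual implementation):

import numpy as np

# Two operands, each split into row partitions (NumPy blocks here).
left = [np.random.random(5), np.random.random(3)]
right_aligned = [np.random.random(5), np.random.random(3)]
right_misaligned = [np.random.random(4), np.random.random(4)]

def get_axis_lengths(partitions):
    # Analogue of collecting per-partition shapes before broadcasting.
    return [len(p) for p in partitions]

def can_broadcast(a, b):
    # Partition-to-partition broadcasting is only valid when the
    # partitioning of both operands is identical.
    return get_axis_lengths(a) == get_axis_lengths(b)

print(can_broadcast(left, right_aligned))     # True  -> broadcast directly
print(can_broadcast(left, right_misaligned))  # False -> frames must be repartitioned first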
Running the code above and measuring the evaluation time of get_axis_lengths gave the following results:
get_axis_lengths() took: 0.00013439892791211605
df + df operation took: 1.118077146122232
get_axis_lengths() took: 1.9074202890042216
df['a'] + df['b'] operation took: 2.905869127018377
The reason that retrieving shapes takes so much longer in the second case is that the cached values of shapes are missing:
df + df operation:
-> base_lengths = get_axis_lengths(reindexed_base, axis)
(Pdb) p reindexed_base[0][0]._length_cache
892858
df['a'] + df['b'] operation:
-> base_lengths = get_axis_lengths(reindexed_base, axis)
(Pdb) p reindexed_base[0][0]._length_cache
None
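Here is a toy sketch of the length-cache pattern at play (the LazyPartition class is hypothetical and the sleep only stands in for waiting on the execution engine; it just illustrates why a dropped _length_cache turns a metadata lookup into a blocking wait):

import time
import numpy as np

class LazyPartition:
    """Toy partition whose block is 'remote'; not Modin's actual class."""
    def __init__(self, block, length=None):
        self._block = block
        self._length_cache = length  # may be carried over from the parent partition

    def length(self):
        if self._length_cache is None:
            # Cache miss: we have to wait for the remote block to be materialized.
            time.sleep(0.5)  # stand-in for blocking on the object store
            self._length_cache = len(self._block)
        return self._length_cache

block = np.random.random(1_000)
cached = LazyPartition(block, length=len(block))  # cache propagated (the df + df case)
uncached = LazyPartition(block)                   # cache dropped (the df['a'] + df['b'] case)

for name, part in [("cached", cached), ("uncached", uncached)]:
    t = time.time()
    part.length()
    print(name, time.time() - t)

With the cache propagated, length() returns instantly; with it dropped, every call like the one in get_axis_lengths pays that wait per partition.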
The missing cache does not correlate with the type of binary operation (frame + frame or series + series); it comes from preprocessing operations that don't handle the cache accurately. For now, I've found two of them in the flow above and created separate issues:
- Masking the original frame: df["a"] -> QueryCompiler.getitem_column_array(["a"]) -> PandasFrame.mask(col=["a"]) -> PandasOnRayFramePartition.mask(row=slice(None), col=["a"]) -> #3110
- Setting new axis labels: s1 + s2 -> Series.add(s1, s2) -> Series._prepare_inter_op(s1, s2) -> Series._set_name("__reduced__") -> PandasFrame._set_columns() -> #3111
It also turned out that binary operations in the partition manager are performed like full-axis functions, which can also be a slow-down factor. @devin-petersohn are there any reasons why we perform even element-wise binary operations as full-axis functions? Should this be a map-like function instead?
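For context, here is a toy illustration of the difference (plain NumPy, not Modin internals): a map-like binary operation combines corresponding partitions independently and keeps the result partitioned, while a full-axis operation first stitches the partitions together along the axis, which serializes the work and forces a re-split afterwards.

import numpy as np

left = [np.random.random(4) for _ in range(3)]   # row partitions of the first operand
right = [np.random.random(4) for _ in range(3)]  # co-partitioned second operand

def map_binary_op(a_parts, b_parts, op):
    # Map-like: each pair of partitions is combined independently,
    # so the result stays partitioned and the work can run in parallel.
    return [op(a, b) for a, b in zip(a_parts, b_parts)]

def full_axis_binary_op(a_parts, b_parts, op):
    # Full-axis: concatenate along the axis first, apply the op to the
    # whole column, then re-split the result into partitions.
    full = op(np.concatenate(a_parts), np.concatenate(b_parts))
    return np.array_split(full, len(a_parts))

mapped = map_binary_op(left, right, np.add)
full_axis = full_axis_binary_op(left, right, np.add)
assert all(np.allclose(m, f) for m, f in zip(mapped, full_axis))

In the map-like version each pair of partitions could be dispatched as an independent remote task, which is why running element-wise binary operations as full-axis functions can cost extra.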
Need to check after #4391
I see a ~3x speed-up compared to pandas on this test for the current Modin master, so I think this can be closed.
Yes, this should be closed.
@Garra1980 what system did you test on? Did you enable benchmark mode? If you don't enable benchmark mode the Modin binary operation should happen async.
I would prefer not to close this issue now, because the reproducer from the description uses a Series + Series type of operation (this type of binary operation wasn't touched by #4391). We probably need to find the PR which fixed this issue; it was probably one of the PRs related to mask improvements.
If someone feels the issue should be kept open, do not hesitate to reopen it.
> @Garra1980 what system did you test on? Did you enable benchmark mode? If you don't enable benchmark mode the Modin binary operation should happen async.

This is a regular Python script, what's with benchmark mode?
@Garra1980

> This is a regular Python script, what's with benchmark mode?
We should turn on Modin's benchmark mode setting for most performance comparisons. Otherwise, most operations, including some binary operations, will happen asynchronously and we'll underestimate the time they take.
The binary operation in the following script takes (.887, .855) sec using Modin on Dask on my Mac when I turn benchmark mode on. When I keep benchmark mode off, I get (.331, .356) sec. For most operations the difference is actually much larger. I expected this script to be almost all async, but something about it is not.
import modin.pandas as pd
import numpy as np
from time import time
from modin.config import BenchmarkMode

random_state = np.random.RandomState(seed=42)
array = random_state.rand(2**22, 35)

# Force synchronous execution so the timing reflects the full cost of the operation.
BenchmarkMode.put(True)

df1 = pd.DataFrame(array)
start = time()
df1 = df1 - 1
end = time()
print(f"subtraction time: {end-start}")
My system:
- MacBook Pro (16-inch, 2019)
- macOS Monterey 12.4
- 2.3 GHz 8-core intel core i9
- Memory: 16 GB 2667 MHz DDR4
I get your point; I was just under the impression that, since Modin was slow, there was no async execution going on in the script.
Anyway, we still seem to have sped up here, since I see 1.35s on master compared to 2.14s on 0.14 with benchmark mode on.
@Garra1980 how does Modin compare to pandas for your setup, though?
pandas is ~0.5s
#4689 improved performance for Series + Series (a couple of days before the last bunch of comments).
When running the example from the issue description I see the following timings.
modin 0.27.0+8.g4704751c4 0.04345369338989258
pandas 0.05070161819458008
I think we can close this as resolved, but if you see a slowdown on your side, feel free to reopen the issue or open a new one with a new description.