`merge` operation is significantly slower than stock pandas
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): all
- Modin version (`modin.__version__`): 3f00e24b041913951e7fec8a13dff280adabb6ee
- Python version: 3.8.11
- Code we can use to reproduce:
Benchmark code
```python
import time

import numpy as np
import pandas as pd

import modin.pandas as mpd
import modin.config as cfg
import ray

cfg.BenchmarkMode.put(True)  # wait for all remote tasks, so Modin timings cover full execution
ray.init()


def generate_data(nrows=5000, n_random_cols=100):
    data1 = {
        "ID": np.random.choice(nrows, nrows, replace=False),
        "Float": np.random.rand(nrows),
    }
    data_random1 = {f"col{i}": np.random.rand(nrows) for i in range(n_random_cols)}
    data1.update(data_random1)

    data2 = {
        "ID": np.random.choice(nrows, nrows, replace=False),
        "Float_2": np.random.rand(nrows),
    }
    data_random2 = {f"COL{i}": np.random.rand(nrows) for i in range(n_random_cols)}
    data2.update(data_random2)

    df1 = mpd.DataFrame(data1)
    df2 = mpd.DataFrame(data2)
    return df1, df2


if __name__ == "__main__":
    mdf1, mdf2 = generate_data(nrows=1_000_000, n_random_cols=998)
    # Materialize plain pandas copies of the same data for the baseline measurement.
    pdf1, pdf2 = mdf1._to_pandas(), mdf2._to_pandas()
    print(f"Original shape: {pdf1.shape}")

    t = time.time()
    pdf1 = pdf1.merge(pdf2, on="ID", how="left")
    print(f"pandas merge time: {time.time() - t} s")
    print(pdf1.shape)

    t = time.time()
    mdf1 = mdf1.merge(mdf2, on="ID", how="left")
    print(f"modin merge time: {time.time() - t} s")
    print(mdf1.shape)
```
Describe the problem
The `merge` operation is significantly slower than in stock pandas. The benchmark uses a "left" join, which is executed in a distributed way, but the implementation looks inefficient.
The results for the Ray execution engine are as follows (112 Ray workers are used; all times are in seconds; one possible way to configure that worker count is sketched after the tables):
- Increasing the number of rows with the number of columns fixed:
| Shape(rows, cols) | (1k, 32) | (10k, 32) | (100k, 32) | (1m, 32) | (10m, 32) | (100m, 32) |
|---|---|---|---|---|---|---|
| pandas | 0.0049 | 0.0089 | 0.0666 | 0.9076 | 13.76358 | 183.1597 |
| modin | 1.4042 | 1.1947 | 1.2528 | 4.4858 | 115.10042 | KILLED |
- Increasing the number of columns with the number of rows fixed:
| Shape(rows, cols) | (1m, 32) | (1m, 100) | (1m, 1k) |
|---|---|---|---|
| pandas | 0.9076 | 3.2592 | 35.5945 |
| modin | 1.2528 | 34.6418 | 949.0544 |
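For context, here is a minimal sketch of one way the Ray worker count used by Modin could be pinned for the 112-worker runs above; this is an assumption about the setup and is not part of the original benchmark.

```python
import ray
import modin.config as cfg

# Assumed configuration for the 112-worker runs (hypothetical, not from the original report):
# start Ray with a fixed CPU count and tell Modin how many cores it may partition across.
ray.init(num_cpus=112)
cfg.CpuCount.put(112)
```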
Connected issue for other join types: https://github.com/modin-project/modin/issues/656
Related: #3588.
I believe this is something @RehanSD should be working on soon™.
@anmyachev did some work on this too, so he and @RehanSD should contact each other on the matter. Also, this blog post might be useful for the issue: https://medium.com/analytics-vidhya/4-performance-improving-techniques-to-make-spark-joins-10x-faster-2ec8859138b4
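For illustration, below is a minimal sketch of the broadcast-join idea from that post, written with plain pandas and Ray rather than Modin's internals; every name here (`broadcast_left_merge`, `merge_partition`, `split_frame`) is hypothetical and not part of Modin's API. The small right frame is put into the object store once, and each chunk of the large left frame is merged against it locally, which avoids a full shuffle.

```python
import numpy as np
import pandas as pd
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
def merge_partition(left_part, right):
    # Each task merges one chunk of the big frame against the broadcast small frame.
    return left_part.merge(right, on="ID", how="left")


def split_frame(df, n_parts):
    # Split a frame into roughly equal row chunks.
    step = -(-len(df) // n_parts)  # ceiling division
    return [df.iloc[i:i + step] for i in range(0, len(df), step)]


def broadcast_left_merge(left, right, n_parts=8):
    # Broadcast the small frame once so every task reuses the same copy.
    right_ref = ray.put(right)
    futures = [merge_partition.remote(part, right_ref) for part in split_frame(left, n_parts)]
    return pd.concat(ray.get(futures), ignore_index=True)


if __name__ == "__main__":
    big = pd.DataFrame({"ID": np.arange(1_000_000), "x": np.random.rand(1_000_000)})
    small = pd.DataFrame({"ID": np.arange(1_000), "y": np.random.rand(1_000)})
    print(broadcast_left_merge(big, small).shape)  # expected: (1000000, 3)
```

Whether this helps depends on the right frame fitting comfortably in each worker's memory, which is the same trade-off the post describes for Spark broadcast joins; in the benchmark above both frames are large, so a smarter distributed shuffle would still be needed.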
@arunjose696, can you revisit this?
A lot of improvements to `merge` have been introduced recently. See https://github.com/modin-project/modin/pull/6966 for the measurements. Please file a new issue if you see poor performance in a specific `merge` case.