modin `merge` operation is significantly slower than stock pandas

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): all
Modin version (modin.__version__): 3f00e24b041913951e7fec8a13dff280adabb6ee
Python version: 3.8.11
Code we can use to reproduce:

Benchmark code

import time

import numpy as np
import pandas as pd

import modin.pandas as mpd
import modin.config as cfg
import ray

cfg.BenchmarkMode.put(True)
ray.init()

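def generate_data(nrows=5000, n_random_cols=100):
    """Build two Modin frames that share a unique, shuffled "ID" key column."""
    # Left frame: "ID", "Float", and random columns col0..col{n-1}.
    data1 = {
        "ID": np.random.choice(nrows, nrows, replace=False),
        "Float": np.random.rand(nrows),
    }
    data_random1 = {f"col{i}": np.random.rand(nrows) for i in range(n_random_cols)}
    data1.update(data_random1)

    # Right frame: the same IDs in a different order, with upper-case column
    # names so that only "ID" overlaps between the two frames.
    data2 = {
        "ID": np.random.choice(nrows, nrows, replace=False),
        "Float_2": np.random.rand(nrows),
    }
    data_random2 = {f"COL{i}": np.random.rand(nrows) for i in range(n_random_cols)}
    data2.update(data_random2)

    df1 = mpd.DataFrame(data1)
    df2 = mpd.DataFrame(data2)

    return df1, df2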

if __name__ == "__main__":
    mdf1, mdf2 = generate_data(nrows=1_000_000, n_random_cols=998)
    pdf1, pdf2 = mdf1._to_pandas(), mdf2._to_pandas()

    print(f"Original shape: {pdf1.shape}")

    t = time.time()
    pdf1 = pdf1.merge(pdf2, on="ID", how="left")
    print(f"pandas merge time: {time.time() - t} s")
    print(pdf1.shape)

    t = time.time()
    mdf1 = mdf1.merge(mdf2, on="ID", how="left")
    print(f"modin merge time: {time.time() - t} s")
    print(mdf1.shape)

Describe the problem

The `merge` operation is significantly slower than in stock pandas. The benchmark uses a "left" join, which is implemented in a distributed way, but the implementation appears to be inefficient.
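For readers unfamiliar with what a distributed left join can look like, here is a minimal pandas-only sketch of one simple strategy: split the left frame into row blocks, merge each block against the whole right frame, and concatenate the results. This is an illustration under stated assumptions, not Modin's actual code; the helper name `partitioned_left_merge` and the `n_parts` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def partitioned_left_merge(left, right, on, n_parts=4):
    """Left-join `left` with `right` one row block at a time (illustrative only)."""
    # Split the left frame into contiguous row blocks.
    bounds = np.linspace(0, len(left), n_parts + 1, dtype=int)
    blocks = [left.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    # Every block is merged against the *whole* right frame, so a distributed
    # engine following this scheme would need the full right frame on every worker.
    merged = [block.merge(right, on=on, how="left") for block in blocks]
    # A left join keeps the left row order, so concatenating the blocks in
    # order yields the same rows as a single merge.
    return pd.concat(merged, ignore_index=True)


rng = np.random.default_rng(0)
nrows = 1_000
left = pd.DataFrame({"ID": rng.permutation(nrows), "Float": rng.random(nrows)})
right = pd.DataFrame({"ID": rng.permutation(nrows), "Float_2": rng.random(nrows)})

expected = left.merge(right, on="ID", how="left")
result = partitioned_left_merge(left, right, on="ID", n_parts=4)
print(result.equals(expected))
```

With a scheme like this, the cost of moving the right frame grows with both the number of partitions and the width of the frame, which is one plausible reason wide frames could suffer more than narrow ones.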

The results for the Ray execution engine are as follows (112 Ray workers were used). All times are in seconds.

Increasing number of rows, with the number of columns fixed:

Shape(rows, cols)	(1k, 32)	(10k, 32)	(100k, 32)	(1m, 32)	(10m, 32)	(100m, 32)
pandas	0.0049	0.0089	0.0666	0.9076	13.76358	183.1597
modin	1.4042	1.1947	1.2528	4.4858	115.10042	KILLED

Increasing number of columns, with the number of rows fixed:

Shape(rows, cols)	(1m, 32)	(1m, 100)	(1m, 1k)
pandas	0.9076	3.2592	35.5945
modin	1.2528	34.6418	949.0544
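To make the trend in the fixed-row-count table easier to read, the slowdown factor (Modin time divided by pandas time) can be computed directly from the reported numbers. The snippet below is only arithmetic on the values copied as-is from that table; the variable names are illustrative.

```python
# Timings in seconds, copied from the table with a fixed number of rows.
shapes = ["(1m, 32)", "(1m, 100)", "(1m, 1k)"]
pandas_s = [0.9076, 3.2592, 35.5945]
modin_s = [1.2528, 34.6418, 949.0544]

# Slowdown factor: how many times longer Modin took than pandas.
slowdown = {shape: m / p for shape, p, m in zip(shapes, pandas_s, modin_s)}

for shape, ratio in slowdown.items():
    print(f"{shape}: modin is {ratio:.1f}x slower than pandas")
```

The ratios come out to roughly 1.4x, 11x, and 27x, so the gap widens as columns are added.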

Related issue for other join types: https://github.com/modin-project/modin/issues/656

Mar 04 '22 12:03 prutskov

Related one #3588.

Mar 04 '22 14:03 YarShev

I believe this is something @RehanSD should be working on soon™.

Aug 29 '22 22:08 vnlitvinov

@anmyachev did some work on this too, so he and @RehanSD should contact each other on the matter. Also, this blog post might be useful for the issue: https://medium.com/analytics-vidhya/4-performance-improving-techniques-to-make-spark-joins-10x-faster-2ec8859138b4.

Sep 05 '22 07:09 YarShev

@arunjose696, can you revisit this?

Jan 19 '24 16:01 YarShev

A lot of improvements have been introduced recently to merge. Visit https://github.com/modin-project/modin/pull/6966 for the measurements. File a new issue if someone sees poor performance on a certain case of merge.

Mar 20 '24 09:03 YarShev
