modin icon indicating copy to clipboard operation
modin copied to clipboard

`merge` operation is significantly slower than stock pandas

Open prutskov opened this issue 3 years ago • 4 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): all
  • Modin version (modin.__version__): 3f00e24b041913951e7fec8a13dff280adabb6ee
  • Python version: 3.8.11
  • Code we can use to reproduce:
Benchmark code
import time

import numpy as np
import pandas as pd

import modin.pandas as mpd
import modin.config as cfg
import ray

cfg.BenchmarkMode.put(True)
ray.init()

def generate_data(nrows=5000, n_random_cols=100):
    data1 = {
        "ID": np.random.choice(nrows, nrows, replace=False),
        "Float": np.random.rand(nrows),
    }
    data_random1 = {f"col{i}":  np.random.rand(nrows) for i in range(n_random_cols)}
    data1.update(data_random1)


    data2 = {
        "ID": np.random.choice(nrows, nrows, replace=False),
        "Float_2": np.random.rand(nrows),
    }
    data_random2 = {f"COL{i}":  np.random.rand(nrows) for i in range(n_random_cols)}
    data2.update(data_random2)
    
    df1 = mpd.DataFrame(data1)
    df2 = mpd.DataFrame(data2)

    return df1, df2

if __name__ == "__main__":
    mdf1, mdf2 = generate_data(nrows=1_000_000, n_random_cols=998)
    pdf1, pdf2 = mdf1._to_pandas(), mdf2._to_pandas()

    print(f"Original shape: {pdf1.shape}")

    t = time.time()
    pdf1 = pdf1.merge(pdf2, on="ID", how="left")
    print(f"pandas merge time: {time.time() - t} s")
    print(pdf1.shape)

    t = time.time()
    mdf1 = mdf1.merge(mdf2, on="ID", how="left")
    print(f"modin merge time: {time.time() - t} s")
    print(mdf1.shape)

Describe the problem

merge operation is significantly slower than stock pandas. The operation is used "left" type of join, which is implemented with distributed way, but implementation looks non effective.

The results for Ray execution engine are follows (112 workers is used for Ray):

  1. The table with increasing number of rows, number of columns is fixed.
Shape(rows, cols) (1k, 32) (10k, 32) (100k, 32) (1m, 32) (10m, 32) (100m, 32)
pandas 0.0049 0.0089 0.0666 0.9076 13.76358 183.1597
modin 1.4042 1.1947 1.2528 4.4858 115.10042 KILLED
  1. The table with increasing number of columns, number of rows is fixed.
Shape(rows, cols) (1m, 32) (1m, 100) (1m, 1k)
pandas 0.9076 3.2592 35.5945
modin 1.2528 34.6418 949.0544

Connected task for other join types: https://github.com/modin-project/modin/issues/656

prutskov avatar Mar 04 '22 12:03 prutskov

Related one #3588.

YarShev avatar Mar 04 '22 14:03 YarShev

I believe this is something @RehanSD should be working soon™.

vnlitvinov avatar Aug 29 '22 22:08 vnlitvinov

@anmyachev did some work on this too so he and @RehanSD should contact each on the matter. Also, this blog might be useful for the issue - https://medium.com/analytics-vidhya/4-performance-improving-techniques-to-make-spark-joins-10x-faster-2ec8859138b4.

YarShev avatar Sep 05 '22 07:09 YarShev

@arunjose696, can you revisit this?

YarShev avatar Jan 19 '24 16:01 YarShev

A lot of improvements have been introduced recently to merge. Visit https://github.com/modin-project/modin/pull/6966 for the measurements. File a new issue if someone sees poor performance on a certain case of merge.

YarShev avatar Mar 20 '24 09:03 YarShev