
filter_ and other functions take minutes when the data has 10M rows.

antonio-yu opened this issue 3 years ago

I have a DataFrame whose shape is (10M, 15). It takes 15 minutes to run a simple distinct:

data >> distinct(f.model)

antonio-yu avatar Jul 26 '22 02:07 antonio-yu

For simple operations (i.e. no mutations), raw pandas is ~3x faster:

from timeit import timeit

import pandas as pd
from datar import f
from datar.dplyr import distinct

df = pd.DataFrame(
    {
        "A": range(5000),
        "B": 1,
        "C": 2,
        "D": 3,
        "E": 4,
        "F": 5,
        "G": 6,
        "H": 7,
        "I": 8,
        "J": 9,
        "K": 10,
        "L": 11,
        "M": 12,
        "N": 13,
        "O": 14,
    }
)

df = pd.concat([df] * 200)

def a():
    return distinct(df, f.A)


def b():
    return df.drop_duplicates(subset=["A"])

ta = timeit("a()", number=10, globals=globals())
print(ta)  # 1.4146980000004987

tb = timeit("b()", number=10, globals=globals())
print(tb)  # 0.5306366000004346

However, when mutations are involved, distinct() is comparable to the equivalent pandas code:

def ax():
    return distinct(df, f.A + 1, f.B + 2)


def bx():
    df2 = df.copy()
    df2["A"] = df2["A"] + 1
    df2["B"] = df2["B"] + 2
    # drop duplicate rows across the mutated columns
    uniq = df2.drop_duplicates()
    return uniq


ta = timeit("ax()", number=10, globals=globals())
print(ta)  # 3.463686499999312

tb = timeit("bx()", number=10, globals=globals())
print(tb)  # 3.977360900003987
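The two paths are comparable here because both have to materialize the mutated columns before deduplicating, so expression evaluation dominates the cost. A minimal plain-pandas sketch of distinct-on-expressions (the helper name and keyword-lambda API are hypothetical, not datar's actual interface):

```python
import pandas as pd


def distinct_on_exprs(df, **exprs):
    """Hypothetical helper: evaluate each expression against df,
    then keep the first row per unique combination of the results
    (mirroring distinct() applied to mutated columns)."""
    out = pd.DataFrame({name: fn(df) for name, fn in exprs.items()})
    return out.drop_duplicates()


df = pd.DataFrame({"A": [0, 0, 1, 1], "B": [1, 1, 1, 2]})
res = distinct_on_exprs(df, A=lambda d: d["A"] + 1, B=lambda d: d["B"] + 2)
# res holds the unique (A+1, B+2) combinations: 3 rows here
```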

pwwang avatar Jul 26 '22 02:07 pwwang


distinct() is now optimized when it operates on existing columns. The updated times for a(), b(), ax(), and bx():

a(): 1.0249294309978723
b(): 0.7078308410054888
ax(): 6.315244514997175
bx(): 5.231293500000902
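One way such an optimization can work (a hedged sketch, not datar's actual implementation; distinct_fast is a hypothetical name) is to detect when every argument is a plain reference to an existing column and delegate straight to drop_duplicates, skipping expression evaluation entirely:

```python
import pandas as pd


def distinct_fast(df, *cols):
    """Hypothetical fast path: when every argument names an existing
    column, call drop_duplicates directly instead of evaluating each
    argument as an expression against the frame."""
    if all(isinstance(c, str) and c in df.columns for c in cols):
        return df.drop_duplicates(subset=list(cols) or None)
    raise NotImplementedError("fall back to expression evaluation")


df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 3]})
out = distinct_fast(df, "A")
# keeps the first row per unique value of A: 2 rows
```

The fallback branch is where the slower general machinery (mutate, then deduplicate) would run; the speedup comes purely from avoiding it when no mutation is needed.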

pwwang avatar Aug 25 '22 17:08 pwwang