filter_ and other functions take minutes when the data has 10M rows
I have a DataFrame whose shape is (10M, 15). It takes 15 minutes to run a simple distinct:
data >> distinct(f.model)
For simple operations (i.e. no mutations), raw pandas is ~3x faster:
from timeit import timeit

import pandas as pd
from datar import f
from datar.dplyr import distinct

df = pd.DataFrame(
    {
        "A": range(5000),
        "B": 1,
        "C": 2,
        "D": 3,
        "E": 4,
        "F": 5,
        "G": 6,
        "H": 7,
        "I": 8,
        "J": 9,
        "K": 10,
        "L": 11,
        "M": 12,
        "N": 13,
        "O": 14,
    }
)
df = pd.concat([df] * 200)  # 1,000,000 rows: 5,000 distinct values of A, repeated 200x

def a():
    return distinct(df, f.A)

def b():
    return df.drop_duplicates(subset=["A"])

ta = timeit("a()", number=10, globals=globals())
print(ta)  # 1.4146980000004987
tb = timeit("b()", number=10, globals=globals())
print(tb)  # 0.5306366000004346
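Applied to the original question, the raw-pandas fallback would look roughly like this (a sketch, not from the thread; it assumes data is a plain pandas DataFrame with a model column, and the right equivalent depends on whether the other columns should be kept, i.e. dplyr's .keep_all behavior):

import pandas as pd

# stand-in for the 10M-row frame from the question
data = pd.DataFrame({"model": ["a", "b", "a", "c"], "price": [1, 2, 3, 4]})

# distinct(f.model) without keeping other columns: just the distinct models
models_only = data[["model"]].drop_duplicates()

# keeping all columns (first row per model), like dplyr's .keep_all = TRUE
keep_all = data.drop_duplicates(subset=["model"])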
However, when mutations are involved, distinct() is comparable to pandas:
def ax():
    return distinct(df, f.A + 1, f.B + 2)

def bx():
    df2 = df.copy()
    df2["A"] = df2["A"] + 1
    df2["B"] = df2["B"] + 2  # match the f.B + 2 mutation above
    # drop duplicate rows across all columns
    uniq = df2.drop_duplicates()
    return uniq
ta = timeit("ax()", number=10, globals=globals())
print(ta) # 3.463686499999312
tb = timeit("bx()", number=10, globals=globals())
print(tb) # 3.977360900003987
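For reference (not from the thread), the pandas side can be written more idiomatically with assign, which builds the mutated copy in one step; the timing should be similar, since materializing the new columns over the full frame is what dominates:

def bx_assign():
    # assign returns a new frame with A and B mutated, then deduplicate
    return df.assign(A=df["A"] + 1, B=df["B"] + 2).drop_duplicates()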
distinct() is now optimized when it operates on existing columns. The times for a(), b(), ax(), and bx() are now:
a(): 1.0249294309978723
b(): 0.7078308410054888
ax(): 6.315244514997175
bx(): 5.231293500000902
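To reproduce these numbers, the four functions can be timed with one harness (this assumes the setup script above has already been run; absolute times will vary by machine and datar version):

from timeit import timeit

# time each benchmark function 10 times, as in the runs above
for name in ("a", "b", "ax", "bx"):
    t = timeit(f"{name}()", number=10, globals=globals())
    print(f"{name}(): {t}")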