dfply Performance issue

Open Make42 opened this issue 3 years ago • 0 comments

I have a pandas DataFrame that contains experiment results. The experiment setups are described by the groupcols columns (string, float and integer columns) and the evaluation by the eval_val column (a float column). I want to find the best result for each experiment type, that is, among all experiments with the same setup. For that I wrote three pipelines that have the same final DataFrame as a result:

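from time import time

import dfply as dp
from dfply import X

groupcols: list  # names of the columns that describe the experiment setup

# Pipeline 1 (dfply): keep the rows whose eval_val equals the group maximum
t0 = time()
res_best1 = (res_long >>
             dp.group_by(*groupcols) >>
             dp.filter_by(X.eval_val == dp.colmax(X.eval_val)) >>
             dp.ungroup() >>
             dp.distinct()).reset_index(drop=True)
print(time() - t0)

# Pipeline 2 (dfply): sort by eval_val descending, then take the first row of each group
t0 = time()
res_best2 = (res_long >>
             dp.arrange(X.eval_val, ascending=False) >>
             dp.group_by(*groupcols) >>
             dp.head(1) >>
             dp.ungroup()).reset_index(drop=True)
print(time() - t0)

# Pipeline 3 (plain pandas): sort descending, then take the first entry of each group
t0 = time()
res_best3 = res_long.sort_values('eval_val', ascending=False).groupby(groupcols).first().reset_index()
print(time() - t0)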

The first pipeline takes about 54.5 seconds to run and the second about 35.1 seconds, but the last pipeline, which is what I want to report, takes only 0.073 seconds. So plain pandas is roughly 480 to 750 times faster than dfply here. Maybe this is a bug?
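For comparison, here is a minimal pandas-only sketch of the two selection strategies used above: keeping every row that attains the group maximum (as the first pipeline does) and keeping exactly one row per group (as the second and third do). The small res_long frame and the column names model, lr and run are made up for illustration; only eval_val and groupcols come from the report. It is not a benchmark and says nothing about why dfply is slower.

```python
import pandas as pd

# Hypothetical stand-in for res_long; "model", "lr" and "run" are invented names.
res_long = pd.DataFrame({
    "model": ["a", "a", "a", "b", "b"],
    "lr": [0.1, 0.1, 0.1, 0.1, 0.1],
    "run": [1, 2, 3, 1, 2],
    "eval_val": [0.5, 0.9, 0.9, 0.3, 0.7],
})
groupcols = ["model", "lr"]

# Strategy A: keep every row whose eval_val equals its group's maximum.
# transform("max") broadcasts the per-group maximum back onto each row.
group_max = res_long.groupby(groupcols)["eval_val"].transform("max")
best_with_ties = (
    res_long[res_long["eval_val"] == group_max]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Strategy B: keep exactly one row per group.
# idxmax returns the index label of the first maximal row in each group.
best_idx = res_long.groupby(groupcols)["eval_val"].idxmax()
best_one = res_long.loc[best_idx].reset_index(drop=True)

print(best_with_ties)
print(best_one)
```

Note that the two strategies only agree when no group has tied maxima in rows that differ elsewhere; in the toy data, group a keeps two rows under strategy A and one under strategy B.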

Apr 17 '21 13:04 Make42
