dfply
Performance issue
I have a pandas DataFrame that contains experiment results. The experiment setups are described by the groupcols columns (string, float and integer columns) and the evaluation by the eval_val column (a float column). I want to find the best result for each experiment type, i.e. among all experiments that share the same setup. For that I wrote three pipelines that produce the same final DataFrame:
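from time import time
import dfply as dp
from dfply import X
# groupcols holds the names of the columns that describe an experiment setup
groupcols: list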
t0 = time()
res_best1 = (res_long >>
dp.group_by(*groupcols) >>
dp.filter_by(X.eval_val == dp.colmax(X.eval_val)) >>
dp.ungroup() >>
dp.distinct()). \
reset_index(drop=True)
print(time() - t0)
t0 = time()
res_best2 = (res_long >>
dp.arrange(X.eval_val, ascending=False) >>
dp.group_by(*groupcols) >>
dp.head(1) >>
dp.ungroup()). \
reset_index(drop=True)
print(time() - t0)
t0 = time()
res_best3 = res_long.sort_values('eval_val', ascending=False).groupby(groupcols).first().reset_index()
print(time() - t0)
The first pipeline takes about 54.5 seconds and the second about 35.1 seconds, but the plain pandas pipeline, and that is what I want to report, takes only 0.073 seconds. So pandas is roughly 480 to 750 times faster than dfply here. Could this be a bug?
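For anyone who wants to check the pandas side of this without my data, here is a minimal sketch. It makes several assumptions: the synthetic res_long, its column names (model, lr, depth) and the row count are made up for illustration and are not my real data. It builds a frame with string, float and integer setup columns, picks the best row per setup in the same way as res_best3, and compares that with a groupby/idxmax lookup that avoids the sort.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for res_long; the real data is not part of this report.
rng = np.random.default_rng(0)
n_rows = 10_000
res_long = pd.DataFrame({
    "model": rng.choice(["a", "b", "c"], size=n_rows),   # string setup column
    "lr": rng.choice([0.1, 0.01, 0.001], size=n_rows),   # float setup column
    "depth": rng.integers(1, 6, size=n_rows),            # integer setup column
    "eval_val": rng.random(n_rows),                      # evaluation result
})
groupcols = ["model", "lr", "depth"]

# Same idea as res_best3: sort descending, keep the first row of each group.
best_sorted = (res_long.sort_values("eval_val", ascending=False)
               .groupby(groupcols).first().reset_index())

# Alternative without sorting: look up the row label of each group's maximum.
best_idxmax = (res_long.loc[res_long.groupby(groupcols)["eval_val"].idxmax()]
               .sort_values(groupcols).reset_index(drop=True))
```

Both frames should contain one row per distinct setup with the maximum eval_val of that setup. Note that groupby(...).first() takes the first non-null value per column, so it can differ from a true "first row" selection when the data contains missing values.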