pyjanitor
pyjanitor copied to clipboard
[ENH] improved performance for `first/last` in `conditional_join`
PR Description
Please describe the changes proposed in the pull request:
- improved performance for
first/lastinconditional_join - applies if join columns on the right is already sorted
- limited to single non-equi joins or range joins (max two join columns)
This PR improves conditional_join.
Performance tests - YMMV:
import pandas as pd
import janitor
In [21]: df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
...: df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})
In [32]: df1 = pd.concat([df1]*10_000)
In [33]: df2 = pd.concat([df2]*200)
# this PR
In [37]: %timeit df1.conditional_join(df2.sort_values('value_2A'), ('value_1','value_2A','>='), keep='first')
3.35 ms ± 59.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [45]: %timeit df1.conditional_join(df2.sort_values(['value_2A','value_2B']), ('value_1','value_2A','>='), ('value_1','value_2B','<='),keep='first')
6.11 ms ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# dev
In [47]: %timeit df1.conditional_join(df2.sort_values(['value_2A','value_2B']), ('value_1','value_2A','>='), ('value_1','value_2B'
...: ,'<='),keep='first')
def conditional_join(
300 ms ± 5.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [48]: %timeit df1.conditional_join(df2.sort_values('value_2A'), ('value_1','value_2A','>='), keep='first')
57.6 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
- @ericmjl
🚀 Deployed on https://deploy-preview-1382--pyjanitor.netlify.app