pyjanitor icon indicating copy to clipboard operation
pyjanitor copied to clipboard

[ENH] improved performance for `first/last` in `conditional_join`

Open samukweku opened this issue 1 year ago • 1 comments

PR Description

Please describe the changes proposed in the pull request:

  • improved performance for first/last in conditional_join
  • applies if join columns on the right is already sorted
  • limited to single non-equi joins or range joins (max two join columns)

This PR improves conditional_join.

Performance tests - YMMV:

import pandas as pd
import janitor
In [21]: df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
    ...: df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})

In [32]: df1 = pd.concat([df1]*10_000)

In [33]: df2 = pd.concat([df2]*200)

# this PR
In [37]: %timeit df1.conditional_join(df2.sort_values('value_2A'), ('value_1','value_2A','>='), keep='first')
3.35 ms ± 59.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [45]: %timeit df1.conditional_join(df2.sort_values(['value_2A','value_2B']), ('value_1','value_2A','>='), ('value_1','value_2B','<='),keep='first')
6.11 ms ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# dev 
In [47]: %timeit df1.conditional_join(df2.sort_values(['value_2A','value_2B']), ('value_1','value_2A','>='), ('value_1','value_2B'
    ...: ,'<='),keep='first')
  def conditional_join(
300 ms ± 5.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [48]: %timeit df1.conditional_join(df2.sort_values('value_2A'), ('value_1','value_2A','>='), keep='first')
57.6 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • @ericmjl

samukweku avatar Jun 28 '24 01:06 samukweku

🚀 Deployed on https://deploy-preview-1382--pyjanitor.netlify.app

ericmjl avatar Jun 28 '24 01:06 ericmjl