[ENH] Perf for left/right join when `sort_by_appearance` is False

Open samukweku opened this issue 2 years ago • 2 comments

PR Description

Please describe the changes proposed in the pull request:

  • Improves left/right join performance in conditional_join when sort_by_appearance is False
  • Row order is ignored in this mode, which is why it is faster
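
For readers unfamiliar with the join being timed in this PR, here is a minimal brute-force sketch of what a left join on `start < value < end` returns. It is not pyjanitor's implementation: `naive_left_range_join` is a hypothetical helper written only for illustration, it uses plain NumPy broadcasting, and it costs O(n × m) memory, so it only suits tiny inputs. It emits matched rows first and unmatched rows last, which is one way a result can contain the right rows without following the order in which they appear in the left frame.

```python
import numpy as np
import pandas as pd


def naive_left_range_join(left, right):
    """Brute-force left join on left.start < right.value < left.end.

    Hypothetical illustration only; not how pyjanitor computes the join.
    """
    start = left["start"].to_numpy()[:, None]
    end = left["end"].to_numpy()[:, None]
    value = right["value"].to_numpy()[None, :]

    # mask[i, j] is True when right row j falls strictly inside left row i.
    mask = (start < value) & (value < end)
    left_idx, right_idx = np.nonzero(mask)

    matched = pd.DataFrame(
        {
            "start": left["start"].to_numpy()[left_idx],
            "end": left["end"].to_numpy()[left_idx],
            "value": right["value"].to_numpy()[right_idx].astype(float),
        }
    )
    # Left rows with no match are kept, with a missing value.
    unmatched = left.loc[~mask.any(axis=1), ["start", "end"]].assign(value=np.nan)
    return pd.concat([matched, unmatched], ignore_index=True)


left = pd.DataFrame({"start": [1, 9, 5], "end": [4, 9, 8]})
right = pd.DataFrame({"value": [2, 3, 6, 20]})
result = naive_left_range_join(left, right)
print(result)
```

In this toy input the second left row (`start=9, end=9`) has no match. An order-preserving result would place it between the matches of the first and third rows; the sketch places it last. The set of rows is the same either way.
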
import numpy as np
import pandas as pd
import janitor  # registers the conditional_join DataFrame method

np.random.seed(3)
dd = pd.DataFrame({'value': np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start': np.random.randint(100000, size=10_000),
                   'end': np.random.randint(100000, size=10_000)})

# dev
In [6]:  %timeit df.conditional_join(dd, ('start', 'value', '<'), ('end', 'value', '>'), use_numba=True, how='left', sort_by_appearance=False)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
9.41 s ± 405 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]:  %timeit df.conditional_join(dd, ('start', 'value', '<'), ('end', 'value', '>'), use_numba=True, how='right', sort_by_appearance=False)
18.2 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# PR
In [10]: %timeit df.conditional_join(dd, ('start', 'value', '<'), ('end', 'value', '>'), use_numba=True, how='left', sort_by_appearance=False)
2.83 s ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit df.conditional_join(dd, ('start', 'value', '<'), ('end', 'value', '>'), use_numba=True, how='right', sort_by_appearance=False)
2.83 s ± 76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
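
As a rough reading of the timings above, the snippet below computes the speedup ratios from the reported mean times. It uses only the four means quoted in this description and ignores the reported standard deviations, so treat the ratios as approximate.

```python
# Mean wall-clock times in seconds, copied from the %timeit output above.
timings = {
    "left": (9.41, 2.83),   # (dev, PR)
    "right": (18.2, 2.83),  # (dev, PR)
}

speedups = {how: dev / pr for how, (dev, pr) in timings.items()}

for how, ratio in speedups.items():
    print(f"how='{how}': roughly {ratio:.1f}x faster on this benchmark")
```

On this benchmark that works out to about 3.3x for the left join and about 6.4x for the right join.
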

This PR relates to #1102 .

PR Checklist

Please ensure that you have done the following:

  1. [x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  2. [x] If you're not on the contributors list, add yourself to AUTHORS.md.
  3. [x] Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests pass
  • Making sure that code coverage doesn't go down

Relevant Reviewers

Please tag maintainers to review.

  • @ericmjl

samukweku, Sep 13 '22 23:09

🚀 Deployed on https://deploy-preview-1170--pyjanitor.netlify.app

ericmjl, Sep 13 '22 23:09

Codecov Report

Merging #1170 (afd52c3) into dev (1914eb5) will increase coverage by 0.03%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1170      +/-   ##
==========================================
+ Coverage   97.58%   97.61%   +0.03%     
==========================================
  Files          78       78              
  Lines        3556     3571      +15     
==========================================
+ Hits         3470     3486      +16     
+ Misses         86       85       -1     

codecov[bot], Sep 14 '22 00:09

LGTM! I'm going to merge.

ericmjl, Sep 24 '22 18:09