pyjanitor
[ENH] Perf for left/right join when `sort_by_appearance` is False
PR Description
Please describe the changes proposed in the pull request:
- Performance improvement for left/right joins when `sort_by_appearance` is False. In that mode the order of the output rows is ignored, which is why it is faster.
import numpy as np
import pandas as pd
import janitor  # registers conditional_join on DataFrame

np.random.seed(3)
dd = pd.DataFrame({'value': np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start': np.random.randint(100000, size=10_000),
                   'end': np.random.randint(100000, size=10_000)})
# dev
In [6]: %timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True, how = 'left', sort_by_appearance=False)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
9.41 s ± 405 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True, how = 'right', sort_by_appearance=False)
18.2 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# PR
In [10]: %timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True, how = 'left', sort_by_appearance=False)
2.83 s ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: %timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True, how = 'right', sort_by_appearance=False)
2.83 s ± 76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This PR relates to #1102 .
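
For readers who want to check what the benchmarked call computes, the join condition (`start < value < end`, `how='left'`) can be sketched in plain pandas. This is an illustrative reference only, not pyjanitor's implementation: `naive_left_range_join` and `same_rows` are hypothetical helper names, and the cross join is quadratic in memory, so it is only suitable for small inputs. Because row order is not guaranteed when `sort_by_appearance` is False, the comparison helper sorts both frames before checking equality.

```python
import pandas as pd


def naive_left_range_join(left, right):
    """Reference left join on start < value < end (hypothetical helper)."""
    # Record each left row's position so unmatched rows can be restored later.
    left = left.reset_index(drop=True).rename_axis("left_pos").reset_index()
    # Pair every left row with every right row: O(n * m) memory.
    pairs = left.merge(right, how="cross")
    matched = pairs[(pairs["start"] < pairs["value"]) & (pairs["end"] > pairs["value"])]
    # Left rows without a match are kept, with NaN in the right-hand column.
    out = left.merge(matched[["left_pos", "value"]], on="left_pos", how="left")
    return out.drop(columns="left_pos")


def same_rows(a, b):
    """Compare two frames while ignoring row order (hypothetical helper)."""
    cols = list(a.columns)
    a = a.sort_values(cols).reset_index(drop=True)
    b = b.sort_values(cols).reset_index(drop=True)
    return a.equals(b)


# Tiny example: the second left row has no matching value.
left = pd.DataFrame({"start": [1, 5, 10], "end": [4, 6, 20]})
right = pd.DataFrame({"value": [2, 3, 15]})
out = naive_left_range_join(left, right)
```

A result from `conditional_join(..., how='left', sort_by_appearance=False)` on small inputs could be checked against this sketch with `same_rows`, assuming the column names and dtypes are first aligned between the two frames.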
PR Checklist
Please ensure that you have done the following:
- [x] PR in from a fork off your branch. Do not PR from `<your_username>`:`dev`, but rather from `<your_username>`:`<feature-branch_name>`.
- [x] If you're not on the contributors list, add yourself to `AUTHORS.md`.
- [x] Add a line to `CHANGELOG.md` under the latest version header (i.e. the one that is "on deck") describing the contribution.
  - Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.
Automatic checks
There will be automatic checks run on the PR. These include:
- Building a preview of the docs on Netlify
- Automatically linting the code
- Making sure the code is documented
- Making sure that all tests are passed
- Making sure that code coverage doesn't go down.
Relevant Reviewers
Please tag maintainers to review.
- @ericmjl
🚀 Deployed on https://deploy-preview-1170--pyjanitor.netlify.app
Codecov Report
Merging #1170 (afd52c3) into dev (1914eb5) will increase coverage by 0.03%. The diff coverage is 100.00%.
@@            Coverage Diff             @@
##              dev    #1170      +/-   ##
==========================================
+ Coverage   97.58%   97.61%   +0.03%
==========================================
  Files          78       78
  Lines        3556     3571      +15
==========================================
+ Hits         3470     3486      +16
+ Misses         86       85       -1
LGTM! I'm going to merge.