fuzzyjoin
add vignette on nonequi joins
Here is a short vignette in response to your call, showing a use that seems to be in demand but is not easily available elsewhere; cf. https://github.com/hadley/dplyr/issues/557 and http://stackoverflow.com/q/41132081/1036500. Let me know what you think!
Current coverage is 84.65% (diff: 100%)
@@            master    #20   diff @@
==========================================
  Files            7      9     +2
  Lines          327    443   +116
  Methods          0      0
  Messages         0      0
  Branches         0      0
==========================================
+ Hits           273    375   +102
- Misses          54     68    +14
  Partials         0      0
Powered by Codecov. Last update 972a7f3...52be550
Wow! I really didn't realize it was this simple (I've tried non-equi joins before, but with solutions that weren't as concise).
Before merging, I wonder if there's a way to turn this into an ineq_join family of functions. I suppose it wouldn't end up much more concise than this, but I like that it would make it clearer to the reader what was happening (and would spare the end user from regularly having to use match_fun).
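For illustration, here is a minimal sketch of what one member of such a family might look like. The name ineq_join_sketch and its ops argument are hypothetical and not part of fuzzyjoin; the sketch only wraps the existing fuzzy_join() so that the caller passes operators as strings rather than building a match_fun list.

```r
library(fuzzyjoin)

# Hypothetical wrapper (not part of fuzzyjoin): translate operator strings
# such as ">=" into functions and hand them to fuzzy_join() as match_fun.
ineq_join_sketch <- function(x, y, by, ops, mode = "inner") {
  stopifnot(length(by) == length(ops))
  # match.fun(">=") returns the function `>=`; keep the list unnamed so
  # the functions are matched to the `by` pairs by position.
  match_fun <- unname(lapply(ops, match.fun))
  fuzzy_join(x, y, by = by, match_fun = match_fun, mode = mode)
}

# Toy data: keep rows of x whose x1 lies inside an interval [y1, y2] of y.
x <- data.frame(x1 = c(1, 5, 10))
y <- data.frame(y1 = c(0, 4), y2 = c(2, 6))

res <- ineq_join_sketch(x, y,
                        by  = c("x1" = "y1", "x1" = "y2"),
                        ops = c(">=", "<="))
res
```

With these toy inputs, only x1 = 1 and x1 = 5 fall inside an interval, so x1 = 10 is dropped by the inner join.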
Thanks for taking a look, the credit goes to the author of this SO answer.
For a specific function, do you mean something that might work like this?
ineq_join(x, y,
          join_by = c("x1" >= "y1",
                      "x1" <= "y2"))
I'm looking at this again while planning a CRAN release. I'm starting to think that if we do encourage ineq joins, there should be a function provided, and furthermore that it should use data.table as a backend, since it's much faster.
One example, of joining two tables of size 1000 each, indicates data.table can be ~100X faster:
library(nycflights13)
library(fuzzyjoin)
library(data.table)
library(microbenchmark)

f <- head(flights, 1000)

# Compare fuzzyjoin (fj) with a data.table non-equi join (dt) on the same 1000 rows
mb <- microbenchmark(
  fj = fuzzy_left_join(f, f, by = c("hour" = "hour", "minute" = "minute"),
                       match_fun = list(`==`, `>=`)),
  dt = setDT(f)[setDT(f), on = .(hour == hour, minute >= minute), allow.cartesian = TRUE],
  times = 5
)
mb
Results on my machine:
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval cld
   fj 2492.45894 2658.27418 2708.33949 2663.66922 2803.16895 2924.12617     5   b
   dt   13.97189   16.14638   28.32267   16.54213   33.72303   61.22995     5  a
I don't mind taking on data.table as a dependency (probably in Imports, though it could go in Suggests with a check at the start of the function), since there are likely other opportunities to use it to speed up functions.
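For reference, the "check at the start of the function" for a suggested package is usually done with requireNamespace(). A minimal sketch follows; the helper name check_suggested is hypothetical and only illustrates the pattern, not how fuzzyjoin implements it.

```r
# Hypothetical helper: fail early with a clear message when an optional
# (Suggests) package is not installed.
check_suggested <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is required for this function. ",
         "Install it with install.packages(\"", pkg, "\").",
         call. = FALSE)
  }
  invisible(TRUE)
}

# A function relying on data.table would call this first, e.g.
# check_suggested("data.table"), and then use data.table::setDT() etc.
ok <- check_suggested("stats")  # "stats" ships with R, so this succeeds
```

The trade-off is that Suggests keeps the install footprint small for users who never call the data.table-backed function, at the cost of a runtime error instead of an install-time guarantee.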
That's great to know about the speed-up from data.table. I'll have a go at making an ineq function and some tests in my fork, but if you beat me to it I won't be upset 😄
I've got most of this implemented now, so I'll go ahead with finishing it!
Sounds good, thanks!
Did this make it into the current CRAN version? I'm not sure that I can find it in the docs. Thanks!
Unfortunately not yet. I think I'm going to submit a CRAN version today (there are some long-pending bug fixes and new features) and then get back to work on this for the next version. I'd rather have it all complete, though, and again I really do appreciate the vignette!
Righto, thanks for the update. I'm looking forward to the next CRAN release!
What happened to this update :D