
add vignette on nonequi joins

Open benmarwick opened this issue 7 years ago • 11 comments

Here is a short vignette in response to your call, showing a use that seems to be in demand but is not easily available elsewhere; cf. https://github.com/hadley/dplyr/issues/557 and http://stackoverflow.com/q/41132081/1036500. Let me know what you think!
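
For readers who have not seen the pattern, a minimal sketch of this kind of non-equi join follows. The data frames and column names are made up for illustration and are not taken from the vignette; the only fuzzyjoin features assumed are `fuzzy_left_join()` with a named `by` vector and a list of comparison functions in `match_fun`.

```r
library(fuzzyjoin)

# Illustrative data (an assumption, not from the vignette):
# events, and the intervals they may fall into
events <- data.frame(id = 1:4, time = c(2, 5, 11, 20))
intervals <- data.frame(label = c("a", "b"), start = c(1, 10), end = c(6, 15))

# Keep every event; match it to an interval when start <= time <= end.
# The i-th function in match_fun is applied to the i-th pair in `by`.
joined <- fuzzy_left_join(
  events, intervals,
  by = c("time" = "start", "time" = "end"),
  match_fun = list(`>=`, `<=`)
)

joined
```

Events that fall in no interval are kept with `NA` in the interval columns, as in any left join.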

benmarwick avatar Dec 14 '16 08:12 benmarwick

Current coverage is 84.65% (diff: 100%)

Merging #20 into master will increase coverage by 1.16%

@@             master        #20   diff @@
==========================================
  Files             7          9     +2   
  Lines           327        443   +116   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits            273        375   +102   
- Misses           54         68    +14   
  Partials          0          0          

Powered by Codecov. Last update 972a7f3...52be550

codecov-io avatar Dec 15 '16 04:12 codecov-io

Wow! I really didn't realize it was this simple (I've tried non-equi joins before, but with solutions that weren't as concise).

Before merging, I wonder if there's a way to turn this into an ineq_join family of functions. I suppose it wouldn't end up much more concise than this, but it would make it clearer to the reader what is happening, and the end user wouldn't have to pass match_fun every time.

dgrtwo avatar Dec 15 '16 05:12 dgrtwo

Thanks for taking a look, the credit goes to the author of this SO answer.

For a specific function, do you mean something that might work like this?

ineq_join(x, y, 
          join_by = c("x1" >= "y1",   
                      "x1" <= "y2"))
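
As a rough sketch only, a wrapper in this spirit could be built on top of `fuzzy_left_join()`. Note that the call above would not run as written, since R evaluates `"x1" >= "y1"` as a string comparison, so the sketch below takes each condition as a character vector instead. The function name `ineq_left_join` and its calling convention are hypothetical and not part of fuzzyjoin.

```r
library(fuzzyjoin)

# Hypothetical helper, not part of fuzzyjoin: each condition is a
# character vector c(x_column, operator, y_column)
ineq_left_join <- function(x, y, ...) {
  conds <- list(...)
  # Build the named `by` vector: names are x columns, values are y columns
  by <- vapply(conds, function(cond) cond[3], character(1))
  names(by) <- vapply(conds, function(cond) cond[1], character(1))
  # Look up each comparison function (e.g. `>=`) from its name
  match_fun <- lapply(conds, function(cond) match.fun(cond[2]))
  fuzzy_left_join(x, y, by = by, match_fun = match_fun)
}

# Illustrative data: keep rows of x, matched where y1 <= x1 <= y2
x <- data.frame(x1 = c(3, 8, 30))
y <- data.frame(y1 = c(1, 5), y2 = c(4, 10))

res <- ineq_left_join(x, y, c("x1", ">=", "y1"), c("x1", "<=", "y2"))
res
```

This hides `match_fun` from the caller but is no faster than calling `fuzzy_left_join()` directly.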

benmarwick avatar Dec 15 '16 06:12 benmarwick

I'm looking at this again while planning a CRAN release. I'm starting to think that if we do encourage ineq joins, a function should be provided, and furthermore that it should use data.table as a backend, since it's much faster.

One example, joining two tables of 1000 rows each, indicates data.table can be ~100X faster:

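library(nycflights13)
library(fuzzyjoin)
library(data.table)
library(microbenchmark)

# First 1000 flights, joined against themselves
f <- head(flights, 1000)

mb <- microbenchmark(
  # fuzzyjoin: equality on hour, inequality on minute
  fj = fuzzy_left_join(f, f, by = c("hour" = "hour", "minute" = "minute"), match_fun = list(`==`, `>=`)),
  # data.table non-equi join; note that setDT() converts f to a data.table by reference
  dt = setDT(f)[setDT(f), on = .(hour == hour, minute >= minute), allow.cartesian = TRUE],
  times = 5)

mb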

Results on my machine:

Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval cld
   fj 2492.45894 2658.27418 2708.33949 2663.66922 2803.16895 2924.12617     5   b
   dt   13.97189   16.14638   28.32267   16.54213   33.72303   61.22995     5  a 

I don't mind taking on data.table as a dependency (probably under Imports, though it could go under Suggests with a check at the start of the function), since there are likely other opportunities to use it to speed up functions.
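
For the Suggests route, the check at the start of the function could look roughly like the sketch below. The helper name `check_data_table` is hypothetical; `requireNamespace()` is the standard base R way to test for an optional package without attaching it.

```r
# Hypothetical helper: call this first in any function that needs data.table
check_data_table <- function() {
  if (!requireNamespace("data.table", quietly = TRUE)) {
    # Fail early with a clear message instead of a cryptic error later on
    stop("Package 'data.table' is required for this function. ",
         "Install it with install.packages(\"data.table\").",
         call. = FALSE)
  }
  invisible(TRUE)
}
```

The trade-off is that users without data.table only hit the error when they call a function that needs it, rather than at install time as with Imports.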

dgrtwo avatar Jan 16 '17 15:01 dgrtwo

That's great to know about the speed-up from data.table. I'll have a go at making an ineq function and some tests in my fork, but if you beat me to it I won't be upset 😄

benmarwick avatar Jan 17 '17 01:01 benmarwick

I've got most of this implemented now, so I'll go ahead with finishing it!

dgrtwo avatar Jan 17 '17 04:01 dgrtwo

Sounds good, thanks!

benmarwick avatar Jan 17 '17 04:01 benmarwick

Did this make it into the current CRAN version? I'm not sure that I can find it in the docs. Thanks!

benmarwick avatar Jun 12 '17 21:06 benmarwick

Unfortunately not yet. I think I'm going to submit a CRAN version today (there are some long-running bug fixes and new features) and then get back to work on this for the next version. I'd rather have it all complete though, and again I really do appreciate the vignette!

dgrtwo avatar Jun 19 '17 19:06 dgrtwo

Righto, thanks for the update. I'm looking forward to the next CRAN release!

benmarwick avatar Jun 20 '17 02:06 benmarwick

What happened to this update :D

CGMossa avatar Mar 27 '22 14:03 CGMossa