fuzzyjoin
fuzzyjoin copied to clipboard
Speed Up When Simple Join Works for Most Columns
Thanks for writing the package.
I have data where most columns should match exactly, but two of the columns require fuzzy matching. The data take a while to join on the 6 by columns. One data.frame is 2597 rows and the other is 44929 rows.
I started running the following code to try to match the 6 columns, and it ran several minutes before crashing because R ran out of memory.
foo <-
fuzzyjoin::fuzzy_left_join(
x=sdtab %>% rename(USUBJIDN=ID) %>% mutate(ROWID=row_number()),
y=d_combined,
by=c("USUBJIDN", intersect(names(sdtab), names(d_combined))),
match_fun=
list(
USUBJIDN=function(x, y) x == y,
PARAMN=function(x, y) x == y,
TSFM=function(x, y) abs(x/y - 1) < 0.0001,
AVALN=function(x, y) abs(x/y - 1) < 0.0001,
EVID=function(x, y) x == y,
MDV=function(x, y) x == y
)
)
Is there a more efficient way to perform an exact join on several columns and only perform the fuzzy join on the remaining two?
Here is a (much less feature-rich) consideration for a faster and less memory intensive algorithm:
fuzzy_join_2 <- function(x, y, match_fun, ...) {
idx_df <- tibble()
for (row_x in seq_len(nrow(x))) {
current_mask <-
fuzzy_join_match(
rep(TRUE, nrow(y)),
x=x[row_x, , drop=FALSE],
y=y,
match_fun=match_fun,
...
)
current_idx <- which(current_mask)
idx_df <-
bind_rows(
idx_df,
if (length(current_idx)) {
tibble(
x=row_x,
y=current_idx
)
} else {
tibble(
x=row_x,
y=NA_integer_
)
}
)
}
extra_rows <- setdiff(seq_len(nrow(y)), idx_df$y)
bind_rows(
idx_df,
if (length(extra_rows)) {
tibble(
x=NA_integer_,
y=extra_rows
)
}
)
}
fuzzy_join_match <- function(current_mask, x, y, match_fun, ...) {
current_name <- names(match_fun)[1]
current_mask[current_mask] <-
match_fun[[1]](
x[[current_name]],
y[[current_name]][current_mask],
...
)
current_mask[is.na(current_mask)] <- FALSE
if (any(current_mask) & length(match_fun) > 1) {
current_mask <- fuzzy_join_match(current_mask, x, y, match_fun[-1], ...)
}
current_mask
}
Do you have some examples and benchmarks on your proposal?
I recently tried fuzzyjoin
on a small (31k and 19700) dataset, exact matching on 1 column, one <= and one >= match, but always hit out of memory 😢
@espinielli, I have not significantly tested my proposal. The example should be similar to my initial example (just giving an x
and y
along with appropriate match_fun
values.