fuzzyjoin Speed Up When Simple Join Works for Most Columns

Speed Up When Simple Join Works for Most Columns

Open billdenney opened this issue 4 years ago • 3 comments

Thanks for writing the package.

I have data where most columns should match exactly, but two of the columns require fuzzy matching. The data take a while to join on the 6 by columns. One data.frame is 2597 rows and the other is 44929 rows.

I started running the following code to try to match the 6 columns, and it ran several minutes before crashing because R ran out of memory.

foo <-
  fuzzyjoin::fuzzy_left_join(
    x=sdtab %>% rename(USUBJIDN=ID) %>% mutate(ROWID=row_number()),
    y=d_combined,
    by=c("USUBJIDN", intersect(names(sdtab), names(d_combined))),
    match_fun=
      list(
        USUBJIDN=function(x, y) x == y,
        PARAMN=function(x, y) x == y,
        TSFM=function(x, y) abs(x/y - 1) < 0.0001,
        AVALN=function(x, y) abs(x/y - 1) < 0.0001,
        EVID=function(x, y) x == y,
        MDV=function(x, y) x == y
      )
  )

Is there a more efficient way to perform an exact join on several columns and only perform the fuzzy join on the remaining two?

Jun 11 '20 02:06 billdenney

Here is a (much less feature-rich) consideration for a faster and less memory intensive algorithm:

fuzzy_join_2 <- function(x, y, match_fun, ...) {
  idx_df <- tibble()
  for (row_x in seq_len(nrow(x))) {
    current_mask <-
      fuzzy_join_match(
        rep(TRUE, nrow(y)),
        x=x[row_x, , drop=FALSE],
        y=y,
        match_fun=match_fun,
        ...
      )
    current_idx <- which(current_mask)
    idx_df <-
      bind_rows(
        idx_df,
        if (length(current_idx)) {
          tibble(
            x=row_x,
            y=current_idx
          )
        } else {
          tibble(
            x=row_x,
            y=NA_integer_
          )
        }
      )
  }
  extra_rows <- setdiff(seq_len(nrow(y)), idx_df$y)
  bind_rows(
    idx_df,
    if (length(extra_rows)) {
      tibble(
        x=NA_integer_,
        y=extra_rows
      )
    }
  )
}

fuzzy_join_match <- function(current_mask, x, y, match_fun, ...) {
  current_name <- names(match_fun)[1]
  current_mask[current_mask] <-
    match_fun[[1]](
      x[[current_name]],
      y[[current_name]][current_mask],
      ...
    )
  current_mask[is.na(current_mask)] <- FALSE
  if (any(current_mask) & length(match_fun) > 1) {
    current_mask <- fuzzy_join_match(current_mask, x, y, match_fun[-1], ...)
  }
  current_mask
}

Jun 11 '20 03:06 billdenney

Do you have some examples and benchmarks on your proposal? I recently tried fuzzyjoin on a small (31k and 19700) dataset, exact matching on 1 column, one <= and one >= match, but always hit out of memory 😢

Apr 14 '21 15:04 espinielli

@espinielli, I have not significantly tested my proposal. The example should be similar to my initial example (just giving an x and y along with appropriate match_fun values.

Apr 14 '21 19:04 billdenney

fuzzyjoin fuzzyjoin copied to clipboard

Speed Up When Simple Join Works for Most Columns

fuzzyjoin
fuzzyjoin copied to clipboard