zoomerjoin

Pass weights to `jaccard_string_group()` (or speed up the function)

Open aidanhorn opened this issue 6 months ago • 6 comments

Is your feature request related to a problem? Please describe.

`jaccard_string_group()` takes too long on 25 million rows containing about 1000 dirty categories, which pare down to about 200 clean categories. It can process the vector of unique dirty strings within minutes. However, `jaccard_string_group()` does not pass a weights vector through to `cluster_fast_greedy()`, so every string in the unique vector is given equal weight, regardless of how often it occurs in the full data.
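The workaround described above, grouping only the unique dirty strings and mapping the result back to all rows, can be sketched in base R. Here `group_fun` is a hypothetical stand-in for a call such as `jaccard_string_group()` (the toy version only normalises case and whitespace so the block runs without zoomerjoin), and `counts` is the per-string frequency that this request would like to pass along as weights.

```r
# Toy data: many rows, few distinct dirty categories.
dirty <- c("Cape Town", "cape town ", "Cape Town", "Durban", "durban", "Cape Town")

# Hypothetical stand-in for jaccard_string_group(); it only normalises
# case and whitespace so that this sketch runs without zoomerjoin.
group_fun <- function(x) {
  key <- tolower(trimws(x))
  x[match(key, key)]  # first string seen for each normalised key
}

# 1. Group only the distinct strings (about 1000 instead of 25 million).
unique_dirty <- unique(dirty)
unique_clean <- group_fun(unique_dirty)

# 2. Frequency of each distinct string, in the same order as unique_dirty.
counts <- as.integer(table(factor(dirty, levels = unique_dirty)))

# 3. Map the cleaned labels back onto every row.
clean <- unique_clean[match(dirty, unique_dirty)]
```

The `match()` lookup in step 3 is cheap even on tens of millions of rows, which is why the grouping itself only needs to see the unique vector.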

Describe the solution you'd like

Please include an option to pass weights to `jaccard_string_group()`.

Describe alternatives you've considered

I have copied the function and tried to add this option (see below), but I do not have Rust installed and am not sure how to compile the package with Rust.

Additional context

jaccard_string_group <- function(   string,
                                    n_gram_width = 2,
                                    n_bands = 45,
                                    band_width = 8,
                                    threshold = .7,
                                    progress = TRUE,
                                    cluster_weights = NULL) {
  if (!requireNamespace("igraph")) {
    stop("library 'igraph' must be installed to run this function")
  }

  pairs <- rust_jaccard_join(string,
    string,
    ngram_width = n_gram_width,
    n_bands,
    band_size = band_width,
    threshold = threshold,
    progress = progress,
    seed = round(stats::runif(1, 0, 2^64))
  )

  graph <- igraph::graph_from_edgelist(pairs)
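  # igraph treats `weights` as edge weights: `cluster_weights` needs one value
  # per edge of the undirected graph (not one per input string), or NULL.
  if (utils::packageVersion("igraph") < "2.0.0") {
    fc <- igraph::fastgreedy.community(igraph::as.undirected(graph), weights = cluster_weights)
  } else {
    fc <- igraph::cluster_fast_greedy(igraph::as.undirected(graph), weights = cluster_weights)
  }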
  groups <- igraph::groups(fc)
  lookup_table <- vapply(groups, "[[", integer(1), 1)
  membership <- igraph::membership(fc)
  return(string[lookup_table[membership]])
}
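
igraph documents the `weights` argument of `cluster_fast_greedy()` as edge weights, so per-string frequencies cannot be passed through unchanged. One possible conversion, offered as an assumption and not something zoomerjoin does, is to weight each matched pair by the combined frequency of its two strings. The `pairs` and `counts` values below are made up for illustration.

```r
# Made-up example: a two-column matrix of matched string indices (an edge
# list) and the frequency of each distinct string.
pairs <- matrix(c(1, 2,
                  1, 3,
                  4, 5), ncol = 2, byrow = TRUE)
counts <- c(120, 3, 7, 40, 2)

# Assumption: an edge matters as much as the rows behind both of its endpoints.
edge_weights <- counts[pairs[, 1]] + counts[pairs[, 2]]
```

A vector built this way has one entry per row of `pairs`; it would still need to line up with the edges that remain after `igraph::as.undirected()` collapses duplicate edges.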

aidanhorn, Jul 31 '24 09:07