request: could stringdistmatrix return a sparse matrix of logicals
Let s_vec be some vector of N distinct strings. When N is large, stringdistmatrix grows unwieldy (it is N x N), as does the "dist" structure returned by stringdistmatrix when called with a single argument.
I would like to request a new function, similar to stringdistmatrix, that returns the information in a sparse way, for those cases where one is only interested in the pairs (i, j), i < j, satisfying a condition:
s_vec <- c("string1","string2", ...)
N <- length(s_vec)
df <- stringdist_thresh(s_vec, method="lv", thresh=2, op="<=")
The above would return a tibble with columns "i", "j", and "dist". Each row indicates that for some pair (i, j) the predicate (in this case: dist <= 2) was satisfied, with i in 1..N-1 and j in i+1..N.
"op" can be one of: "<", "<=", ">", ">=", "==", "!=".
As an example, consider:
s_vec <- c("aaa","aab", "aac","abc")
df <- stringdist_thresh(s_vec, method="lv", thresh=2L, op="<")
#> # A tibble: 4 x 3
#>   i  j  dist
#>   1  2     1
#>   1  3     1
#>   2  3     1
#>   3  4     1
Notice how the data frame lists only those pairs for which the distance satisfies lv < 2.
Hi dan,
Interesting idea. I had not thought about that yet.
A few things that come to mind while I commute: should one leave out the components with small distances (usually relatively few) or those with large distances? Which is preferred depends on the application, I suppose.
The dist structure already saves more than 50% of the space, since only the elements below the diagonal are stored (it was a nice puzzle to parallelize that, by the way).
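A quick back-of-the-envelope comparison makes that concrete (just arithmetic, not package code):

N <- 1e5
N^2             # 1e10 cells in the full N x N matrix
N * (N - 1) / 2 # ~5e9 cells stored by a "dist" object: just under half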
I will not let stringdist return a tibble. A matrix format, or something that can easily be converted to one of the formats of the Matrix package (or to a tibble), seems more flexible.
I have been thinking about a topn function, similar to gower::gower_topn. Maybe that would be even better?
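A rough sketch of what such an interface could look like (the name, signature, and return format below are just a guess, loosely modeled on gower::gower_topn; the dense stringdistmatrix call is only to illustrate the interface, not the memory savings):

library(stringdist)

# Hypothetical stringdist_topn(): for each string in x, return the n
# closest strings in table, as index/distance matrices like gower_topn
stringdist_topn <- function(x, table, n = 3, method = "lv") {
  d <- stringdistmatrix(x, table, method = method)  # length(x) x length(table)
  index    <- apply(d, 1, function(row) order(row)[seq_len(n)])
  distance <- apply(d, 1, function(row) sort(row)[seq_len(n)])
  list(index = index, distance = distance)          # both n x length(x)
}

stringdist_topn(c("aple", "banan"), c("apple", "apply", "banana"), n = 2)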
Mark, the dist structure, though lower triangular, is still O(N^2) in memory, so it won't solve the problem for large N. Before the post above I had posted (and deleted) one using R's sparse "Matrix" class, which might be preferable to you as a return class. I would love it if you could implement something like the code below in C++. Note that the Matrix class is compatible with most graph and linear-algebra algorithms in R.
library(Matrix)
library(stringdist)

stringdistsparse <- function(s_vec, method = "lv", thresh = 3,
                             cmp = c("<", ">", "<=", ">=", "==", "!="))
{
  cmp <- match.arg(cmp)       # pick one operator from the allowed set
  cmp_fn <- match.fun(cmp)    # turn the string into a binary operator
  N <- length(s_vec)
  # N x N all-zero sparse matrix; only assigned entries take space
  m <- Matrix(0L, nrow = N, ncol = N, sparse = TRUE)
  for (i in 1:(N - 1))        # too slow in R, needs C++
    for (j in (i + 1):N) {    # each unordered pair is checked once (i < j)
      d <- stringdist(s_vec[i], s_vec[j], method = method)
      if (cmp_fn(d, thresh))  # apply "cmp" to (distance, threshold)
        m[i, j] <- 1L         # or m[i, j] <- d, perhaps via a parameter?
    }
  m  # the size of m is proportional to the number of non-zero entries
}
In my case, I would call it and transform the output to a graph:
library(dplyr)
library(igraph)
library(stringdist)
s_vec <- ... # some large vector of strings, N >> 1000
m <- stringdistsparse(s_vec, method = "dl", thresh = 3, cmp = "<")
# graph_from_adjacency_matrix() accepts a sparse matrix; mode = "upper"
# builds an undirected graph from the i < j triangle filled in above
g <- graph_from_adjacency_matrix(m, mode = "upper")
# do something with the graph
Does this make sense? I ran into problems with the above when N was large, which is why I need the sparse structure.
I did do something similar once. The code I used is below. I used it to cluster the strings (using both dbscan and graph methods).
library(lvec)
library(stringdist)

a <- sample(c("jan", "pier", "tjorres", "korneel"), 1E3, replace = TRUE)
b <- sample(c("jan", "pier", "joris", "corneel"), 1E2, replace = TRUE)

# Process a in chunks, so only one chunk's worth of pairs is in memory at a time
chunks <- lvec::chunk(a, chunk_size = 1E1)
dist <- lapply(chunks, function(chunk, a, b, threshold, ...) {
  i <- seq(chunk[1], chunk[2])
  j <- seq_along(b)
  res <- expand.grid(i = i, j = j)          # all (i, j) pairs for this chunk
  res$dist <- stringdist(a[res$i], b[res$j])
  res[res$dist <= threshold, ]              # keep only pairs within the threshold
}, a = a, b = b, threshold = 2)
dist <- do.call(rbind, dist)
Result:
> head(dist)
    i j dist
8   8 1    0
12  2 2    1
16  6 2    1
22  2 3    1
26  6 3    1
32  2 4    1
I don't think this will get much faster by writing it in C, except that in principle stringdist can cut the computation short when the distance exceeds the threshold. So a max_dist option would be nice: stringdist would then return max_dist whenever the true distance is greater than or equal to max_dist.
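To make the idea concrete, here is a pure-R sketch of such an early exit (an illustration only, not stringdist's actual C code): in the dynamic-programming recursion the row minimum never decreases, so once it reaches max_dist the final distance is guaranteed to be at least max_dist.

lv_maxdist <- function(s, t, max_dist = Inf) {
  s <- strsplit(s, "")[[1]]
  t <- strsplit(t, "")[[1]]
  prev <- 0:length(t)                       # distances from "" to prefixes of t
  for (i in seq_along(s)) {
    cur <- c(i, integer(length(t)))
    for (j in seq_along(t)) {
      cost <- if (s[i] == t[j]) 0L else 1L
      cur[j + 1] <- min(cur[j] + 1L,        # insertion
                        prev[j + 1] + 1L,   # deletion
                        prev[j] + cost)     # substitution / match
    }
    if (min(cur) >= max_dist) return(max_dist)  # can only grow from here
    prev <- cur
  }
  min(prev[length(prev)], max_dist)
}
lv_maxdist("kitten", "sitting", max_dist = 2)  # 2 (the true distance, 3, is capped)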
Interesting implementation (sorry, I closed your issue accidentally; I don't know how to reopen it)!
Note that my request is not about speed (time complexity will be O(N^2) regardless) but about space, which your implementation addresses (sparse matrices have the added benefit of storing only the non-zero entries in compressed-column form, so they are even smaller than data frames). Speed would also benefit from a C++ implementation, though, as loops in R are notoriously slow.
Currently, as you know, stringdistmatrix() needs to allocate O(N^2) doubles, where N is the length of the string vector, even when it returns a "dist". When N is around 1 million that comes to terabytes of memory, and it breaks down.
Just for reference: I have recently created a package that clusters strings by edit distance using graph algorithms (compatible with sparse matrices) and stringdistmatrix, and this is how I ran into the problem: https://github.com/dan-reznik/clustringr
If we can get this C++ functionality, clustringr will become applicable to much larger sets of strings.
I think that Jan's implementation does exactly that: save space. Since stringdist quickly offloads everything to parallelized C code, I agree that we are not likely to get much speedup by implementing this in C.
Regarding the max_dist argument: this is not entirely trivial for weighted edit-based distances, and not possible (or meaningful) for the other distances.
It would also slow down the computation on short strings by a noticeable amount. The edit-distance calculation is really just some basic integer arithmetic in a nested loop, and the max_dist check would have to sit inside the loop of the dynamic-programming algorithm, where it would be one of the few statements executed.
The simplest way would be to do
a <- vec
b <- vec
and then run my code. It can be made more efficient, though: as written you would compare the first element to the second and also the second to the first, which is unnecessary. **Note: not all distances are symmetric, so stringdist(a, b) is not necessarily equal to stringdist(b, a).**
To make it more efficient you could follow the line res <- expand.grid(i=i, j=j) with
res <- res[res$j > res$i, ]
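Putting those two pieces together, a sketch of the deduplication variant of the chunked code above (assuming a symmetric distance such as plain Levenshtein, so dropping the (j, i) duplicates loses nothing):

library(lvec)
library(stringdist)

vec <- c("apple", "aple", "banan", "bananan")
chunks <- lvec::chunk(vec, chunk_size = 1E1)
dist <- lapply(chunks, function(chunk, vec, threshold, ...) {
  i <- seq(chunk[1], chunk[2])
  res <- expand.grid(i = i, j = seq_along(vec))
  res <- res[res$j > res$i, ]   # each unordered pair only once
  res$dist <- stringdist(vec[res$i], vec[res$j], method = "lv")
  res[res$dist <= threshold, ]
}, vec = vec, threshold = 2)
dist <- do.call(rbind, dist)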
However, it is even easier to use my reclin2 package:
library(reclin2)
vec <- c("apple", "aple", "banan", "bananan")
# Put the vector in a data.table, as that is what reclin2 expects
dta <- data.table(str = vec)
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = jaro_winkler(), minsim = 0)
If you want to remove duplicate entries, you can follow with
p[, selection := simsum > 0.8]
deduplicate_equivalence(p, "group", "selection")
Resulting in
       str group
1:   apple     2
2:    aple     2
3:   banan     4
4: bananan     4
Note that reclin2 uses similarity scores instead of string distances by default. It is, however, easy to provide your own:
my_dist <- function(x, y, ...) stringdist(x, y)
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = my_dist, minsim = 0)
Edit: for more info see https://cran.r-project.org/web/packages/reclin2/vignettes/deduplication.html
Dear @djvanderlaan,
I am so impressed. I have 400,000 strings to compare with each other, and your packages might be the hope I was always looking for. My apologies; while you were writing your answer, I transferred the question here: https://github.com/djvanderlaan/lvec/issues/16.
Your function is beautiful. Thank you for making art, I guess. Can I simply write the following if I want to use the Levenshtein distance from stringdist?
my_dist <- function(x, y, ...) stringdist(x, y, method = "lv")
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = my_dist, minsim = 0)
Yes, that should work. 400,000 is a lot, though; that is going to take time.
There are a couple of ways to speed things up and reduce memory use. First, by specifying minsim you can tell pair_minsim to keep only pairs with a similarity above minsim; note that in that case you have to transform the Levenshtein distance into a similarity score (higher = more similar), e.g. use -Levenshtein. Second, Levenshtein is not the fastest metric to compute; Jaro and Jaro-Winkler are much faster.
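For example, a minimal sketch of the first suggestion (the distance threshold of 2 is just an assumption for illustration):

# Negate the Levenshtein distance so it acts as a similarity:
# "distance <= 2" becomes "similarity >= -2"
my_sim <- function(x, y, ...) -stringdist(x, y, method = "lv")
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = my_sim, minsim = -2)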
Dear @djvanderlaan, I hope you are doing well. My apologies for coming back with another question!
The following works perfectly! However, when I work with 100,000 elements I get the following warning:
my_dist <- function(x, y, ...) stringdist(x, y, method = "lv")
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = my_dist, minsim = 0)
Warning message:
In nx * ny : NAs produced by integer overflow
Do you have an idea how to convert these to numeric or double so the integer overflow is avoided?
Thank you for your time
@loukesio It is better to report these kinds of things at the reclin2 repo, as this has nothing to do with stringdist.
This is a bug in pair_minsim, which is fixed now. Until a new version is submitted to CRAN, you can install the most recent version from GitHub using devtools::install_github("djvanderlaan/reclin2").
Thank you @djvanderlaan for your time! Yes, you are right; I appreciate it. Have a nice day!