request: could stringdistmatrix return a sparse matrix of logicals
Let s_vec be some vector of N distinct strings. When N is large, stringdistmatrix grows unwieldy (it is N x N), as does the "dist" structure returned by stringdistmatrix when called with a single argument.
I would like to request a new function, similar to stringdistmatrix, that returns the information in a sparse way, for those cases where one is only interested in the pairs (i, j), i < j, satisfying a condition:
s_vec <- c("string1","string2", ...)
N <- length(s_vec)
df <- stringdist_thresh(s_vec, method="lv", thresh=2, op="<=")
The above would return a tibble with columns "i", "j", and "dist". Each row indicates that for some pair (i, j) the predicate (in this case: dist <= 2) was satisfied, with i in 1..N-1 and j in i+1..N.
"op" can be one of: "<", "<=", ">", ">=", "==", "!=".
As an example, consider:
s_vec <- c("aaa","aab", "aac","abc")
df <- stringdist_thresh(s_vec, method="lv", thresh=2L, op="<")
#> # A tibble: 4 x 3
#>   i  j  dist
#>   1  2     1
#>   1  3     1
#>   2  3     1
#>   3  4     1
Notice how the data frame lists only those pairs for which the distance satisfies lv < 2.
Hi dan,
Interesting idea. I had not thought about that yet.
A few things that come to mind while I commute: should one leave out the components with small distances (usually relatively few) or those with large distances? Which is preferred depends on the application, I suppose.
The dist structure already saves more than 50% of the space, since only the elements below the diagonal are stored (it was a nice puzzle to parallelize that, by the way).
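A quick back-of-the-envelope comparison makes that concrete (just arithmetic, not package code):

N <- 1e5
N^2             # 1e10 cells in the full N x N matrix
N * (N - 1) / 2 # ~5e9 cells stored by a "dist" object: just under half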
I will not let stringdist return a tibble. A matrix format, or something that can easily be converted to one of the formats of the Matrix package (or to a tibble), seems more flexible.
I have been thinking about a topn function, similar to gower::gower_topn. Maybe that would be even better?
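A rough sketch of what such an interface could look like (the name, signature, and return format below are just a guess, loosely modeled on gower::gower_topn; the dense stringdistmatrix call is only to illustrate the interface, not the memory savings):

library(stringdist)

# Hypothetical stringdist_topn(): for each string in x, return the n
# closest strings in table, as index/distance matrices like gower_topn
stringdist_topn <- function(x, table, n = 3, method = "lv") {
  d <- stringdistmatrix(x, table, method = method)  # length(x) x length(table)
  index    <- apply(d, 1, function(row) order(row)[seq_len(n)])
  distance <- apply(d, 1, function(row) sort(row)[seq_len(n)])
  list(index = index, distance = distance)          # both n x length(x)
}

stringdist_topn(c("aple", "banan"), c("apple", "apply", "banana"), n = 2)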
Mark, the dist structure, though lower triangular, is still O(N^2) in memory, so it won't solve the problem for large N. Before the post above I had posted (and deleted) one using R's sparse "Matrix" class, which might be preferable to you as a return class. I would love it if you could implement something like the code below in C++. Note that the Matrix class is compatible with most graph and linear-algebra algorithms in R.
library(Matrix)
library(stringdist)

stringdistsparse <- function(s_vec, method = "lv", thresh = 3,
                             cmp = c("<", ">", "<=", ">=", "==", "!="))
{
  cmp <- match.arg(cmp)       # pick one operator from the allowed set
  cmp_fn <- match.fun(cmp)    # turn the string into a binary operator
  N <- length(s_vec)
  # N x N all-zero sparse matrix; only assigned entries take space
  m <- Matrix(0L, nrow = N, ncol = N, sparse = TRUE)
  for (i in 1:(N - 1))        # too slow in R, needs C++
    for (j in (i + 1):N) {    # each unordered pair is checked once (i < j)
      d <- stringdist(s_vec[i], s_vec[j], method = method)
      if (cmp_fn(d, thresh))  # apply "cmp" to (distance, threshold)
        m[i, j] <- 1L         # or m[i, j] <- d, perhaps via a parameter?
    }
  m  # the size of m is proportional to the number of non-zero entries
}
In my case, I would call it and transform the output to a graph:
library(dplyr)
library(igraph)
library(stringdist)
s_vec <- ... # some large vector of strings, N >> 1000
m <- stringdistsparse(s_vec, method = "dl", thresh = 3, cmp = "<")
# graph_from_adjacency_matrix() accepts a sparse matrix; mode = "upper"
# builds an undirected graph from the i < j triangle filled in above
g <- graph_from_adjacency_matrix(m, mode = "upper")
# do something with the graph
Does this make sense? I ran into problems with the above when N was large, which is why I need the sparse structure.
I did do something similar once. The code I used is below. I used it to cluster the strings (using both dbscan and graph methods).
library(lvec)
library(stringdist)

a <- sample(c("jan", "pier", "tjorres", "korneel"), 1E3, replace = TRUE)
b <- sample(c("jan", "pier", "joris", "corneel"), 1E2, replace = TRUE)

# Process a in chunks, so only one chunk's worth of pairs is in memory at a time
chunks <- lvec::chunk(a, chunk_size = 1E1)
dist <- lapply(chunks, function(chunk, a, b, threshold, ...) {
  i <- seq(chunk[1], chunk[2])
  j <- seq_along(b)
  res <- expand.grid(i = i, j = j)          # all (i, j) pairs for this chunk
  res$dist <- stringdist(a[res$i], b[res$j])
  res[res$dist <= threshold, ]              # keep only pairs within the threshold
}, a = a, b = b, threshold = 2)
dist <- do.call(rbind, dist)
Result:
> head(dist)
    i j dist
8   8 1    0
12  2 2    1
16  6 2    1
22  2 3    1
26  6 3    1
32  2 4    1
I don't think this will get much faster by writing it in C, except that in principle stringdist can cut the computation short when the distance exceeds the threshold. So a max_dist option would be nice: stringdist would then return max_dist whenever the true distance is greater than or equal to max_dist.
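To make the idea concrete, here is a pure-R sketch of such an early exit (an illustration only, not stringdist's actual C code): in the dynamic-programming recursion the row minimum never decreases, so once it reaches max_dist the final distance is guaranteed to be at least max_dist.

lv_maxdist <- function(s, t, max_dist = Inf) {
  s <- strsplit(s, "")[[1]]
  t <- strsplit(t, "")[[1]]
  prev <- 0:length(t)                       # distances from "" to prefixes of t
  for (i in seq_along(s)) {
    cur <- c(i, integer(length(t)))
    for (j in seq_along(t)) {
      cost <- if (s[i] == t[j]) 0L else 1L
      cur[j + 1] <- min(cur[j] + 1L,        # insertion
                        prev[j + 1] + 1L,   # deletion
                        prev[j] + cost)     # substitution / match
    }
    if (min(cur) >= max_dist) return(max_dist)  # can only grow from here
    prev <- cur
  }
  min(prev[length(prev)], max_dist)
}
lv_maxdist("kitten", "sitting", max_dist = 2)  # 2 (the true distance, 3, is capped)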
Interesting implementation (sorry, I closed your issue accidentally; I don't know how to reopen it)!
Note that my request is not about speed (time complexity will be O(N^2) regardless) but about space, which your implementation addresses (sparse matrices have the added benefit of storing only the non-zero entries in compressed-column form, so they are even smaller than data frames). Speed would also benefit from a C++ implementation, though, as loops in R are notoriously slow.
Currently, as you know, stringdistmatrix() needs to allocate O(N^2) doubles, where N is the length of the string vector, even when it returns a "dist". When N is around 1 million that comes to terabytes of memory, and it breaks down.
Just for reference: I have recently created a package that clusters strings by edit distance using graph algorithms (compatible with sparse matrices) and stringdistmatrix, and this is how I ran into the problem: https://github.com/dan-reznik/clustringr
If we can get this C++ functionality, clustringr will become applicable to much larger sets of strings.
I think that Jan's implementation does exactly that: save space. Since stringdist quickly offloads everything to parallelized C code, I agree that we are not likely to get much speedup by implementing this in C.
Regarding the max_dist argument: this is not entirely trivial for weighted edit-based distances, and not possible (or meaningful) for the other distances.
It would also slow down the computation on short strings by a noticeable amount. The edit-distance calculation is really just some basic integer arithmetic in a nested loop, and the max_dist check would have to sit inside the loop of the dynamic-programming algorithm, where it would be one of the few statements executed.
The simplest way would be to do
a <- vec
b <- vec
and then run my code. It can be made more efficient, though: as written you would compare the first element to the second and also the second to the first, which is unnecessary. **Note: not all distances are symmetric, so stringdist(a, b) is not necessarily equal to stringdist(b, a).**
To make it more efficient you could follow the line res <- expand.grid(i=i, j=j) with
res <- res[res$j > res$i, ]
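Putting those two pieces together, a sketch of the deduplication variant of the chunked code above (assuming a symmetric distance such as plain Levenshtein, so dropping the (j, i) duplicates loses nothing):

library(lvec)
library(stringdist)

vec <- c("apple", "aple", "banan", "bananan")
chunks <- lvec::chunk(vec, chunk_size = 1E1)
dist <- lapply(chunks, function(chunk, vec, threshold, ...) {
  i <- seq(chunk[1], chunk[2])
  res <- expand.grid(i = i, j = seq_along(vec))
  res <- res[res$j > res$i, ]   # each unordered pair only once
  res$dist <- stringdist(vec[res$i], vec[res$j], method = "lv")
  res[res$dist <= threshold, ]
}, vec = vec, threshold = 2)
dist <- do.call(rbind, dist)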
However, it is even easier to use my reclin2 package:
library(reclin2)
vec <- c("apple", "aple", "banan", "bananan")
# Put the vector in a data.table, as that is what reclin2 expects
dta <- data.table(str = vec)
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = jaro_winkler(), minsim = 0)
If you want to remove duplicate entries, you can follow with
p[, selection := simsum > 0.8]
deduplicate_equivalence(p, "group", "selection")
Resulting in
       str group
1:   apple     2
2:    aple     2
3:   banan     4
4: bananan     4
Note that reclin2 uses similarity scores instead of string distances by default. It is, however, easy to provide your own:
my_dist <- function(x, y, ...) stringdist(x, y)
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = my_dist, minsim = 0)
Edit: for more info see https://cran.r-project.org/web/packages/reclin2/vignettes/deduplication.html
Dear @djvanderlaan,
I am so impressed. I have 400,000 strings to compare with each other, and your packages might be the hope I was always looking for. My apologies; while you were writing your answer, I transferred the question here: https://github.com/djvanderlaan/lvec/issues/16.
Your function is beautiful. Thank you for making art, I guess. Can I simply write the following if I want to use the Levenshtein distance from stringdist?
my_dist <- function(x, y, ...) stringdist(x, y, method = "lv")
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = my_dist, minsim = 0)
Yes, that should work. 400,000 is a lot, though; that is going to take time.
There are a couple of ways to speed things up and reduce memory use. First, by specifying minsim you can tell pair_minsim to keep only pairs with a similarity above minsim; note that in that case you have to transform the Levenshtein distance into a similarity score (higher = more similar), e.g. use -Levenshtein. Second, Levenshtein is not the fastest metric to compute; Jaro and Jaro-Winkler are much faster.
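For example, a minimal sketch of the first suggestion (the distance threshold of 2 is just an assumption for illustration):

# Negate the Levenshtein distance so it acts as a similarity:
# "distance <= 2" becomes "similarity >= -2"
my_sim <- function(x, y, ...) -stringdist(x, y, method = "lv")
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = my_sim, minsim = -2)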
Dear @djvanderlaan, I hope you are doing well. My apologies for coming back with another question!
The following works perfectly! However, when I work with 100,000 elements I get the following warning:
my_dist <- function(x, y, ...) stringdist(x, y, method = "lv")
p <- pair_minsim(dta, on = "str", deduplication = TRUE,
                 default_comparator = my_dist, minsim = 0)
Warning message:
In nx * ny : NAs produced by integer overflow
Do you have an idea how to convert these to numeric or double so the integer overflow is avoided?
Thank you for your time
@loukesio It is better to report these kinds of things at the reclin2 repo, as this has nothing to do with stringdist.
This is a bug in pair_minsim, which is fixed now. Until a new version is submitted to CRAN, you can install the most recent version from GitHub using devtools::install_github("djvanderlaan/reclin2").
Thank you @djvanderlaan for your time! Yes, you are right; I appreciate it. Have a nice day!