stringdist
stringdist copied to clipboard
Allow for q-gram based distances with multiple q's
For example
stringdist("hello","world",method="cosine", q=1:2)
would yield the cosine distance over the concatenation of 1-gram and 2-gram profiles.
This would also enhance compatibility, e.g. with the textcat package.
Somewhat related: would be nice if you could get chisq distances on counts of qgram occurrences within strings.
How is the chisq distance defined?
I've seen it defined two ways:
- the Chisq test statistic itself [0,Inf]
- the cummulative density function applied at the Chisq test statistic [0,1]
This appears to be heavily used in Machine Learning applications: https://www.geeksforgeeks.org/chi-square-distance-in-python/
Considering that all of my favorite metrics already in stringdist are defined between [0,1] I prefer (2).
Here is some code I was using to explore it:
freq_table = function(a, b) {
# verify that the two sequences are plausibly comparable
u = sort(union(a, b))
n = length(u)
x = tabulate(factor(a, levels = u), nbins = n)
y = tabulate(factor(b, levels = u), nbins = n)
f = rbind(x, y)
rownames(f) <- c("a", "b")
colnames(f) <- u
f
}
chisq_distance = function(a, b) {
freq = freq_table(a, b)
A = freq[1,,drop=TRUE]
B = freq[2,,drop=TRUE]
E = 0.5*(A + B)
V = (A - B)^2
k = ncol(freq)
stat = sum(V/E)
pchisq(stat, df = k)
}
I was also thinking you could similarly define a more flexible "Binomial distance" for which you would have to additionally specify the expected error rate. This allows you to declare strings as being close (near 0) even if they have several "typos".
binomial_distance = function(theta) {
function(a, b) {
freq = freq_table(a, b)
err = sum(abs(diff(freq)))
n = sum(freq)
pbinom(err - 1, size = n, prob = theta)
}
}
I was running this with qgrams like:
explode = function(x) {
n = nrow(x)
lapply(seq(n), function(i) {
r = x[i,,drop=FALSE]
rep(colnames(r), r)
})
}
qgram = qgrams(.list = list("the cows in the barn",
"the plows in the barn",
"the cows in the yard",
"the cows in the bard"),
q = 2)
xplod = explode(qgram)
binomial_distance(0.1)(xplod[[1]], xplod[[2]])
Of course, I prefer the way things are done in stringdist to these functions.