stringdist icon indicating copy to clipboard operation
stringdist copied to clipboard

Allow for q-gram based distances with multiple q's

Open markvanderloo opened this issue 9 years ago • 3 comments

For example

stringdist("hello","world",method="cosine", q=1:2)

would yield the cosine distance over the concatenation of 1-gram and 2-gram profiles.

This would also enhance compatibility, e.g. with the textcat package.

markvanderloo avatar Dec 07 '15 13:12 markvanderloo

Somewhat related: would be nice if you could get chisq distances on counts of qgram occurrences within strings.

wdkrnls avatar May 23 '20 15:05 wdkrnls

How is the chisq distance defined?

markvanderloo avatar May 23 '20 15:05 markvanderloo

I've seen it defined two ways:

  1. the Chisq test statistic itself [0,Inf]
  2. the cummulative density function applied at the Chisq test statistic [0,1]

This appears to be heavily used in Machine Learning applications: https://www.geeksforgeeks.org/chi-square-distance-in-python/

Considering that all of my favorite metrics already in stringdist are defined between [0,1] I prefer (2).

Here is some code I was using to explore it:

freq_table = function(a, b) {
  # verify that the two sequences are plausibly comparable
  u = sort(union(a, b))
  n = length(u)
  x = tabulate(factor(a, levels = u), nbins = n)
  y = tabulate(factor(b, levels = u), nbins = n)
  f = rbind(x, y)
  rownames(f) <- c("a", "b")
  colnames(f) <- u
  f
}

chisq_distance = function(a, b) {
    freq = freq_table(a, b)
    A = freq[1,,drop=TRUE]
    B = freq[2,,drop=TRUE]
    E = 0.5*(A + B)
    V = (A - B)^2
    k = ncol(freq)
    stat = sum(V/E)
    pchisq(stat, df = k)
}

I was also thinking you could similarly define a more flexible "Binomial distance" for which you would have to additionally specify the expected error rate. This allows you to declare strings as being close (near 0) even if they have several "typos".

binomial_distance = function(theta) {
  function(a, b) {
    freq = freq_table(a, b)
    err  = sum(abs(diff(freq)))
    n    = sum(freq)
    pbinom(err - 1, size = n, prob = theta)
  }
}

I was running this with qgrams like:

explode = function(x) {
  n = nrow(x)
  lapply(seq(n), function(i) {
  r = x[i,,drop=FALSE]
  rep(colnames(r), r)
  })
}

qgram = qgrams(.list = list("the cows in the barn",
				"the plows in the barn",
				"the cows in the yard",
				"the cows in the bard"), 
		             q = 2)
xplod = explode(qgram)
binomial_distance(0.1)(xplod[[1]], xplod[[2]])

Of course, I prefer the way things are done in stringdist to these functions.

wdkrnls avatar May 25 '20 12:05 wdkrnls