stringdist

Apparent inconsistency in output when a and b both have fewer than q characters

Open fabiocs8 opened this issue 1 year ago • 1 comments

Consider the following call, which correctly returns NA:

stringdist("", "XXX", method = "cos", q = 3)

However, if both a and b have nchar() < q, the output becomes 0:

stringdist("", "XX", method = "cos", q = 3)

In my view, returning NA in the second case as well would be more consistent. Does that make sense?

fabiocs8 Apr 30 '23 15:04

Thanks, this relates somewhat to #48.

In the formal definition of the q-gram distance [1], we compare two q-gram vectors whose length equals the number of all q-grams that can be created from a chosen alphabet (in our case, the UTF code table). This means that in the first case we have to compare $(0,0,\ldots, 0)$ with $(0,0,\ldots,1,0,0,\ldots,0)$. The cosine distance between these two vectors is

$$ 1 - \frac{\langle (0,0,\ldots, 0),(0,0,\ldots,1,0,0,\ldots,0)\rangle}{|(0,0,\ldots, 0)| |(0,0,\ldots,1,0,0,\ldots,0)| } =1 - \frac{0}{0\cdot 1} = \textrm{undefined} $$

In the second case, we get two zero-vectors, as none of the possible 3-grams occurs in either input string. So we have a choice: do we state that two zero-vectors are equal (in magnitude and direction) and say the distance is zero? Or do we say 'undefined', which is what we get when we fill in the equation?
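For reference, filling in the same formula for the second case, with two zero-vectors, would give:

$$ 1 - \frac{\langle (0,0,\ldots, 0),(0,0,\ldots,0)\rangle}{|(0,0,\ldots, 0)| |(0,0,\ldots,0)| } =1 - \frac{0}{0\cdot 0} = \textrm{undefined} $$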

So the main point is: in the first case we have no choice but to fill in the equation. In the second case we can detect that we have two equal q-gram vectors and use that. I admit that this is subtle.
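The distinction can be illustrated with a minimal base R sketch. This is not stringdist's internal implementation: `qgram_counts()` and `cosine_sketch()` are hypothetical helpers written only for this illustration, and base R arithmetic yields `NaN` for the undefined case, whereas the issue reports `NA` from stringdist.

```r
# Illustrative sketch only; NOT stringdist's internal implementation.
# qgram_counts() and cosine_sketch() are hypothetical helper names.

# Count the q-grams of a single string as a named numeric vector.
qgram_counts <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(setNames(numeric(0), character(0)))  # no q-gram fits
  starts <- seq_len(n - q + 1)
  tab <- table(substring(s, starts, starts + q - 1))
  setNames(as.numeric(tab), names(tab))
}

cosine_sketch <- function(a, b, q) {
  x <- qgram_counts(a, q)
  y <- qgram_counts(b, q)
  # Align both profiles on the q-grams seen in either string; all other
  # coordinates of the full q-gram vectors are zero and can be ignored.
  keys <- union(names(x), names(y))
  vx <- unname(x[keys]); vx[is.na(vx)] <- 0
  vy <- unname(y[keys]); vy[is.na(vy)] <- 0
  # Shortcut: identical profiles (including two empty ones) get distance 0.
  if (all(vx == vy)) return(0)
  # Otherwise fill in the formula; a zero norm yields 0/0, i.e. NaN.
  1 - sum(vx * vy) / (sqrt(sum(vx^2)) * sqrt(sum(vy^2)))
}

cosine_sketch("", "XXX", q = 3)  # formula path: 1 - 0 / (0 * 1)
cosine_sketch("", "XX",  q = 3)  # shortcut path: both profiles are empty
```

The equality check has to come before the division. Once the formula is evaluated with a zero norm, the result is already undefined and the information that the two profiles were equal is lost.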

Finally, the choice seems consistent with the following result (method = "qgram" measures the sum of absolute differences between q-gram profiles):

> stringdist("", "XX", method = "qgram", q = 3)
[1] 0

[1] Ukkonen, E. (1992). Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science 92, 191-211.

markvanderloo May 30 '23 11:05