Question about stringdist()

Open JackGuo15 opened this issue 1 year ago • 2 comments

Hi Mark,

I hope you're doing well. My name is Ruohan, and I'm a second-year PhD student at UCL. I'm currently using the stringdist package to measure linguistic distance, and I’ve found it incredibly useful. However, I’ve encountered a few issues that I’m hoping you can help clarify.

I’ve been working through the R manual for stringdist (https://cran.r-project.org/web/packages/stringdist/stringdist.pdf), which describes on page 20 how the individual edit operations (deletion, insertion, substitution, transposition, in that order) can be weighted. For example, stringdist('ab', 'ba', weight=c(1,1,1,0.5)) returns 0.5, which suggests that a single transposition was applied.

Building on this example, I tried the following cases:

  1. stringdist('ab', 'a', weight=c(0.5, 1, 1, 1))
     I expected 0.5 from a weighted deletion, but the output was 1.
  2. stringdist('ab', 'a', weight=c(1, 0.5, 1, 1))
     This returned 0.5, which seems to indicate an insertion rather than a deletion.
  3. stringdist('a', 'ab', weight=c(0.5, 1, 1, 1))
     This returned 0.5, which seems to indicate a weighted deletion.

Given these results, I’m wondering if I might have misunderstood the string distance calculation. Specifically, I assumed that stringdist('ab', 'a') would attempt to match 'ab' to 'a' by deleting a character, while stringdist('a', 'ab') would result in an insertion. Could you clarify how the algorithm determines whether to apply an insertion or deletion in these cases?
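For anyone who wants to rerun the three cases side by side, here is a minimal sketch. It assumes the stringdist package is installed and relies on the default method ("osa"). The variable names are only illustrative, and the values in the comments are the ones reported in this issue, not independently verified.

```r
# Reproduction of the three cases above; assumes stringdist is installed.
library(stringdist)

# Weight order per the manual: deletion, insertion, substitution, transposition.
w_del <- c(0.5, 1, 1, 1)  # only the deletion weight is reduced
w_ins <- c(1, 0.5, 1, 1)  # only the insertion weight is reduced

d1 <- stringdist("ab", "a", weight = w_del)  # reported in this issue: 1
d2 <- stringdist("ab", "a", weight = w_ins)  # reported in this issue: 0.5
d3 <- stringdist("a", "ab", weight = w_del)  # reported in this issue: 0.5

print(c(case1 = d1, case2 = d2, case3 = d3))
```

Swapping the argument order while keeping the weight vector fixed (case 1 versus case 3) is the comparison that shows which direction the deletion and insertion weights are applied in.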

Additionally, stringdist('abc', 'ca', method = "dl", weight = c(1, 0.1, 0.01, 0.001)) returns 0.002, which suggests that two transpositions were performed to match "abc" to "ca". Since the two strings differ in length, shouldn’t the result also include a deletion or an insertion?
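A small sketch for narrowing this down, again assuming stringdist is installed: it computes the same pair with the default unit weights and with the custom weights, under both "dl" and "osa". Only the 0.002 value comes from this issue. The other results are deliberately left without expected values.

```r
# Compare unit and custom weights for the 'abc' / 'ca' pair.
library(stringdist)

w <- c(1, 0.1, 0.01, 0.001)  # deletion, insertion, substitution, transposition

unit_dl      <- stringdist("abc", "ca", method = "dl")               # default weights (all 1)
weighted_dl  <- stringdist("abc", "ca", method = "dl", weight = w)   # reported in this issue: 0.002
weighted_osa <- stringdist("abc", "ca", method = "osa", weight = w)  # same weights, OSA method

print(c(unit_dl = unit_dl, weighted_dl = weighted_dl, weighted_osa = weighted_osa))
```

If the unit-weight distance counts an insertion or deletion that the weighted result does not seem to reflect, that would help separate a weighting question from a question about the edit path itself.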

I look forward to your insights. Thank you very much for your time.

Best wishes, Ruohan

JackGuo15, Aug 28 '24 17:08

Thank you for reporting this. Unfortunately, I am currently not in a position to look into this.

markvanderloo, Oct 29 '24 14:10

Hi Mark,

Thank you for letting me know. Have a nice day!

Best wishes, Ruohan

JackGuo15, Nov 18 '24 11:11