StringDistances.jl icon indicating copy to clipboard operation
StringDistances.jl copied to clipboard

NaN (or ArgumentError) from QGram distances for short strings

Open robertfeldt opened this issue 2 years ago • 4 comments
trafficstars

Some QGram distances today return NaN if one of the input strings are shorter than the q-gram length (q) while others don't:

julia> using StringDistances

julia> isnan(Cosine(2)("", "bb"))
true

julia> isnan(Cosine(2)("a", "bb"))
true

julia> Jaccard(2)("a", "bb")
1.0

julia> filter(d -> isnan(d(2)("", "bb")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
3-element Vector{DataType}:
 Cosine
 Overlap
 MorisitaOverlap

Maybe it is better to have a consistent behaviour for such inputs? Returning an ArgumentError might be better and then the caller has to decide how to handle such situations.

robertfeldt avatar Dec 20 '22 08:12 robertfeldt

Note that it goes even deeper since all but QGram returns NaN if both inputs are shorter than q:

julia> filter(d -> isnan(d(2)("", "")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
6-element Vector{DataType}:
 Cosine
 Jaccard
 Overlap
 SorensenDice
 MorisitaOverlap
 NMD

julia> QGram(1)("", "")
0

julia> QGram(2)("a", "b")
0

robertfeldt avatar Dec 20 '22 08:12 robertfeldt

I am not sure there is an issue with the current implementation. The way I think about it is that there is a formula for each Qgram distance (given in the docs), which is valid even when the set of qgrams is empty. In some distances, the length of qgrams appears in the denominator, which is why the distance returns NaN when the set of qgrams is empty.

matthieugomez avatar Dec 22 '22 14:12 matthieugomez

Mathematically I agree.

I see dangers in actual use but ok, people will have to handle it themselves. I guess a simple solution might be to just highlight somewhere in the documentation that one can add a safe evaluate method like so:

julia> function safeevaluate(D::Union{Cosine, Overlap, MorisitaOverlap}, s1, s2)
           length(s1) >= D.q && length(s2) >= D.q && return(evaluate(D, s1, s2))
           throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
       end
safeevaluate (generic function with 1 method)

julia> D = Cosine(2)
Cosine(2)

julia> @assert safeevaluate(D, "aa", "bb") == evaluate(D, "aa", "bb")

julia> safeevaluate(D, "", "bb")
ERROR: ArgumentError: An argument is shorter than q (2): "", "bb"
Stacktrace:
 [1] safeevaluate(D::Cosine, s1::String, s2::String)
   @ Main ./REPL[2]:3
 [2] top-level scope
   @ REPL[5]:1

julia> function safeevaluate(D::Union{Jaccard, SorensenDice, NMD}, s1, s2)
           (length(s1) >= D.q || length(s2) >= D.q) && return(evaluate(D, s1, s2))
           throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
       end
safeevaluate (generic function with 2 methods)

julia> safeevaluate(Jaccard(2), "", "")
ERROR: ArgumentError: An argument is shorter than q (2): "", ""
Stacktrace:
 [1] safeevaluate(D::Jaccard, s1::String, s2::String)
   @ Main ./REPL[11]:3
 [2] top-level scope
   @ REPL[12]:1

robertfeldt avatar Dec 22 '22 15:12 robertfeldt

Sure. I'm also open to change things — following what other libraries typically do in this case. I will leave this issue open.

matthieugomez avatar Dec 22 '22 16:12 matthieugomez