StringDistances.jl
StringDistances.jl copied to clipboard
NaN (or ArgumentError) from QGram distances for short strings
Some QGram distances today return NaN if one of the input strings are shorter than the q-gram length (q) while others don't:
julia> using StringDistances
julia> isnan(Cosine(2)("", "bb"))
true
julia> isnan(Cosine(2)("a", "bb"))
true
julia> Jaccard(2)("a", "bb")
1.0
julia> filter(d -> isnan(d(2)("", "bb")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
3-element Vector{DataType}:
Cosine
Overlap
MorisitaOverlap
Maybe it is better to have a consistent behaviour for such inputs? Returning an ArgumentError might be better and then the caller has to decide how to handle such situations.
Note that it goes even deeper since all but QGram returns NaN if both inputs are shorter than q:
julia> filter(d -> isnan(d(2)("", "")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
6-element Vector{DataType}:
Cosine
Jaccard
Overlap
SorensenDice
MorisitaOverlap
NMD
julia> QGram(1)("", "")
0
julia> QGram(2)("a", "b")
0
I am not sure there is an issue with the current implementation. The way I think about it is that there is a formula for each Qgram distance (given in the docs), which is valid even when the set of qgrams is empty. In some distances, the length of qgrams appears in the denominator, which is why the distance returns NaN when the set of qgrams is empty.
Mathematically I agree.
I see dangers in actual use but ok, people will have to handle it themselves. I guess a simple solution might be to just highlight somewhere in the documentation that one can add a safe evaluate method like so:
julia> function safeevaluate(D::Union{Cosine, Overlap, MorisitaOverlap}, s1, s2)
length(s1) >= D.q && length(s2) >= D.q && return(evaluate(D, s1, s2))
throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
end
safeevaluate (generic function with 1 method)
julia> D = Cosine(2)
Cosine(2)
julia> @assert safeevaluate(D, "aa", "bb") == evaluate(D, "aa", "bb")
julia> safeevaluate(D, "", "bb")
ERROR: ArgumentError: An argument is shorter than q (2): "", "bb"
Stacktrace:
[1] safeevaluate(D::Cosine, s1::String, s2::String)
@ Main ./REPL[2]:3
[2] top-level scope
@ REPL[5]:1
julia> function safeevaluate(D::Union{Jaccard, SorensenDice, NMD}, s1, s2)
(length(s1) >= D.q || length(s2) >= D.q) && return(evaluate(D, s1, s2))
throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
end
safeevaluate (generic function with 2 methods)
julia> safeevaluate(Jaccard(2), "", "")
ERROR: ArgumentError: An argument is shorter than q (2): "", ""
Stacktrace:
[1] safeevaluate(D::Jaccard, s1::String, s2::String)
@ Main ./REPL[11]:3
[2] top-level scope
@ REPL[12]:1
Sure. I'm also open to change things — following what other libraries typically do in this case. I will leave this issue open.