StringDistances.jl
StringDistances.jl copied to clipboard
String Distances in Julia
Installation
The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.
Supported Distances
String distances act over any pair of iterators that define length (e.g. AbstractStrings, GraphemeIterators, or AbstractVectors)
The available distances are:
- Edit Distances
- Hamming Distance
Hamming() <: SemiMetric - Jaro and Jaro-Winkler Distance
Jaro()JaroWinkler() <: SemiMetric - Levenshtein Distance
Levenshtein() <: Metric - Optimal String Alignment Distance (a.k.a. restricted Damerau-Levenshtein)
OptimalStringAlignment() <: SemiMetric - Damerau-Levenshtein Distance
DamerauLevenshtein() <: Metric - RatcliffObershelp Distance
RatcliffObershelp() <: SemiMetric
- Hamming Distance
- Q-gram distances compare the set of all substrings of length
qin each string (and which- QGram Distance
Qgram(q::Int) <: SemiMetric - Cosine Distance
Cosine(q::Int) <: SemiMetric - Jaccard Distance
Jaccard(q::Int) <: SemiMetric - Overlap Distance
Overlap(q::Int) <: SemiMetric - Sorensen-Dice Distance
SorensenDice(q::Int) <: SemiMetric - MorisitaOverlap Distance
MorisitaOverlap(q::Int) <: SemiMetric - Normalized Multiset Distance
NMD(q::Int) <: SemiMetric
- QGram Distance
Syntax
Following the Distances.jl package, string distances can inherit from two abstract types: StringSemiMetric <: SemiMetric or StringMetric <: Metric.
Computing the distance between two strings (or iterators)
You can always compute a certain distance between two strings using the following syntax
r = evaluate(dist, x, y)
r = dist(x, y)
Here, dist is an instance of a distance type: for example, the type for the Levenshtein distance is Levenshtein. You can compute the Levenshtein distance between x and y as
r = evaluate(Levenshtein(), x, y)
r = Levenshtein()(x, y)
The function compare returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar.
Levenshtein()("martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0
Computing the distance between two AbstractVectors of strings (or iterators)
Consider X and Y two AbstractVectors of iterators. You can compute the matrix of distances across elements, dist(X[i], Y[j]), as follows:
pairwise(dist, X, Y)
For instance,
pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
pairwise is optimized in various ways (e.g., for the case of QGram-distances, dictionary of qgrams are pre-computed)
Find closest string
The package also adds convenience functions to find elements in a iterator of strings closest to a given string
-
findnearestreturns the value and index of the element initrwith the highest similarity score withs. Its syntax is:indnearest(s, itr, dist) -
findallreturns the indices of all elements initrwith a similarity score withshigher than a minimum score. Its syntax is:indall(s, itr, dist; min_score = 0.8)
The functions findnearest and findall are particularly optimized for the Levenshtein and OptimalStringAlignment distances, as these algorithm can stop early if the distance becomes higher than a certain threshold.
fuzzywuzzy
The package also defines Distance "modifiers" that are inspired by the Python package - fuzzywuzzy. These modifiers are particularly helpful to match strings composed of multiple words (e.g. addresses, company names).
- Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word orders by returning the distance of the two strings, after re-ordering words alphabetically.
- TokenSet adjusts for differences in word orders and word numbers by returning the distance between the intersection of two strings with each string.
- TokenMax normalizes the distance, and combine the
Partial,TokenSortandTokenSetmodifiers, with penalty terms depending on string.TokenMax(Levenshtein())corresponds to the distance defined in fuzzywuzzy
Levenshtein()("this string", "this string is longer") = 10
Partial(Levenshtein())("this string", "this string is longer") = 0
Notes
- All string distances are case sensitive.