strutil
Golang metrics for calculating string similarity and other string utility functions
strutil provides a collection of string metrics for calculating string similarity as well as
other string utility functions.
Full documentation can be found at https://pkg.go.dev/github.com/adrg/strutil.
Installation
go get github.com/adrg/strutil
String metrics
- Hamming
- Levenshtein
- Jaro
- Jaro-Winkler
- Smith-Waterman-Gotoh
- Sorensen-Dice
- Jaccard
- Overlap Coefficient
The package defines the StringMetric interface, which is implemented by all
the string metrics. The interface is used with the Similarity function, which
calculates the similarity between the specified strings, using the provided
string metric.
type StringMetric interface {
	Compare(a, b string) float64
}

func Similarity(a, b string, metric StringMetric) float64
All defined string metrics can be found in the metrics package.
Hamming
Calculate similarity.
similarity := strutil.Similarity("text", "test", metrics.NewHamming())
fmt.Printf("%.2f\n", similarity) // Output: 0.75
Calculate distance.
ham := metrics.NewHamming()
fmt.Printf("%d\n", ham.Distance("one", "once")) // Output: 2
More information and additional examples can be found on pkg.go.dev.
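The classical Hamming distance is only defined for strings of equal length, yet the distance example above compares "one" with "once". The sketch below shows one way to arrive at the same numbers, assuming that every rune by which the longer string exceeds the shorter one counts as a difference and that similarity is normalized by the longer length. The function names are hypothetical and this is not necessarily how the package implements the metric.

```go
package main

import "fmt"

// hammingDistance counts the positions at which a and b differ. As an
// assumption of this sketch, each extra rune in the longer string also
// counts as one difference.
func hammingDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) > len(rb) {
		ra, rb = rb, ra // make ra the shorter string
	}
	distance := len(rb) - len(ra)
	for i := range ra {
		if ra[i] != rb[i] {
			distance++
		}
	}
	return distance
}

// hammingSimilarity normalizes the distance by the length of the longer
// string, so identical strings score 1.
func hammingSimilarity(a, b string) float64 {
	longest := len([]rune(a))
	if lb := len([]rune(b)); lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(hammingDistance(a, b))/float64(longest)
}

func main() {
	fmt.Println(hammingDistance("one", "once"))             // Output: 2
	fmt.Printf("%.2f\n", hammingSimilarity("text", "test")) // Output: 0.75
}
```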
Levenshtein
Calculate similarity using default options.
similarity := strutil.Similarity("graph", "giraffe", metrics.NewLevenshtein())
fmt.Printf("%.2f\n", similarity) // Output: 0.43
Configure edit operation costs.
lev := metrics.NewLevenshtein()
lev.CaseSensitive = false
lev.InsertCost = 1
lev.ReplaceCost = 2
lev.DeleteCost = 1
similarity := strutil.Similarity("make", "Cake", lev)
fmt.Printf("%.2f\n", similarity) // Output: 0.50
Calculate distance.
lev := metrics.NewLevenshtein()
fmt.Printf("%d\n", lev.Distance("graph", "giraffe")) // Output: 4
More information and additional examples can be found on pkg.go.dev.
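To make the default figures above easier to check by hand, here is a minimal sketch of the textbook Levenshtein distance with every edit operation costing 1. It assumes that similarity is 1 minus the distance divided by the longer length, which is consistent with the 0.43 shown for "graph" and "giraffe" (1 - 4/7). The function names are hypothetical, and configurable costs and case folding are left out.

```go
package main

import "fmt"

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the minimum number of single-rune insertions,
// deletions and replacements (each costing 1) needed to turn a into b.
// It keeps only two rows of the dynamic programming table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into b[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning a[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			replace := 1
			if ra[i-1] == rb[j-1] {
				replace = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+replace)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// levenshteinSimilarity normalizes the distance by the longer length.
func levenshteinSimilarity(a, b string) float64 {
	longest := len([]rune(a))
	if lb := len([]rune(b)); lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Println(levenshtein("graph", "giraffe"))                    // Output: 4
	fmt.Printf("%.2f\n", levenshteinSimilarity("graph", "giraffe")) // Output: 0.43
}
```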
Jaro
similarity := strutil.Similarity("think", "tank", metrics.NewJaro())
fmt.Printf("%.2f\n", similarity) // Output: 0.78
More information and additional examples can be found on pkg.go.dev.
Jaro-Winkler
similarity := strutil.Similarity("think", "tank", metrics.NewJaroWinkler())
fmt.Printf("%.2f\n", similarity) // Output: 0.80
More information and additional examples can be found on pkg.go.dev.
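The Jaro and Jaro-Winkler figures above can be reproduced with the textbook definitions. The sketch below assumes the usual match window of max(len)/2 - 1, a common-prefix cap of 4 runes and a scaling factor of 0.1 for the Winkler adjustment. These are the commonly cited defaults, not values confirmed by this README, and the function names are hypothetical.

```go
package main

import "fmt"

// jaro returns the Jaro similarity of a and b, between 0 and 1.
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 && len(rb) == 0 {
		return 1
	}
	if len(ra) == 0 || len(rb) == 0 {
		return 0
	}
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	window := longest/2 - 1
	if window < 0 {
		window = 0
	}
	// Find runes that match within the window, each used at most once.
	matchedA := make([]bool, len(ra))
	matchedB := make([]bool, len(rb))
	matches := 0
	for i := range ra {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(rb) {
			hi = len(rb)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched runes that appear in a different order in a and b.
	outOfOrder, j := 0, 0
	for i := range ra {
		if !matchedA[i] {
			continue
		}
		for !matchedB[j] {
			j++
		}
		if ra[i] != rb[j] {
			outOfOrder++
		}
		j++
	}
	m, t := float64(matches), float64(outOfOrder/2)
	return (m/float64(len(ra)) + m/float64(len(rb)) + (m-t)/m) / 3
}

// jaroWinkler boosts the Jaro score for strings sharing a prefix of up to
// 4 runes, using a scaling factor of 0.1 (assumed defaults).
func jaroWinkler(a, b string) float64 {
	score := jaro(a, b)
	ra, rb := []rune(a), []rune(b)
	prefix := 0
	for prefix < 4 && prefix < len(ra) && prefix < len(rb) && ra[prefix] == rb[prefix] {
		prefix++
	}
	return score + float64(prefix)*0.1*(1-score)
}

func main() {
	fmt.Printf("%.3f\n", jaro("think", "tank"))        // Output: 0.783
	fmt.Printf("%.3f\n", jaroWinkler("think", "tank")) // Output: 0.805
}
```

For "think" and "tank", three runes match (t, n, k) with no transpositions, so the Jaro score is (3/5 + 3/4 + 1)/3. The shared one-rune prefix "t" then raises it by 0.1 * (1 - score).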
Smith-Waterman-Gotoh
Calculate similarity using default options.
swg := metrics.NewSmithWatermanGotoh()
similarity := strutil.Similarity("times roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.82
Customize gap penalty and substitution function.
swg := metrics.NewSmithWatermanGotoh()
swg.CaseSensitive = false
swg.GapPenalty = -0.1
swg.Substitution = metrics.MatchMismatch{
	Match:    1,
	Mismatch: -0.5,
}
similarity := strutil.Similarity("Times Roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.96
More information and additional examples can be found on pkg.go.dev.
Sorensen-Dice
Calculate similarity using default options.
sd := metrics.NewSorensenDice()
similarity := strutil.Similarity("time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.62
Customize n-gram size.
sd := metrics.NewSorensenDice()
sd.CaseSensitive = false
sd.NgramSize = 3
similarity := strutil.Similarity("Time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.53
More information and additional examples can be found on pkg.go.dev.
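The figures above are consistent with counting character n-grams as multisets, where a repeated n-gram counts as many times as it occurs, and computing 2 * shared / (total in a + total in b). The sketch below works under that assumption with hypothetical function names. It is not taken from the package, and its handling of inputs shorter than the n-gram size is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// countNgrams maps each n-gram of s (n runes long) to its number of
// occurrences and also returns the total number of n-grams.
func countNgrams(s string, n int) (map[string]int, int) {
	counts := make(map[string]int)
	if n < 1 {
		return counts, 0
	}
	runes := []rune(s)
	total := 0
	for i := 0; i+n <= len(runes); i++ {
		counts[string(runes[i:i+n])]++
		total++
	}
	return counts, total
}

// sorensenDice returns 2*shared/(totalA+totalB) over n-gram multisets. It
// returns 0 when neither input yields an n-gram, a choice made only to keep
// this sketch short.
func sorensenDice(a, b string, n int, caseSensitive bool) float64 {
	if !caseSensitive {
		a, b = strings.ToLower(a), strings.ToLower(b)
	}
	countsA, totalA := countNgrams(a, n)
	countsB, totalB := countNgrams(b, n)
	if totalA+totalB == 0 {
		return 0
	}
	shared := 0
	for gram, inA := range countsA {
		if inB := countsB[gram]; inB < inA {
			shared += inB
		} else {
			shared += inA
		}
	}
	return 2 * float64(shared) / float64(totalA+totalB)
}

func main() {
	// Bigrams: 10 shared out of 17 + 15, so 20/32.
	fmt.Printf("%.3f\n", sorensenDice("time to make haste", "no time to waste", 2, true)) // Output: 0.625
	// Trigrams, case-insensitive: 8 shared out of 16 + 14, so 16/30.
	fmt.Printf("%.3f\n", sorensenDice("Time to make haste", "no time to waste", 3, false)) // Output: 0.533
}
```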
Jaccard
Calculate similarity using default options.
j := metrics.NewJaccard()
similarity := strutil.Similarity("time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.45
Customize n-gram size.
j := metrics.NewJaccard()
j.CaseSensitive = false
j.NgramSize = 3
similarity := strutil.Similarity("Time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.36
The Jaccard examples use the same input as the Sorensen-Dice examples because the two metrics are closely related. In fact, each coefficient can be calculated from the other.
Sorensen-Dice to Jaccard.
J = SD/(2-SD)
Jaccard to Sorensen-Dice.
SD = 2*J/(1+J)
where SD is the Sorensen-Dice coefficient and J is the Jaccard index.
More information and additional examples can be found on pkg.go.dev.
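The two conversion formulas can be checked against the example outputs in the Sorensen-Dice and Jaccard sections. The following sketch uses hypothetical helper names and feeds in the rounded values printed in those examples, so the results agree only to two decimal places.

```go
package main

import "fmt"

// jaccardFromDice applies J = SD/(2-SD).
func jaccardFromDice(sd float64) float64 {
	return sd / (2 - sd)
}

// diceFromJaccard applies SD = 2*J/(1+J).
func diceFromJaccard(j float64) float64 {
	return 2 * j / (1 + j)
}

func main() {
	// Rounded Sorensen-Dice outputs from the examples: 0.62 and 0.53.
	fmt.Printf("%.2f\n", jaccardFromDice(0.62)) // Output: 0.45
	fmt.Printf("%.2f\n", jaccardFromDice(0.53)) // Output: 0.36
	// Rounded Jaccard output from the default example: 0.45.
	fmt.Printf("%.2f\n", diceFromJaccard(0.45)) // Output: 0.62
}
```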
Overlap Coefficient
Calculate similarity using default options.
oc := metrics.NewOverlapCoefficient()
similarity := strutil.Similarity("time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.67
Customize n-gram size.
oc := metrics.NewOverlapCoefficient()
oc.CaseSensitive = false
oc.NgramSize = 3
similarity := strutil.Similarity("Time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.57
More information and additional examples can be found on pkg.go.dev.
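The overlap coefficient divides the number of shared n-grams by the size of the smaller n-gram collection. The sketch below assumes that n-grams are counted as multisets of runes, which reproduces the two outputs above. The function names are hypothetical, and the package itself may handle edge cases such as very short inputs differently.

```go
package main

import (
	"fmt"
	"strings"
)

// countNgrams maps each n-gram of s (n runes long) to its number of
// occurrences and also returns the total number of n-grams.
func countNgrams(s string, n int) (map[string]int, int) {
	counts := make(map[string]int)
	if n < 1 {
		return counts, 0
	}
	runes := []rune(s)
	total := 0
	for i := 0; i+n <= len(runes); i++ {
		counts[string(runes[i:i+n])]++
		total++
	}
	return counts, total
}

// overlapCoefficient returns shared/min(totalA, totalB) over n-gram
// multisets. It returns 0 when either input has no n-grams, which is a
// simplification chosen for this sketch.
func overlapCoefficient(a, b string, n int, caseSensitive bool) float64 {
	if !caseSensitive {
		a, b = strings.ToLower(a), strings.ToLower(b)
	}
	countsA, totalA := countNgrams(a, n)
	countsB, totalB := countNgrams(b, n)
	smaller := totalA
	if totalB < smaller {
		smaller = totalB
	}
	if smaller == 0 {
		return 0
	}
	shared := 0
	for gram, inA := range countsA {
		if inB := countsB[gram]; inB < inA {
			shared += inB
		} else {
			shared += inA
		}
	}
	return float64(shared) / float64(smaller)
}

func main() {
	// Bigrams: 10 shared, smaller collection has 15.
	fmt.Printf("%.2f\n", overlapCoefficient("time to make haste", "no time to waste", 2, true)) // Output: 0.67
	// Trigrams, case-insensitive: 8 shared, smaller collection has 14.
	fmt.Printf("%.2f\n", overlapCoefficient("Time to make haste", "no time to waste", 3, false)) // Output: 0.57
}
```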
References
For more information see:
- Hamming distance
- Levenshtein distance
- Jaro-Winkler distance
- Smith-Waterman algorithm
- Sorensen-Dice coefficient
- Jaccard index
- Overlap coefficient
Contributing
Contributions in the form of pull requests, issues, or just general feedback
are always welcome.
See CONTRIBUTING.MD.
License
Copyright (c) 2019 Adrian-George Bostan.
This project is licensed under the MIT license. See LICENSE for more details.