str_metrics
StrMetrics
Ruby gem (native extension in Rust) providing implementations of various string metrics. Currently supported metrics are: Sørensen–Dice, Levenshtein, Damerau–Levenshtein, Jaro & Jaro–Winkler. Any string that can be converted to a UTF-8 representation is supported. All string comparison is done at the grapheme cluster level as described by Unicode Standard Annex #29; results may therefore differ from those of gems that compare bytes or code points. See the Known compatibility section below for tested Ruby and Rust versions.
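To illustrate what grapheme-cluster-level comparison means, here is a small plain-Ruby sketch that does not use the gem at all. It relies on `String#grapheme_clusters` (available in Ruby 2.5 and later), and `grapheme_units` is a hypothetical helper name introduced only for this example; it is not part of the str_metrics API.

```ruby
# An e-acute written as a base letter plus a combining accent (U+0065 U+0301).
decomposed = "e\u0301"
# The same visible character as one precomposed code point (U+00E9).
precomposed = "\u00E9"

puts decomposed.codepoints.length          # 2 code points
puts decomposed.grapheme_clusters.length   # 1 grapheme cluster
puts precomposed.grapheme_clusters.length  # 1 grapheme cluster

# Hypothetical helper (not part of str_metrics): split a string into the
# user-perceived characters that a grapheme-level metric would compare.
def grapheme_units(str)
  str.encode('UTF-8').grapheme_clusters
end

puts grapheme_units('abc').inspect         # ["a", "b", "c"]
```

A metric that counts code points would see the decomposed form as two units, while a grapheme-level metric sees one.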
Getting Started
Prerequisites
Install Rust (tested with version >= 1.47.0) with:
curl https://sh.rustup.rs -sSf | sh
Known compatibility
Ruby
3.1, 3.0, 2.7, 2.6, 2.5, 2.4, 2.3, jruby, truffleruby
Rust
1.60.0, 1.59.0, 1.58.1, 1.57.0, 1.56.1, 1.55.0, 1.54.0, 1.53.0, 1.52.1, 1.51.0, 1.50.0, 1.49.0, 1.48.0, 1.47.0
Platforms
Linux, macOS, Windows
Installation
With bundler
Add this line to your application's Gemfile:
gem 'str_metrics'
And then execute:
$ bundle install
Without bundler
$ gem install str_metrics
Usage
To use the metrics provided by this gem, require str_metrics:
require 'str_metrics'
Each metric is shown below with an example & meanings of optional parameters.
Sørensen–Dice
StrMetrics::SorensenDice.coefficient('abc', 'bcd', ignore_case: false)
=> 0.5
Options:
| Keyword | Type | Default | Description |
|---|---|---|---|
| ignore_case | boolean | false | Case insensitive comparison? |
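For intuition about where the 0.5 above comes from, the following is a plain-Ruby sketch of the textbook Sørensen–Dice coefficient over bigrams of grapheme clusters. It is not the gem's Rust implementation; the names `bigrams` and `dice_coefficient` are hypothetical, and returning 0.0 for strings too short to form a bigram is an assumption that may not match the gem.

```ruby
# Illustrative sketch only; not the gem's implementation.
# Adjacent pairs of grapheme clusters, e.g. 'abc' -> ["ab", "bc"].
def bigrams(str)
  str.grapheme_clusters.each_cons(2).map(&:join)
end

def dice_coefficient(a, b)
  x = bigrams(a)
  y = bigrams(b)
  # Assumption: strings too short to form a bigram score 0.0.
  return 0.0 if x.empty? || y.empty?

  # Multiset intersection: each bigram in y can be matched at most once.
  counts = Hash.new(0)
  y.each { |gram| counts[gram] += 1 }
  shared = 0
  x.each do |gram|
    next unless counts[gram] > 0

    counts[gram] -= 1
    shared += 1
  end

  2.0 * shared / (x.length + y.length)
end

puts dice_coefficient('abc', 'bcd') # 0.5
```

Here 'abc' and 'bcd' each yield two bigrams and share one ("bc"), giving 2 * 1 / (2 + 2) = 0.5.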
Levenshtein
StrMetrics::Levenshtein.distance('abc', 'acb', ignore_case: false)
=> 2
Options:
| Keyword | Type | Default | Description |
|---|---|---|---|
| ignore_case | boolean | false | Case insensitive comparison? |
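As a rough illustration of why 'abc' and 'acb' are two edits apart, here is a plain-Ruby sketch of the classic dynamic-programming Levenshtein distance over grapheme clusters. It is an assumption-laden stand-in rather than the gem's Rust code, and the method name `levenshtein` is hypothetical.

```ruby
# Illustrative sketch only; not the gem's implementation.
def levenshtein(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters

  # prev[j] is the distance between the prefix of s processed so far
  # and the first j units of t.
  prev = (0..t.length).to_a
  s.each_with_index do |sc, i|
    curr = [i + 1]
    t.each_with_index do |tc, j|
      cost = sc == tc ? 0 : 1
      curr << [
        prev[j + 1] + 1, # deletion
        curr[j] + 1,     # insertion
        prev[j] + cost   # substitution (or match)
      ].min
    end
    prev = curr
  end
  prev.last
end

puts levenshtein('abc', 'acb') # 2
```

Plain Levenshtein has no transposition operation, so swapping 'b' and 'c' costs two substitutions.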
Damerau–Levenshtein
StrMetrics::DamerauLevenshtein.distance('abc', 'acb', ignore_case: false)
=> 1
Options:
| Keyword | Type | Default | Description |
|---|---|---|---|
| ignore_case | boolean | false | Case insensitive comparison? |
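The difference from plain Levenshtein is that swapping two adjacent units counts as a single edit. The sketch below shows the restricted variant (often called optimal string alignment) in plain Ruby. Whether the gem uses this restricted variant or the unrestricted Damerau–Levenshtein algorithm is not stated here, so treat this as an assumption; `osa_distance` is a hypothetical name.

```ruby
# Illustrative sketch only; this is the restricted (optimal string
# alignment) variant, which may differ from what the gem computes.
def osa_distance(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters

  # d[i][j] is the distance between the first i units of s
  # and the first j units of t.
  d = Array.new(s.length + 1) { Array.new(t.length + 1, 0) }
  (0..s.length).each { |i| d[i][0] = i }
  (0..t.length).each { |j| d[0][j] = j }

  (1..s.length).each do |i|
    (1..t.length).each do |j|
      cost = s[i - 1] == t[j - 1] ? 0 : 1
      best = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
      # Adjacent transposition: the last two units are swapped.
      if i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]
        best = [best, d[i - 2][j - 2] + 1].min
      end
      d[i][j] = best
    end
  end
  d[s.length][t.length]
end

puts osa_distance('abc', 'acb') # 1
```

With the transposition rule, 'abc' to 'acb' is a single swap of 'b' and 'c'.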
Jaro
StrMetrics::Jaro.similarity('abc', 'aac', ignore_case: false)
=> 0.7777777777777777
Options:
| Keyword | Type | Default | Description |
|---|---|---|---|
| ignore_case | boolean | false | Case insensitive comparison? |
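To show how a value such as 0.777... arises, here is a plain-Ruby sketch of the standard Jaro similarity over grapheme clusters. It is not the gem's Rust implementation; the method name `jaro` is hypothetical, and the handling of empty strings (1.0 when both are empty, 0.0 when only one is) is an assumption.

```ruby
# Illustrative sketch only; not the gem's implementation.
def jaro(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters
  # Assumption: two empty strings are identical; one empty string matches nothing.
  return 1.0 if s.empty? && t.empty?
  return 0.0 if s.empty? || t.empty?

  # Units match only if they are equal and at most `window` positions apart.
  window = [[s.length, t.length].max / 2 - 1, 0].max
  s_matched = Array.new(s.length, false)
  t_matched = Array.new(t.length, false)
  matches = 0

  s.each_with_index do |sc, i|
    lo = [i - window, 0].max
    hi = [i + window, t.length - 1].min
    (lo..hi).each do |j|
      next if t_matched[j] || sc != t[j]

      s_matched[i] = true
      t_matched[j] = true
      matches += 1
      break
    end
  end
  return 0.0 if matches.zero?

  # Transpositions: half the number of matched units that are out of order.
  s_seq = s.each_index.select { |i| s_matched[i] }.map { |i| s[i] }
  t_seq = t.each_index.select { |j| t_matched[j] }.map { |j| t[j] }
  transpositions = s_seq.zip(t_seq).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / s.length + m / t.length + (m - transpositions) / m) / 3.0
end

puts jaro('abc', 'aac') # approximately 0.7778
```

For 'abc' and 'aac' the match window is 0, so only 'a' and 'c' match in place: (2/3 + 2/3 + 2/2) / 3 = 7/9.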
Jaro–Winkler
StrMetrics::JaroWinkler.similarity('abc', 'aac', ignore_case: false, prefix_scaling_factor: 0.1, prefix_scaling_bonus_threshold: 0.7)
=> 0.7999999999999999
StrMetrics::JaroWinkler.distance('abc', 'aac', ignore_case: false, prefix_scaling_factor: 0.1, prefix_scaling_bonus_threshold: 0.7)
=> 0.20000000000000007
Options:
| Keyword | Type | Default | Description |
|---|---|---|---|
| ignore_case | boolean | false | Case insensitive comparison? |
| prefix_scaling_factor | decimal | 0.1 | Constant scaling factor for how much to weight common prefixes. Should not exceed 0.25. |
| prefix_scaling_bonus_threshold | decimal | 0.7 | Prefix bonus weighting will only be applied if the Jaro similarity is greater than the given value. |
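The way the two prefix options interact can be sketched with the standard Winkler adjustment, shown here in plain Ruby on top of an already computed Jaro similarity. The helper name `winkler_adjust` is hypothetical, and capping the common prefix at 4 units follows the usual textbook definition; the gem's exact behavior is assumed, not confirmed.

```ruby
# Illustrative sketch only; not the gem's implementation.
# Takes a precomputed Jaro similarity and applies the Winkler prefix bonus.
def winkler_adjust(jaro_similarity, a, b,
                   prefix_scaling_factor: 0.1,
                   prefix_scaling_bonus_threshold: 0.7)
  # No bonus unless the Jaro similarity is greater than the threshold.
  return jaro_similarity unless jaro_similarity > prefix_scaling_bonus_threshold

  # Length of the common prefix, capped at 4 units (textbook assumption).
  pairs = a.grapheme_clusters.zip(b.grapheme_clusters).first(4)
  prefix_length = pairs.take_while { |x, y| x == y }.length

  jaro_similarity +
    prefix_length * prefix_scaling_factor * (1.0 - jaro_similarity)
end

jaro_abc_aac = 7.0 / 9 # Jaro similarity of 'abc' and 'aac'
similarity = winkler_adjust(jaro_abc_aac, 'abc', 'aac')
puts similarity       # approximately 0.8
puts 1.0 - similarity # approximately 0.2, the corresponding distance
```

With a prefix of at most 4 units, a scaling factor above 0.25 could push the result past 1.0, which is why the factor should not exceed 0.25.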
Motivation
The main motivation was to have a central gem which can provide a variety of string metric calculations. Secondary motivation was to experiment with writing a native extension in Rust (instead of C).
Development
Getting started
gem install bundler
git clone https://github.com/anirbanmu/str_metrics.git
cd ./str_metrics
bundle install
Building (for native component)
rake rust_build
Testing (will build native component before running tests)
rake spec
Local installation
rake install
Deploying a new version
To deploy a new version of the gem to rubygems:
- Bump version in version.rb according to SemVer.
- Get your code merged to the main branch.
- After a git pull on the main branch:
rake build && rake release
Authors
See the list of repo contributors on GitHub.
Versioning
SemVer is employed. See tags for released versions.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/anirbanmu/str_metrics.
Code of Conduct
Everyone interacting in this project's codebase, issue trackers, etc. is expected to follow the code of conduct.
License
This project is licensed under the MIT License; see the LICENSE file for details.