string-dist
A Python library for calculating string distances using C extensions (with a pure Python fallback)
==========
StringDist
==========
This package provides the ``stringdist`` module, which includes functions for
calculating raw and normalized versions of the following string distance
measurements:
- Levenshtein distance
- Restricted Damerau-Levenshtein distance (a.k.a. optimal string alignment distance)
For optimal performance, the package compiles and uses a C extension module under the hood, but a Python implementation is included as well and will automatically be used if C extensions are not supported by the system (e.g. when the selected interpreter is PyPy).
Installation
To install this package, use pip::

    pip install StringDist

All Python versions >=3.3 should be supported.
Usage
To use the package, import the ``stringdist`` module and call the desired
function, passing in two strings::

    import stringdist
    stringdist.levenshtein('test', 'testing')

The available functions are as follows:

- ``levenshtein``
- ``levenshtein_norm``
- ``rdlevenshtein``
- ``rdlevenshtein_norm``
Raw distances assume that every allowed operation has a cost of ``1``.

Normalized distances are floats in the range ``[0.0, 1.0]``, where ``0.0``
always corresponds to a raw value of ``0`` and ``1.0`` always corresponds to
the length of the longer string, i.e. the biggest possible raw value.
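
To make the relationship between raw and normalized values concrete, here is a
minimal pure-Python sketch. It is not the package's implementation: the helper
names ``levenshtein_sketch`` and ``normalize_sketch`` are hypothetical, and
dividing the raw value by the length of the longer string (returning ``0.0``
for two empty strings) is an assumption inferred from the description above.

```python
def levenshtein_sketch(source, target):
    """Raw Levenshtein distance with unit costs (illustrative only)."""
    # previous[j] is the distance between the processed prefix of source
    # and target[:j]; only two rows of the table are kept in memory.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalize_sketch(raw_distance, source, target):
    """Scale a raw distance into [0.0, 1.0] (assumed normalization rule)."""
    longest = max(len(source), len(target))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 0.0
    return raw_distance / longest


raw = levenshtein_sketch('test', 'testing')
print(raw)                                       # 3
print(normalize_sketch(raw, 'test', 'testing'))  # 3 / 7, roughly 0.43
```

Turning ``'test'`` into ``'testing'`` takes three insertions, and the longer
string has seven characters, so under this assumed rule the normalized value
is 3/7.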
Note: The restricted Damerau-Levenshtein distance is not a true distance
metric because it does not satisfy the
`triangle inequality <https://en.wikipedia.org/wiki/Triangle_inequality>`_.
This makes it a poor choice for applications that involve evaluating the
similarity of more than two strings, such as clustering.
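
The triangle inequality violation can be seen with a small standalone sketch
of the optimal string alignment recurrence. The function name ``osa_sketch``
is hypothetical and the code is written for illustration only; it is not taken
from the package's source.

```python
def osa_sketch(source, target):
    """Optimal string alignment distance with unit costs (illustrative only)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert every character of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters; the restricted
            # variant never edits the swapped pair again afterwards.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


direct = osa_sketch('CA', 'ABC')
via_ac = osa_sketch('CA', 'AC') + osa_sketch('AC', 'ABC')
print(direct, via_ac)  # 3 2
```

Going from ``'CA'`` to ``'ABC'`` directly costs 3, while the detour through
``'AC'`` costs only 1 + 1 = 2, so the direct distance exceeds the sum of the
two legs.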
Bugs and Requests
Please use `GitHub Issues <https://github.com/obulkin/string-dist/issues>`_
for bugs and feature requests, checking first to make sure you're not creating
a duplicate issue.
Contributing
Pull requests are welcome. Please discuss your plans first by creating a
GitHub issue and use good coding style. For Python, this means following the
rules laid out in PEP 8 and other relevant PEPs. If in doubt, use a linter
like `Pylint <https://www.pylint.org>`_.
To run unit tests::

    git clone https://github.com/obulkin/string-dist.git {directory}
    cd {directory}
    python setup.py install
    python -m unittest -v test_stringdist

You can run tests without installing the package, but this will always cause
the Python implementation to be used, because the C variant has to be compiled
first. By the same token, any changes to the C code will require recompilation
before showing up in the tests, which can be handled by running
``python setup.py install`` again.
Contributors
- Oleg Bulkin