ceja
ceja copied to clipboard
Add SIMD-accelerated APIs
This was indented as a small path upgrading from JellyFish to StringZilla to accelerate some of the slowest and frequently used string similarity measures. Along the way I've patched a few minor things.
- Hamming and Levenshtein support SIMD and buffers.
- Added docstrings for all APIs.
- Fixed non-standard 5-char indent in
functions.py
. - Upgraded PyTest for compatibility with newer Pyhton.
- Added
pkg_resources
forsetuptools
for tests.
Compared to JellyFish, StringZilla is generally at least 20% faster even on shorter strings. It is also more accurate, as JellyFish doesn't correctly handle Unicode strings. Here is a comparison table for the distance output by different packages.
Example | Jellyfish | Levenshtein | RapidFuzz | EditDistance | NLTK | StringZilla (Unicode) | StringZilla (Bytes) | |
---|---|---|---|---|---|---|---|---|
0 | apple vs aple | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | αβγδ vs αγδ | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
2 | école vs école | 1 | 2 | 2 | 2 | 2 | 2 | 3 |
3 | Schön vs Schön | 1 | 2 | 2 | 2 | 2 | 2 | 3 |
4 | 💖 vs 💗 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
5 | 𠜎 𠜱 𠝹 𠱓 vs 𠜎𠜱𠝹𠱓 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
6 | München vs Muenchen | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
7 | façade vs facade | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
8 | こんにちは世界 vs こんばんは世界 | 2 | 2 | 2 | 2 | 2 | 2 | 3 |
9 | 👩👩👧👦 vs 👨👩👧👦 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
10 | Data科学123 vs Data科學321 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
11 | 🙂🌍🚀 vs 🙂🌎✨ | 2 | 2 | 2 | 2 | 2 | 2 | 5 |
@ashvardanian - thanks for submitting this. Do you have any benchmarks that show StringZilla makes ceja faster?
@MrPowers I don't have benchmarks specific to Ceja, but have several benchmarks against Jellyfish in the StringZilla repository. There is also a Jupyter notebook to help explore the differences at stringzilla/scripts/bench_similarity.ipynb
🤗
Is there some specific benchmark you have in mind?
PS: There is also a portability issue I haven't referenced. Seems like jellyfish
builds only 65 wheels, while today PyPi expects 105 targets. StringZilla publishes all of them.