Add SIMD-accelerated APIs

Open ashvardanian opened this issue 1 year ago • 2 comments

This was intended as a small patch upgrading from Jellyfish to StringZilla to accelerate some of the slowest and most frequently used string similarity measures. Along the way I've patched a few minor things.

  • Hamming and Levenshtein support SIMD and buffers.
  • Added docstrings for all APIs.
  • Fixed non-standard 5-char indent in functions.py.
  • Upgraded PyTest for compatibility with newer Python.
  • Added pkg_resources for setuptools for tests.

Compared to Jellyfish, StringZilla is generally at least 20% faster, even on shorter strings. It is also more accurate, as Jellyfish doesn't correctly handle Unicode strings. Here is a comparison of the distances reported by different packages.

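|    | Example | Jellyfish | Levenshtein | RapidFuzz | EditDistance | NLTK | StringZilla (Unicode) | StringZilla (Bytes) |
|----|---------|-----------|-------------|-----------|--------------|------|-----------------------|---------------------|
| 0  | apple vs aple | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1  | αβγδ vs αγδ | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 2  | école vs école | 1 | 2 | 2 | 2 | 2 | 2 | 3 |
| 3  | Schön vs Schön | 1 | 2 | 2 | 2 | 2 | 2 | 3 |
| 4  | 💖 vs 💗 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 5  | 𠜎 𠜱 𠝹 𠱓 vs 𠜎𠜱𠝹𠱓 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 6  | München vs Muenchen | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 7  | façade vs facade | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 8  | こんにちは世界 vs こんばんは世界 | 2 | 2 | 2 | 2 | 2 | 2 | 3 |
| 9  | 👩‍👩‍👧‍👦 vs 👨‍👩‍👧‍👦 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 10 | Data科学123 vs Data科學321 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 11 | 🙂🌍🚀 vs 🙂🌎✨ | 2 | 2 | 2 | 2 | 2 | 2 | 5 |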
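
To illustrate why the "Unicode" and "Bytes" columns can differ, here is a minimal pure-Python sketch. It is not the StringZilla or Ceja implementation: `levenshtein` and `distances` are hypothetical helper names introduced only for this example, and the code assumes the byte variant operates on the UTF-8 encoding. The same textbook dynamic-programming routine is run once over code points and once over encoded bytes.

```python
def levenshtein(a, b):
    """Textbook two-row edit distance over any two sequences (str or bytes)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def distances(left, right):
    """Return (code-point distance, UTF-8 byte distance) for two strings."""
    return (
        levenshtein(left, right),
        levenshtein(left.encode("utf-8"), right.encode("utf-8")),
    )


# "\u00e7" is the precomposed "ç", written as an escape to avoid
# any ambiguity about Unicode normalization in this file.
for left, right in [("apple", "aple"), ("αβγδ", "αγδ"), ("fa\u00e7ade", "facade")]:
    print(left, "vs", right, distances(left, right))
```

Dropping "β" is one edit in code points but two in bytes, because the letter occupies two bytes in UTF-8. That is the kind of gap the last two columns show for non-ASCII rows.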

ashvardanian, Feb 19 '24 04:02

@ashvardanian - thanks for submitting this. Do you have any benchmarks that show StringZilla makes ceja faster?

MrPowers, Feb 21 '24 14:02

@MrPowers I don't have benchmarks specific to Ceja, but I have several benchmarks against Jellyfish in the StringZilla repository. There is also a Jupyter notebook to help explore the differences at stringzilla/scripts/bench_similarity.ipynb 🤗

Is there some specific benchmark you have in mind?
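
For reference, a Ceja-agnostic micro-benchmark could be as small as the sketch below. It is only an illustration: `benchmark` and `hamming` are hypothetical names, and the pure-Python `hamming` is a stand-in so the snippet runs without extra dependencies. The corresponding Jellyfish and StringZilla functions would be added to the `functions` dictionary in its place.

```python
import timeit


def hamming(a, b):
    """Pure-Python stand-in: count mismatches, plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def benchmark(functions, pairs, repeat=5, number=100):
    """Return the best observed per-call time in seconds for each function."""
    results = {}
    for name, fn in functions.items():
        def run(fn=fn):  # bind fn now to avoid the late-binding closure pitfall
            for left, right in pairs:
                fn(left, right)
        best = min(timeit.repeat(run, repeat=repeat, number=number))
        results[name] = best / (number * len(pairs))
    return results


pairs = [("apple", "aple"), ("München", "Muenchen"), ("façade", "facade")]
timings = benchmark({"hamming (pure Python)": hamming}, pairs)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1e9:.0f} ns per call")
```

Taking the minimum over several repeats reduces noise from other processes. Short pairs like these mostly measure call overhead, so longer inputs would be needed to see any SIMD benefit.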


PS: There is also a portability issue I haven't mentioned. It seems that Jellyfish builds only 65 wheels, while PyPI today expects 105 targets. StringZilla publishes all of them.

ashvardanian, Feb 22 '24 20:02