sacremoses

2x faster MosesTokenizer

Open de9uch1 opened this issue 8 months ago • 0 comments

I have added two speed improvements:

  • Compile regex patterns.
  • Pre-define the character sets for islower() and isanyalpha().
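
As a rough illustration of these two ideas (not the actual patch), the sketch below compiles substitution patterns once at import time and builds the character sets once as `frozenset` objects, so that each call does only a cheap set operation. The rules, the character inventories, and the names `RAW_RULES`, `COMPILED_RULES`, `LOWER_SET`, `ALPHA_SET`, and `apply_rules` are hypothetical; the real tokenizer uses its own patterns and full Unicode tables.

```python
import re

# Hypothetical substitution rules, NOT the actual sacremoses patterns.
RAW_RULES = [
    (r"([,!?])", r" \1 "),  # pad some punctuation with spaces
    (r"\s+", " "),          # collapse runs of whitespace
]

# Idea 1: compile each pattern once, instead of handing the pattern
# string to re.sub() on every call.
COMPILED_RULES = [(re.compile(pattern), repl) for pattern, repl in RAW_RULES]

# Hypothetical character inventories (assumption: tiny ASCII-only tables
# for illustration; a real tokenizer would load complete Unicode classes).
LOWER_CHARS = "abcdefghijklmnopqrstuvwxyz"
ALPHA_CHARS = LOWER_CHARS + LOWER_CHARS.upper()

# Idea 2: build the sets once at import time, not inside each call.
LOWER_SET = frozenset(LOWER_CHARS)
ALPHA_SET = frozenset(ALPHA_CHARS)


def apply_rules(text):
    """Apply every precompiled substitution in order."""
    for pattern, repl in COMPILED_RULES:
        text = pattern.sub(repl, text)
    return text.strip()


def islower(text):
    """True if every character of text is in the predefined lowercase set."""
    return LOWER_SET.issuperset(text)


def isanyalpha(text):
    """True if at least one character of text is in the predefined alphabetic set."""
    return not ALPHA_SET.isdisjoint(text)
```

Python's `re` module keeps an internal cache of recently compiled patterns, so the gain from compiling up front depends on how many distinct patterns are in play and how often they are looked up; the measurements below are for the actual change, not for this sketch.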

Before:

Benchmark 1: python -m sacremoses -l en -j 1 tokenize < big.txt
  Time (mean ± σ):     14.799 s ±  0.875 s    [User: 16.009 s, System: 0.047 s]
  Range (min … max):   13.994 s … 16.786 s    10 runs

After:

Benchmark 1: python -m sacremoses -l en -j 1 tokenize < big.txt
  Time (mean ± σ):      7.669 s ±  0.653 s    [User: 8.313 s, System: 0.054 s]
  Range (min … max):    5.934 s …  8.252 s    10 runs

Unit test results:

python -m unittest sacremoses/test/test_*
..........................s............
----------------------------------------------------------------------
Ran 39 tests in 5.028s

OK (skipped=1)

de9uch1, May 26 '24 09:05