sacremoses
2x faster MosesTokenizer
I have added two speed improvements:
- Compile regex patterns.
- Pre-define the character sets for islower() and isanyalpha().
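A minimal sketch of the two ideas, not the actual diff: the rule list, the character range, and the names `RAW_RULES`, `COMPILED_RULES`, `LOWERCASE_CHARS`, `ALPHA_CHARS`, `apply_rules`, `is_lower` and `is_any_alpha` are all hypothetical and chosen only for illustration. It assumes the hot path applies a fixed list of substitutions to every line and repeatedly asks whether a token is lowercase or contains any letter.

```python
import re
import unicodedata

# Hypothetical substitution rules, stored as (pattern, replacement) strings.
RAW_RULES = [
    (r"([,;:!?])", r" \1 "),  # pad punctuation with spaces
    (r"\s+", " "),            # collapse runs of whitespace
]

# Idea 1: compile each pattern once at import time and reuse the
# compiled objects, instead of handing pattern strings to re.sub()
# on every call.
COMPILED_RULES = [(re.compile(pattern), repl) for pattern, repl in RAW_RULES]


def apply_rules(text):
    """Apply every precompiled substitution in order."""
    for pattern, repl in COMPILED_RULES:
        text = pattern.sub(repl, text)
    return text.strip()


# Idea 2: build the character sets once, so each check is a set
# membership test rather than a set rebuilt on every call.
# Assumption: only code points below U+0250 are covered here to keep
# the sketch small; a real tokenizer would cover far more.
LOWERCASE_CHARS = frozenset(
    chr(cp) for cp in range(0x250) if unicodedata.category(chr(cp)) == "Ll"
)
ALPHA_CHARS = frozenset(
    chr(cp) for cp in range(0x250) if unicodedata.category(chr(cp)).startswith("L")
)


def is_lower(text):
    """True if every character is a lowercase letter."""
    return all(ch in LOWERCASE_CHARS for ch in text)


def is_any_alpha(text):
    """True if at least one character is a letter."""
    return any(ch in ALPHA_CHARS for ch in text)
```

With these definitions, `apply_rules("Hello,  world!")` returns `"Hello , world !"`, `is_lower("aBc")` is `False`, and `is_any_alpha("12a")` is `True`.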
Before:
Benchmark 1: python -m sacremoses -l en -j 1 tokenize < big.txt
Time (mean ± σ): 14.799 s ± 0.875 s [User: 16.009 s, System: 0.047 s]
Range (min … max): 13.994 s … 16.786 s 10 runs
After:
Benchmark 1: python -m sacremoses -l en -j 1 tokenize < big.txt
Time (mean ± σ): 7.669 s ± 0.653 s [User: 8.313 s, System: 0.054 s]
Range (min … max): 5.934 s … 8.252 s 10 runs
The mean wall-clock time drops from 14.799 s to 7.669 s, roughly a 1.9x speedup.
Unit test results:
python -m unittest sacremoses/test/test_*
..........................s............
----------------------------------------------------------------------
Ran 39 tests in 5.028s
OK (skipped=1)