compile regex objects ahead of time for improved perf.
Compiles regexes where appropriate to improve the performance of common operations (sub, search, match, finditer). Timeit results for a microbenchmark are below (MT1 is the original implementation without compilation, MT2 is the new implementation with compilation, included just for comparison -- this PR replaces the original implementation).
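For context, the shape of the change is roughly the following (an illustrative sketch, not the actual diff -- the class names and patterns here are simplified stand-ins): patterns that were stored as raw strings and handed to re.sub on every tokenize() call are stored as precompiled pattern objects instead, so the per-call lookup/compilation inside the re module is skipped.

```python
import re

class TokenizerStringPatterns:
    # Patterns kept as raw strings; re.sub has to look each one up in its
    # internal cache (or recompile it) on every call.
    DEDUPLICATE_SPACE = (r"\s+", " ")
    ASCII_JUNK = (r"[\x00-\x1f]", "")

    def tokenize(self, text):
        for pattern, substitution in (self.DEDUPLICATE_SPACE, self.ASCII_JUNK):
            text = re.sub(pattern, substitution, text)
        return text.strip().split()


class TokenizerCompiledPatterns:
    # Patterns compiled once at class-definition time and reused on every call.
    DEDUPLICATE_SPACE = (re.compile(r"\s+"), " ")
    ASCII_JUNK = (re.compile(r"[\x00-\x1f]"), "")

    def tokenize(self, text):
        for pattern, substitution in (self.DEDUPLICATE_SPACE, self.ASCII_JUNK):
            text = pattern.sub(substitution, text)
        return text.strip().split()


print(TokenizerCompiledPatterns().tokenize("Hello,   world \x01!"))
# ['Hello,', 'world', '!']
```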
In [1]: lines = [line.strip() for line in open('big.txt') if line.strip()][:1000]
In [2]: from sacremoses.tokenize import MosesTokenizer as MT1
In [3]: from sacremoses.tokenize2 import MosesTokenizer as MT2
In [4]: mt1, mt2 = MT1(lang='en'), MT2(lang='en')
In [5]: %timeit [mt1.tokenize(line) for line in lines]
714 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit [mt2.tokenize(line) for line in lines]
658 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As a quick note: if I replace import re with import regex as re, the same timeit microbenchmark comes in at 1.62 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). Quite a penalty just from switching the regex engine!
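If anyone wants to sanity-check that observation outside the tokenizer, something along these lines should do it (a minimal sketch, assuming the third-party regex package is installed; exact numbers will of course vary by machine):

```python
# Compare the stdlib re engine against the third-party regex engine on a
# single precompiled substitution. Requires `pip install regex`.
import timeit

SETUP = "import {mod}; pat = {mod}.compile(r'\\s+'); text = 'a  b  c   d ' * 500"
STMT = "pat.sub(' ', text)"

for mod in ("re", "regex"):
    best = min(timeit.repeat(STMT, setup=SETUP.format(mod=mod), repeat=5, number=2000))
    print(f"{mod:>5}: {best:.3f} s for 2000 subs")
```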