sacremoses

compile regex objects ahead of time for improved perf.

Open erip opened this issue 3 years ago • 1 comments

Compiles regexes where appropriate to improve performance of common operations (subs, searches, matches, finditers). Timeit results from a microbenchmark are below. MT1 is the original implementation without compilation; MT2 is the new implementation with compilation, kept side by side only for this comparison. This PR replaces the original implementation.

In [1]: lines = [line.strip() for line in open('big.txt') if line.strip()][:1000]

In [2]: from sacremoses.tokenize import MosesTokenizer as MT1

In [3]: from sacremoses.tokenize2 import MosesTokenizer as MT2

In [4]: mt1, mt2 = MT1(lang='en'), MT2(lang='en')

In [5]: %timeit [mt1.tokenize(line) for line in lines]
714 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit [mt2.tokenize(line) for line in lines]
658 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
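
For readers unfamiliar with the pattern, here is a minimal sketch of what compiling ahead of time looks like. The rule table and function names are hypothetical and are not taken from the sacremoses source; they only illustrate the general idea. Note that the re module already caches compiled patterns internally, so the saving mostly comes from skipping that cache lookup on every call, which fits the modest gain measured above.

```python
import re

# Hypothetical substitution rules, loosely in the style of a tokenizer.
# These are illustrative only and not copied from sacremoses.
RULES = [
    (r"([,;:!?])", r" \1 "),  # pad punctuation with spaces
    (r"\s+", " "),            # collapse runs of whitespace
]


def apply_rules_uncompiled(text):
    # Pattern strings are handed to re.sub on every call, so each call
    # goes through the re module's internal pattern-cache lookup.
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    return text.strip()


# Compile each pattern once, at import time.
COMPILED_RULES = [(re.compile(pattern), repl) for pattern, repl in RULES]


def apply_rules_compiled(text):
    # Reuse the pattern objects directly; no per-call lookup is needed.
    for compiled, repl in COMPILED_RULES:
        text = compiled.sub(repl, text)
    return text.strip()
```

Both functions return the same result; only the per-call overhead differs.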

erip, Jul 26 '22 15:07

A quick note: if I replace import re with import regex as re, the same timeit microbenchmark takes 1.62 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). Quite the penalty just for switching the regex engine!
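
A comparison like this can also be run outside IPython with the standard timeit module. The sketch below is only an assumption about how one might set that up: the pattern and sample text are hypothetical, it does not exercise MosesTokenizer, and the third-party regex package is treated as optional.

```python
import re
import timeit

try:
    import regex  # third-party package; optional for this sketch
except ImportError:
    regex = None

# Hypothetical pattern and input, not taken from sacremoses.
PATTERN = r"([,;:!?])"
SAMPLE = "Hello, world! How are you; fine?" * 50


def time_engine(engine, number=200):
    # Both modules expose compile() and a pattern .sub() method.
    compiled = engine.compile(PATTERN)
    return timeit.timeit(lambda: compiled.sub(r" \1 ", SAMPLE), number=number)


timings = {"re": time_engine(re)}
if regex is not None:
    timings["regex"] = time_engine(regex)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Absolute numbers from a toy pattern like this will not match the tokenizer benchmark; the sketch only shows how to swap engines under the same timing loop.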

erip, Jul 26 '22 15:07