re2j
re2j copied to clipboard
Re2j runs 5x slower than java.util.regex
trafficstars
Here is my benchmark:
import org.apache.commons.lang3.RandomStringUtils;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
@Warmup(iterations = 2, time = 3)
@Measurement(iterations = 2, time = 3)
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Threads(3)
@Fork(1)
/*
* <pre>
* Benchmark Mode Cnt Score Error Units
* Re2jBenchmark.javaMatch avgt 2 175.926 ns/op
* Re2jBenchmark.re2jMatch avgt 2 1067.591 ns/op
* <pre/>
*/
public class Re2jBenchmark {
private static final String PATTERN = "[0-5][0-6]{1,3}[1-7]{1,2}[2580]{1,3}[127]{0,8}";
private static final Pattern p1 = Pattern.compile(PATTERN);
private static final com.google.re2j.Pattern p2 = com.google.re2j.Pattern.compile(PATTERN);
private List<String> inputs;
private Iterator<String> iterator;
@Setup
public void setup() {
inputs =
IntStream.rangeClosed(0, 999)
.mapToObj(n -> RandomStringUtils.randomNumeric(32))
.collect(Collectors.toList());
iterator = inputs.listIterator();
}
private String nextInput() {
if (!iterator.hasNext()) {
iterator = inputs.listIterator();
}
return iterator.next();
}
@Benchmark
public boolean javaMatch() {
return p1.matcher(nextInput()).find();
}
@Benchmark
public boolean re2jMatch() {
return p2.matcher(nextInput()).find();
}
}
here is the result:
Benchmark Mode Cnt Score Error Units
Re2jBenchmark.javaMatch avgt 2 175.926 ns/op
Re2jBenchmark.re2jMatch avgt 2 1067.591 ns/op
I get similar results (RE2 is 3x slower):
Benchmark Mode Cnt Score Error Units
Re2Benchmark.wholeWordsEndingWithNn_jdk thrpt 5 4.496 ± 0.292 ops/s
Re2Benchmark.wholeWordsEndingWithNn_re2 thrpt 5 1.582 ± 0.124 ops/s
Note: I tested on sample (16MB book) from https://github.com/rust-leipzig/regex-performance and pattern \b\w+nn\b for which RE2 excels in their benchmark.
On the other hand, RE2 is tremendously faster for regex with 100 alternations (first names) on the same sample, which is consistent with https://github.com/google/re2j#why-should-i-switch:
Benchmark Mode Cnt Score Error Units
Re2Benchmark.manyAlternation_jdk thrpt 5 0.278 ± 0.002 ops/s
Re2Benchmark.manyAlternations_re2 thrpt 5 111.601 ± 0.324 ops/s
Note: length of input text also matters
Code: https://github.com/RG9/rg-regex-demo/commit/209d9b4eb2c883587f4e4d8b584e6f17f62d41ee#diff-fe662066dca41f899d163250973f0db8e950953eb9b03a7d08c71f5fdb6a0d70