re2j
re2j copied to clipboard
Resolve all groups when performing match
By default, RE2J does not resolve all the groups when performing a match operation, rather it only resolves groups when Matcher.group(*) is called. This results in input being processed twice. Once during find() and then again during group(*). This increase the latency when Matcher.find() and Matcher.group(*) are called in succession.
This Pull Request attempts to reduce the latency by half by resolving the groups when Matcher.find() is called, such that groups can be served from cache when Matcher.group(*) is called.
We can clearly see from the following benchmark tests that when resolveGroups == true (When the groups are resolved during find()) and when successMatch==ture (When the patterns are matched) for binary data we are seeing 0.609 ms/op vs 1.147 ms/op, providing 47% latency gain. Similarly, when the successMatch==false (When there are no matching patterns) we are only seeing 4% regression (0.332 ms/op vs 0.319 ms/op). When string is used as the input, resolving groups during find() seems to outperform in all cases.
Benchmark (binary) (impl) (resolveGroups) (successMatch) Mode Cnt Score Error Units
BenchmarkSubMultiMatch.findDomains true JDK true true avgt 5 1.636 ± 0.095 ms/op
BenchmarkSubMultiMatch.findDomains true JDK true false avgt 5 1.338 ± 0.036 ms/op
BenchmarkSubMultiMatch.findDomains true JDK false true avgt 5 1.539 ± 0.052 ms/op
BenchmarkSubMultiMatch.findDomains true JDK false false avgt 5 1.336 ± 0.039 ms/op
BenchmarkSubMultiMatch.findDomains true RE2J true true avgt 5 0.609 ± 0.023 ms/op
BenchmarkSubMultiMatch.findDomains true RE2J true false avgt 5 0.332 ± 0.008 ms/op
BenchmarkSubMultiMatch.findDomains true RE2J false true avgt 5 1.147 ± 0.071 ms/op
BenchmarkSubMultiMatch.findDomains true RE2J false false avgt 5 0.319 ± 0.007 ms/op
BenchmarkSubMultiMatch.findDomains false JDK true true avgt 5 1.430 ± 0.311 ms/op
BenchmarkSubMultiMatch.findDomains false JDK true false avgt 5 1.192 ± 0.031 ms/op
BenchmarkSubMultiMatch.findDomains false JDK false true avgt 5 1.339 ± 0.064 ms/op
BenchmarkSubMultiMatch.findDomains false JDK false false avgt 5 1.196 ± 0.009 ms/op
BenchmarkSubMultiMatch.findDomains false RE2J true true avgt 5 0.610 ± 0.022 ms/op
BenchmarkSubMultiMatch.findDomains false RE2J true false avgt 5 0.329 ± 0.004 ms/op
BenchmarkSubMultiMatch.findDomains false RE2J false true avgt 5 1.100 ± 0.014 ms/op
BenchmarkSubMultiMatch.findDomains false RE2J false false avgt 5 0.337 ± 0.039 ms/op
As there are use cases only Matcher.find() being called, and in such cases resolving groups can be an overkill, therefore I have enabled this optimization via a flag so we can tap into this optimization only when it's needed.
Please provide your feedback on the pull request. I'm happy to make necessary changes to get this optimization merged.
Any feedback on this pull request? Can we get this merged to the master?