uap-go icon indicating copy to clipboard operation
uap-go copied to clipboard

Allow a regular expression package to be replaced by another package.

Open aka-ao opened this issue 1 year ago • 3 comments

#59 Related to this PR, the regexp package is very slow. I suggest creating an interface and changing it so that processing related to regular expressions can be injected externally as a dependency.

aka-ao avatar May 14 '24 07:05 aka-ao

Since that pr, we've added lru caching to the library, which makes frequently seen user agents parse almost instantly. Is the performance with that really a problem?

I'd be open to considering faster regex implementations if it gives a large benefit, though I'd think it'd make more sense if the regex engine we use was not exposed for dependency injection - if any regex engine is missing features we need or implements a slightly different flavor of regex which ends up having observable behavior differences, that would be problematic.

dgoldstein0 avatar May 15 '24 08:05 dgoldstein0

Since that pr, we've added lru caching to the library, which makes frequently seen user agents parse almost instantly. Is the performance with that really a problem?

While that is true, it is obviously only for user agents which have been seen before and are still in the cache by their second hit. FWIW a while back a dailymotion employee kindly provided a sample of their access logs^dailymats, which should be realistic[^1], it has ~75k entries of which ~20k unique, with a very long tail of "one hit wonders", user agents seen only once and never again (this is relevant because as I learned LRU is pretty bad at one hit wonders at low sizes). On that file, when I investigated different cache algorithms I got the following hit rates at cachesize 1000 (which I understand is what you're using):

belady[^belady] (1000): 60% hit rate lru (1000): 37% hit rate s3fifo (1000): 50% hit rate sieve (1000): 48% hit rate

So even with a cache, the regex performance can be quite impactful.

With that said, rather than switching the regex library what I would recommend is looking if somebody has implemented FilteredRE2 in Go: regexes.yaml has a lot of regexes, there are more than 600 device parsers as of 0.18, a faster regex engine will not make that much of a difference in the end I think.

What FilteredRE2 does is prefilter the regexes, and then try only those which have a chance of matching[^filter]. Needing to only try 10-15 regexes is a much more significant gain than trying 600 regexes a bit faster.

An other thing you may want to look at — but this one you'll really have to bench for go specifically as regexp may or may not have this issue — is using Regexp.Match first, and only performing capture (retrieving the submatches) on matches: in both re2 and rust-regex the overhead of capture is very large (2-3x the cost of a match check), so it only takes 1-3 failures for "n match checks followed by one capture" to outperform "n captures". For this reason FilteredRE2 does not support capturing at all, after prefiltering it will only PartialMatch the selected regex: https://github.com/google/re2/blob/main/re2/filtered_re2.cc#L98-L122

[^1]: although obviously it's realistic traffic for dailymotion, a different site can have just as realistic a traffic with a different pattern [^belady]: Bélády's algorithm is a non-practical cache algorithm which can see in the future and has perfect knowledge of which entries to reject or evict, it is essentially the best that can be done at that cache size, making it a useful point of reference [^filter]: basically collecting the atoms (literal strings) each regex requires when building the filter, matching atoms to needles using something like aho-corasick, then reverse-mapping the set of matched atoms to regexes

masklinn avatar Oct 26 '24 09:10 masklinn

The standard library's regexp is already quite fast for capturing groups and is often the fastest choice when working in Go. While faster libraries like Hyperscan exist, Go lacks a direct Hyperscan-like (streaming DFA) library that also supports capturing-group offsets. Libraries like PCRE or Oniguruma can do that, but it's a toss-up whether they'll actually be faster.

I agree with masklinn BTW

amfonelic avatar Dec 30 '24 00:12 amfonelic