Performance Edge Case
We have found an interesting edge case in HyperScan 5.0.0 and 5.3.0 performance we are trying to understand it a bit better, form better patterns, and understand if there is a "bug" in play here or not.
The initial culprit pattern was (either streaming or block is equally bad):
/\*{5}(\w+\*){20}/Hs
The corpus: A simple HTTP response containing * 32000 times (**********...) in the body; the repeated character is one of the keys.
Removing this one pattern provided a HUGE speedups (10X in a set of about 115 patterns). In attempting to troubleshoot with patbench we discovered that simpler variations are even worse:
/\*{5}(\w+\*){20}/Hs => 10X speedup
/\*{5}\w/Hs => 30X speedup
Using the simpler variations we found the \w is likely partially the culprit here:
/\*{5}[a-zA-Z]/Hs => not slow to start with, patbench does not pick this regex
/\*{5}[a-zA-Z0123]/Hs => not slow to start with
/\*{5}[a-zA-Z01234]/Hs => 27X slower than the above
This extends to adding non-numerics to the character class:
/\*{5}[a-zA-Z><)(/Hs => fast
/\*{5}[a-zA-Z><)("/Hs => slow
This was all initially tested in the context of a larger (115) pattern set, but even isolating the single pattern:
/\*{5}[a-zA-Z01234]/Hs against the a PCAP file with * 32K times => 100-150Mbps
/\*{5}[a-zA-Z0123]/Hs against the a PCAP file with * 32K times => 1724Mbps
Further testing results:
/\*\*[a-zA-Z01234]/Hs => 100-150Mbps
/\*[a-zA-Z01234]/Hs => 3000-5000Mbps
/\*\*[a-zA-Z0123]/Hs => 20000-30000Mbps
Are we looking at expected behavior? It seems that there's some triggering case that causes HyperScan to drastically change behavior/mode/underlying engine. If this is known/expected, can someone who understands that explain to a pattern writer how to avoid triggering such a case?
We're happy to provide more input if needed to help run down why this "break point" exists.
This is related to character class expansion which will expand your pattern /\*{5}[a-zA-Z0123]/Hs to \*{5}a, \*{5}b, \*{5}c.... \*{5}3. So in this case, we treat your pattern as a literal-set and leverage a fast literal matching algorithm to match your pattern.
However, we don't expand too aggressive to avoid too many literals and changing from /\*{5}[a-zA-Z0123]/Hs to /\*{5}[a-zA-Z01234]/Hs or /\*{5}\w/Hs hits a hard threshold in our code. This causes more expensive regex matching algorithms instead of just literal matching.
/\*{5}(\w+\*){20}/Hs is harder pattern with \w+ that has to been handled by regex matching algorithms.
We'll think about workarounds to mitigate this issue.