hyperscan icon indicating copy to clipboard operation
hyperscan copied to clipboard

Performance Edge Case

Open stefan-baranoff opened this issue 5 years ago • 1 comments

We have found an interesting edge case in HyperScan 5.0.0 and 5.3.0 performance we are trying to understand it a bit better, form better patterns, and understand if there is a "bug" in play here or not.

The initial culprit pattern was (either streaming or block is equally bad): /\*{5}(\w+\*){20}/Hs

The corpus: A simple HTTP response containing * 32000 times (**********...) in the body; the repeated character is one of the keys.


Removing this one pattern provided a HUGE speedups (10X in a set of about 115 patterns). In attempting to troubleshoot with patbench we discovered that simpler variations are even worse: /\*{5}(\w+\*){20}/Hs => 10X speedup /\*{5}\w/Hs => 30X speedup


Using the simpler variations we found the \w is likely partially the culprit here:

/\*{5}[a-zA-Z]/Hs => not slow to start with, patbench does not pick this regex /\*{5}[a-zA-Z0123]/Hs => not slow to start with /\*{5}[a-zA-Z01234]/Hs => 27X slower than the above

This extends to adding non-numerics to the character class: /\*{5}[a-zA-Z><)(/Hs => fast /\*{5}[a-zA-Z><)("/Hs => slow


This was all initially tested in the context of a larger (115) pattern set, but even isolating the single pattern: /\*{5}[a-zA-Z01234]/Hs against the a PCAP file with * 32K times => 100-150Mbps /\*{5}[a-zA-Z0123]/Hs against the a PCAP file with * 32K times => 1724Mbps


Further testing results: /\*\*[a-zA-Z01234]/Hs => 100-150Mbps /\*[a-zA-Z01234]/Hs => 3000-5000Mbps /\*\*[a-zA-Z0123]/Hs => 20000-30000Mbps


Are we looking at expected behavior? It seems that there's some triggering case that causes HyperScan to drastically change behavior/mode/underlying engine. If this is known/expected, can someone who understands that explain to a pattern writer how to avoid triggering such a case?

We're happy to provide more input if needed to help run down why this "break point" exists.

stefan-baranoff avatar Jun 29 '20 17:06 stefan-baranoff

This is related to character class expansion which will expand your pattern /\*{5}[a-zA-Z0123]/Hs to \*{5}a, \*{5}b, \*{5}c.... \*{5}3. So in this case, we treat your pattern as a literal-set and leverage a fast literal matching algorithm to match your pattern.

However, we don't expand too aggressive to avoid too many literals and changing from /\*{5}[a-zA-Z0123]/Hs to /\*{5}[a-zA-Z01234]/Hs or /\*{5}\w/Hs hits a hard threshold in our code. This causes more expensive regex matching algorithms instead of just literal matching.

/\*{5}(\w+\*){20}/Hs is harder pattern with \w+ that has to been handled by regex matching algorithms.

We'll think about workarounds to mitigate this issue.

xiangwang1 avatar Jul 08 '20 09:07 xiangwang1