hyperscan
Is it possible to support pure literal or regex parsing on a per pattern basis?
For users with both regex and pure literal patterns, it would be more convenient if hs_compile_multi() could take additional per-pattern information indicating whether each pattern is a regex or a pure literal. Is there a chance that this feature will be supported in the future?
Thanks for your concern.
Could you please explain more clearly the motivation for making such a distinction? What does "convenience" mean here?
Thanks for the reply.
Suppose a user has both regex patterns and literal patterns that they would like to compile into a single database (because they have only one type of network traffic). In this case, they can't use the new API hs_compile_lit_multi() introduced in version 5.2.0 and need to use hs_compile_multi() instead. They then have to worry about the metacharacters in the pure literal patterns and add appropriate escaping for them. If hs_compile_multi() could take a list of patterns, each marked as either a regex or a pure literal, the user would only need to mark regex patterns with regex flags and literal patterns with literal flags, and pass them directly into the library.
I hope this explains the motivation.
Understood.
For the metacharacter conversion, have you tried the hexadecimal approach? Simply convert every character of a pure literal pattern into its hexadecimal representation; the result can be fed directly into the Hyperscan API. For example, /127.0.0.1/ becomes /\x31\x32\x37\x2E\x30\x2E\x30\x2E\x31/. The coding work could be easier than adding a backslash '\' before each metacharacter. Here is a sample for your reference: makeHex.txt
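For readers without the attachment, here is a minimal sketch of that conversion in C. The helper name literal_to_hex is hypothetical (it is not part of Hyperscan), and the attached makeHex.txt may well do it differently; this only illustrates the idea of writing each byte as \xHH.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not a Hyperscan API.
 * Returns a newly allocated, NUL-terminated string in which every byte of
 * `lit` (of length `len`) is written as \xHH. The caller must free() it.
 * Returns NULL if the allocation fails. */
char *literal_to_hex(const char *lit, size_t len)
{
    static const char digits[] = "0123456789ABCDEF";
    char *out = malloc(len * 4 + 1);   /* 4 output chars per input byte */
    if (out == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)lit[i];
        out[i * 4 + 0] = '\\';
        out[i * 4 + 1] = 'x';
        out[i * 4 + 2] = digits[c >> 4];     /* high nibble */
        out[i * 4 + 3] = digits[c & 0x0F];   /* low nibble */
    }
    out[len * 4] = '\0';
    return out;
}
```

Calling literal_to_hex("127.0.0.1", 9) yields the escaped string from the example above, which could then be placed in the expressions array alongside ordinary regex patterns.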
As for whether Hyperscan will support this distinction in the present APIs, that depends on general user requirements. Similar libraries such as PCRE also require users to do this conversion. We haven't heard strong demand for this change so far. However, we wouldn't rule out such an extension. Thanks again for the suggestion.
Thanks for the suggestion. Indeed, I use the hexadecimal representation as a workaround. I raised the question because of the new literal API introduced in 5.2.0; I just thought it could be a bit more convenient for users.
Thanks for the suggestion. We'll keep collecting feedback on this issue.
I'd like to voice my interest in this proposal as well. The ability to pass a flag HS_FLAG_LITERAL to hs_compile, so that the given string is treated as a pure literal, would make using Hyperscan much more convenient.
Hi, I'd like to add my feedback. This feature seems to be easy to implement in Hyperscan itself, but it's quite cumbersome for users to work around.
I'm developing a find-like CLI tool built on Hyperscan and I had to write my own translation function for literals. I understand that this is not a typical use of Hyperscan, but when comparing with alternatives (PCRE, RE2) it came out as the clear winner, not (only) for performance, but also for the API, documentation and ease of use.
It seems to me that a HS_FLAG_LITERAL could completely replace hs_compile_lit, unless there is some internal (performance) benefit from a purely literal database.