hyperscan
hyperscan copied to clipboard
Literals shouldn't need to be NULL-terminated.
The documentation for hs_compile_lit() states that an expression must be NULL-terminated even though its length must be supplied. Looking though hs_compile_lit_multi_int() I see that nothing after the declared length is used by
https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hs.cpp#L362-L364
However, there follows a call to
https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hs.cpp#L373-L374
then
https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/parser/logical_combination.cpp#L143-L144
and finally
https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hs.cpp#L530-L536
Given that the documentation explicitly allows for \0 to appear in the expression it seems like there is some further bug here, perhaps the absence of a hs_expression_lit_info().
See also #205.
@rmharris Thanks for pointing this. hs_expression_lit_info() is absent. Will fix that.
I've also been bitten by this, as before it was possible to search for nul bytes using nul literals (\u0000) in literal expressions (compile_lit & compile_lit_multi), and now that is not possible anymore, and you have to resort to escaped nuls (\\x{00}) and normal expressions (compile & compile_multi) to be able to do that.
So, I'm glad to know this will be fixed.
BTW, @hongyang7, maybe the chance can be taken to clarify compile_lit and compile_lit_multi doc (https://intel.github.io/hyperscan/dev-reference/api_files.html#c.hs_compile_lit). As it currently stands, it first says that expression will be NULL-terminated:
* @param expression
* The NULL-terminated expression to parse.
[...]
But laters says even more explicitly that NULs are allowed within expression:
* @param len
* The length of the text content of the pure literal expression. As the
* text content indicated by @p expression is treated as single character
* one by one, the special terminating character `\0` should be allowed
* to appear in expression, and not treated as a terminator for a string.
* Thus, the end of a pure literal expression cannot be indicated by
* identifying `\0`, but by counting to the expression length.
That seems contradictory, and IMO it would be best to change expression doc to make it clear that NULL-termination is not needed.
Thx everybody involved for this great project.
@elmart-devo Good point. Thanks.