Getting "Invalid UTF-8 character" for UTF control characters

Open paresh-panda opened this issue 3 years ago • 6 comments

Hi, while compiling an expression that contains UTF-8 characters I get the error below. If I remove the HS_FLAG_UTF8 flag, it compiles fine. Is there any restriction on UTF-8 control characters? The pattern is "bob logged in from "

Code snippet:

    if (hs_compile(test1, HS_FLAG_DOTALL | HS_FLAG_UTF8, HS_MODE_BLOCK, NULL,
                   &database, &compile_err) != HS_SUCCESS) {
        fprintf(stderr, "ERROR: Unable to compile pattern \"%s\": %s\n",
                test1, compile_err->message);
        hs_free_compile_error(compile_err);
    }

ERROR: Unable to compile pattern "bob logged in from ": Expression is not valid UTF-8.

paresh-panda avatar Jul 20 '22 07:07 paresh-panda

Hyperscan supports UTF-8 patterns in two ways: setting the HS_FLAG_UTF8 flag, or putting the (*UTF8) control verb at the beginning of the pattern. Either way tells Hyperscan that the user intends to compile a UTF-8 pattern. However, Hyperscan can only handle the pattern as UTF-8 if the expression itself is a valid UTF-8 string.

FYI, the following code checks the UTF-8 validity of the expression body: https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/compiler/compiler.cpp#L171

hongyang7 avatar Jul 20 '22 22:07 hongyang7

Thank you for the quick response! I have added the HS_FLAG_UTF8 flag, and the UTF-8 control character gives me an invalid expression error whether it is at the beginning, middle, or end of the pattern.

if (hs_compile(test1, HS_FLAG_DOTALL|HS_FLAG_UTF8, HS_MODE_BLOCK, NULL, &database,

Can you please suggest how to proceed? Thank you!

paresh-panda avatar Jul 25 '22 13:07 paresh-panda

Can you provide the full test code, preferably as a .txt attachment? Simply copying the expression from your original question does not seem to reproduce the error.

hongyang7 avatar Jul 25 '22 17:07 hongyang7

Hi, please find the sample code in the attached txt file.

Thank you! HSPoc_cpp.txt

paresh-panda avatar Jul 26 '22 05:07 paresh-panda

Hi, can you please comment on the sample code? The first byte of the char string is the UTF-8 control character DEL.

paresh-panda avatar Jul 28 '22 04:07 paresh-panda

Hey, sorry for the late reply. Your code is fine. The error comes from a bug in our UTF-8 validity function, which wrongly treats 0x7f as an invalid one-byte UTF-8 character:

https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/parser/utf8_validate.cpp#L74-L76
The check there should be "s[i] <= 0x7f".

The first byte of your char string happens to hit this corner case. We'll push the fix soon. In the meantime, you can apply the change manually if needed.

hongyang7 avatar Jul 28 '22 21:07 hongyang7

Please refer to the latest develop branch. Commit id: 062c3906c5f95182e975462d02556558b433800b

hongyang7 avatar Oct 28 '22 08:10 hongyang7