\b with UTF-8

Open DavidNemeskey opened this issue 4 years ago • 1 comment

Hi,

is there a way in Hyperscan to find a pattern where \b is followed by a non-ASCII character (such as \bö) in the input text? In my experience, \be matches where it should, but \bö does not. For instance,

\be matches "_ e_" and does not match "_ xe_"
\bö does not match "_ ö_", but it does match "_ xö_"

I get the same result irrespective of whether I use HS_FLAG_UTF8 or not; HS_FLAG_UCP gives an error. I could not find anything about \b being incompatible with Unicode in the documentation; in fact, the only place the docs mention HS not supporting UTF-8 or \b is in the approximate matching section, which is irrelevant to my use case.
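The results above are what one would expect if \b were evaluated with an ASCII-only notion of a word character: ö then counts as a non-word character, so a boundary exists before it only when the preceding character is a word character such as x. The sketch below illustrates this with Python's re module as a stand-in, not with Hyperscan itself; the helper name `has_match` is made up for the illustration, and whether Hyperscan follows the same rule internally is an assumption.

```python
import re

# The four inputs from the report, paired with the pattern tried on each.
CASES = [
    (r"\be", "_ e_"),
    (r"\be", "_ xe_"),
    (r"\bö", "_ ö_"),
    (r"\bö", "_ xö_"),
]


def has_match(pattern: str, text: str, ascii_only: bool) -> bool:
    """Return True if `pattern` is found anywhere in `text`.

    With ascii_only=True, \\b and \\w consider only [A-Za-z0-9_] to be
    word characters (re.ASCII). With ascii_only=False, Python's default
    Unicode-aware rules apply, so 'ö' is a word character.
    """
    flags = re.ASCII if ascii_only else 0
    return re.search(pattern, text, flags) is not None


if __name__ == "__main__":
    for pattern, text in CASES:
        ascii_result = has_match(pattern, text, ascii_only=True)
        unicode_result = has_match(pattern, text, ascii_only=False)
        print(f"{pattern!r} on {text!r}: ascii={ascii_result} unicode={unicode_result}")
```

In ASCII mode this reproduces the behaviour reported above (\bö is found in "_ xö_" but not in "_ ö_"), while the Unicode-aware mode gives the results I was hoping for.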

Thanks!

Aug 23 '21 21:08 DavidNemeskey

Can confirm, UTF-8 doesn't work for me either.

Oct 13 '21 13:10 meadofpoetry