python-hyperscan

How to handle input with characters having more than one byte in UTF-8

Open mar4th3 opened this issue 1 year ago • 4 comments

Hi,

First of all, thank you for this amazing library.

While playing around with it I stumbled upon this issue.

When matching on strings containing characters that UTF-8 encodes as more than one byte, the end offset is wrong.

See for instance this example:

import hyperscan

matches = []


def match_event_handler(dbid, start, end, flags, context) -> bool | None:
    matches.append(end)


expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
)


text = "test®"
db.scan(text.encode("utf-8"), match_event_handler=match_event_handler)

print(matches)
# [5, 6]

The highest end offset is 6, but len("test®") is 5.

Is there any workaround to this? Am I misunderstanding something?

Thank you!

mar4th3 avatar Aug 29 '24 11:08 mar4th3

That is because the offsets count bytes, not characters: len(text.encode()) is 6, since text.encode() == b'test\xc2\xae'.
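A quick way to see the difference using only the standard library (a minimal sketch; the variable names are just for illustration):

```python
text = "test®"
encoded = text.encode("utf-8")

char_count = len(text)      # counts Unicode code points
byte_count = len(encoded)   # counts bytes after UTF-8 encoding

print(char_count)  # 5
print(byte_count)  # 6
print(encoded)     # b'test\xc2\xae'
```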

betterlch avatar Sep 02 '24 02:09 betterlch

Thank you for the reply! I understand that that's the reason, but is there any workaround?

Or is this a limitation of hyperscan? In the sense that you cannot get exact offsets with UTF-8.

mar4th3 avatar Sep 02 '24 09:09 mar4th3

Try adding the HS_FLAG_UTF8 flag:

expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
    flags=[hyperscan.HS_FLAG_UTF8],
)

betterlch avatar Sep 04 '24 07:09 betterlch

I'm facing the same issue. Adding the UTF-8 flag does not solve it: the matches returned by db.scan() come with wrong indexes after encountering a non-ASCII Unicode character.

For instance, if my_string = "österreich" is encoded with bytes(my_string, 'utf-8') or my_string.encode('utf-8'), the result is b'\xc3\xb6sterreich', which is 1 byte longer than the original text is in characters. The hyperscan match position index will be shifted by one to the right due to this.

The problem gets worse with kanji (Chinese characters), katakana, or hiragana (Japanese characters), which encode to 3 bytes each, so the match indexes are misplaced by 2 more for every such character encountered.
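The per-character byte counts can be checked directly. A small sketch, where the sample characters are picked purely for illustration:

```python
# Sample characters chosen for illustration: Latin-1 letters/symbols
# and one kanji, one katakana, one hiragana character.
samples = ["ö", "®", "漢", "カ", "ひ"]

# Number of bytes each character occupies once encoded as UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

print(widths)
# {'ö': 2, '®': 2, '漢': 3, 'カ': 3, 'ひ': 3}

# "österreich" has 10 characters but encodes to 11 bytes.
print(len("österreich"), len("österreich".encode("utf-8")))  # 10 11
```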

Looks like a bug that should be addressed by the internal processing of the HS_FLAG_UTF8 flag.
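
In the meantime, one possible workaround on the Python side is to translate the byte offsets back into character offsets before using them. This is a minimal sketch, assuming the offsets passed to the match callback index into the UTF-8 encoded bytes (as the outputs in this thread suggest). byte_to_char_offsets is a hypothetical helper, not part of python-hyperscan:

```python
def byte_to_char_offsets(text: str) -> list[int]:
    """Build a lookup table from UTF-8 byte offsets to character offsets.

    mapping[b] is the character offset for byte offset b in
    text.encode("utf-8"). A byte offset that falls inside a multi-byte
    character maps to the index of that character.
    """
    mapping = []
    for char_index, char in enumerate(text):
        # Every byte of this character points back at the same index.
        mapping.extend([char_index] * len(char.encode("utf-8")))
    # The offset one past the final byte maps to len(text).
    mapping.append(len(text))
    return mapping


text = "test®"
mapping = byte_to_char_offsets(text)

# Byte end offsets as reported in the first example of this thread.
byte_ends = [5, 6]
char_ends = [mapping[end] for end in byte_ends]
print(char_ends)  # [4, 5]
```

With this table, the end offset 6 from the first example maps to character offset 5, which equals len("test®"). The offset 5 falls in the middle of the two-byte "®" and therefore maps to 4, the index of that character. Building the table once per scanned text keeps each lookup in the callback to a single list access.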

LucianoBAF avatar Jan 24 '25 14:01 LucianoBAF