How to handle input with characters having more than one byte in UTF-8
Hi,
first of all thank you for this amazing library.
While playing around with it I stumbled upon this issue.
When matching on strings containing characters that UTF-8 encodes into more than one byte, the end offset is wrong.
See for instance this example:
```python
import hyperscan

matches = []

def match_event_handler(dbid, start, end, flags, context) -> bool | None:
    matches.append(end)

expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
)
text = "test®"
db.scan(text.encode("utf-8"), match_event_handler=match_event_handler)
print(matches)
# [5, 6]
```
The highest end offset is 6, but len("test®") is 5.
Is there any workaround to this? Am I misunderstanding something?
Thank you!
Because len(text.encode()) is 6:

```python
text.encode() == b'test\xc2\xae'  # 6 bytes
```

Hyperscan scans the bytes you pass it, so the offsets it reports are byte offsets into that buffer.
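One possible workaround (my own sketch, not part of the library): convert a reported byte offset back into a character offset by decoding the prefix of the scanned buffer up to that offset.

```python
def byte_to_char_offset(encoded: bytes, byte_offset: int) -> int:
    # The length in characters of the decoded prefix is the character offset.
    # errors="ignore" truncates a byte offset that falls inside a multi-byte
    # character back to the preceding character boundary.
    return len(encoded[:byte_offset].decode("utf-8", errors="ignore"))

encoded = "test®".encode("utf-8")  # b'test\xc2\xae', 6 bytes
print(byte_to_char_offset(encoded, 6))  # 5 == len("test®")
```

This decodes O(n) bytes per match, which is fine for short inputs but adds up on long ones.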
Thank you for the reply! I understand that that's the reason, but is there any workaround? Or is this a limitation of hyperscan, in the sense that you cannot get exact offsets with UTF-8?
Try adding the HS_FLAG_UTF8 flag:
```python
expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
    flags=[hyperscan.HS_FLAG_UTF8],
)
```
I'm facing the same issue. Adding the UTF-8 flag does not solve it, and the offsets returned by db.scan() are wrong (as character indexes) after encountering a Unicode character.

For instance, if my_string = "österreich" is encoded with bytes(my_string, 'utf-8') or my_string.encode('utf-8'), the result is b'\xc3\xb6sterreich', which is one byte longer than the original string. The hyperscan match offset is shifted one position to the right because of this.

The problem gets worse with kanji (Chinese characters), katakana, or hiragana (Japanese characters), which encode to three bytes each, so the match offsets drift by two for every such character encountered.

Looks like a bug that should be addressed by the internal processing of the HS_FLAG_UTF8 flag.
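For long inputs, decoding a prefix per match is O(n) each time; a precomputed byte-to-character offset table (again just a sketch, not something the library provides) makes each lookup O(1):

```python
def build_offset_table(text: str) -> list[int]:
    # table[b] = character offset of the character containing byte b
    # in text.encode("utf-8").
    table = []
    for char_index, ch in enumerate(text):
        table.extend([char_index] * len(ch.encode("utf-8")))
    table.append(len(text))  # one-past-the-end offset
    return table

table = build_offset_table("österreich")
print(table[2])   # 1  ("ö" occupies bytes 0-1, "s" starts at byte 2)
print(table[11])  # 10 == len("österreich")
```

Build the table once per scanned string, then translate every byte offset the match handler receives through it.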