python-hyperscan
How to handle input with characters having more than one byte in UTF-8
Hi,
first of all thank you for this amazing library.
While playing around with it I stumbled upon this issue.
When matching on strings containing characters that UTF-8 encodes as more than one byte, the reported end offset does not line up with the character positions in the original string.
See for instance this example:
import hyperscan

matches = []

def match_event_handler(dbid, start, end, flags, context) -> bool | None:
    matches.append(end)

expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
)
text = "test®"
db.scan(text.encode("utf-8"), match_event_handler=match_event_handler)
print(matches)
# [5, 6]
The highest end offset is 6, but len("test®") is 5.
Is there any workaround to this? Am I misunderstanding something?
Thank you!
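For reference, here is a minimal sketch of the kind of workaround I have in mind, assuming the offsets reported to the callback are byte offsets into the encoded input rather than character offsets. The helper `byte_to_char_offset` is hypothetical (it is not part of python-hyperscan) and uses only the standard library:

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Hypothetical helper: map a byte offset in UTF-8 `data` to a character offset."""
    # errors="ignore" drops a trailing partial character when the
    # byte offset falls in the middle of a multi-byte sequence.
    return len(data[:byte_offset].decode("utf-8", errors="ignore"))


text = "test®"
data = text.encode("utf-8")

# "®" is one character but two bytes in UTF-8.
print(len(text), len(data))  # 5 6

# The end offsets from the example above, treated as byte offsets.
byte_offsets = [5, 6]
char_offsets = [byte_to_char_offset(data, off) for off in byte_offsets]
print(char_offsets)  # [4, 5]
```

With this mapping, offset 6 becomes 5, which agrees with len("test®"). Offset 5 falls in the middle of "®", so it maps back to 4. Compiling the pattern with Hyperscan's UTF-8 flag might avoid such mid-character matches altogether, but I have not verified that.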