Fast codepointOffset
Implements codepointOffset with code from the FreeBSD project.
I'm planning to explore a vectorised implementation of the search for 2-, 3- and 4-byte codepoints, but will leave that out of the first iteration.
This may be relevant to #369, by eliminating the need to decode codepoints via Haskell.
I'm not sure why older GHCs are unable to infer the types for the tests I've added, since the types should all be trivially known (Text and Char).
Thanks @axman6! I suggest we start with splitOnChar / breakOnChar in a separate PR: first a naive implementation, tests and benchmarks, then make it fast with whatever it takes. Tackling both splitOnChar and memmem in one go feels a bit overwhelming.
Yeah, I've been working on rewriting the C to avoid going via memmem, and removing twoway_memmem would significantly reduce the amount of code to maintain. I would guess there are faster memmem implementations out there, hopefully under permissive licenses too. I'll get the changes working and push those today.
I have a suspicion that breakOnChar / splitOnChar do not mandate any additional C code at all. It might be enough to memchr the least significant byte of the UTF-8 encoding and then check manually that the other bytes match.
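A minimal sketch of that idea, assuming "least significant byte" means the last byte of the character's UTF-8 encoding. It works on a strict ByteString holding valid UTF-8 rather than on Text internals, and `findCharUtf8` is a hypothetical name, not part of any existing API. `Data.ByteString.elemIndex` is, as far as I know, backed by memchr, so no new C code is involved:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Byte offset of the first occurrence of a character in UTF-8 encoded
-- bytes. Hypothetical helper for illustration only.
findCharUtf8 :: Char -> B.ByteString -> Maybe Int
findCharUtf8 c hay = go 0
  where
    needle   = TE.encodeUtf8 (T.singleton c)  -- 1 to 4 bytes
    n        = B.length needle
    lastByte = B.last needle
    go from =
      -- Scan for the last byte of the encoding, starting at 'from'.
      case B.elemIndex lastByte (B.drop from hay) of
        Nothing -> Nothing
        Just i ->
          let end   = from + i       -- absolute index of the candidate byte
              start = end - (n - 1)  -- where the full encoding would begin
          in if start >= 0 && B.take n (B.drop start hay) == needle
               then Just start       -- all bytes match: real occurrence
               else go (end + 1)     -- false candidate, keep scanning
```

For example, searching for 'λ' (bytes CE BB) in the encoding of "лλ" first hits the BB that ends 'л' (D0 BB), rejects it because the preceding byte differs, and then finds the real match at byte offset 2. Continuation bytes are shared between many characters, so how often such false candidates occur depends on the text being searched.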
Anyways, let's separate concerns. From my perspective the first task is to add breakOnChar / splitOnChar with a naive, pure Haskell implementation. Once it is done and merged, we can discuss optimizations in a separate PR.
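One possible shape for the naive versions, built only on the existing `Data.Text.break` and `Data.Text.split`. The signatures and the choice to leave the matched character at the head of the second component in `breakOnChar` are assumptions here, not something settled in this thread:

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- | Split at the first occurrence of the character; the character itself
-- stays at the start of the second component (assumed semantics).
breakOnChar :: Char -> Text -> (Text, Text)
breakOnChar c = T.break (== c)

-- | Split on every occurrence of the character, dropping the separators.
splitOnChar :: Char -> Text -> [Text]
splitOnChar c = T.split (== c)
```

These give a baseline for tests and benchmarks that a later, faster implementation can be compared against.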
I'll try to find some time to write a Haskell-only version, and then we can think about making a faster C one later. I wonder if it's worth having both, and only moving to the C call when there's enough data to justify it.