text Fast codepointOffset

Open axman6 opened this issue 2 years ago • 5 comments

Implements codepointOffset with code from the FreeBSD project.

I'm planning to explore a vectorised implementation of the search for code points encoded as 2, 3 and 4 bytes, but will leave that out of the first iteration.

This may be relevant to #369, by eliminating the need to decode codepoints via Haskell.

Jul 01 '22 08:07 axman6

I'm not sure why older GHCs are unable to infer the types for the tests I've added, since the types should all be trivially known (Text and Char).

Jul 01 '22 10:07 axman6

Thanks @axman6! I suggest we start with splitOnChar / breakOnChar in a separate PR. First naive implementation, tests and benchmarks, then make it fast with whatever it takes. Tackling both splitOnChar and memmem in one go feels a bit overwhelming.

Jul 01 '22 20:07 Bodigrim

Yeah I've been working on rewriting the C to avoid going via memmem, and removing the twoway_memmem would significantly reduce the amount of code to maintain. I would guess there are faster memmem implementations out there, hopefully under permissive licenses too. I'll get the changes working and push those today.

Jul 02 '22 00:07 axman6

I have a suspicion that breakOnChar / splitOnChar does not mandate any additional C code at all. It might be enough to memchr the least significant byte of the UTF-8 encoding and then check manually that other bytes match.
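A minimal sketch of that idea in C, for illustration only. It assumes the input is valid UTF-8 and reads "least significant byte" as the last byte of the encoding. The names `utf8_encode` and `find_codepoint` are hypothetical and are not part of text or of this PR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode code point cp as UTF-8 into out.
   Returns the number of bytes written (1..4), or 0 if cp is a
   surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical search: byte offset of the first occurrence of cp in
   buf (assumed to hold valid UTF-8), or -1 if it does not occur. */
static ptrdiff_t find_codepoint(const unsigned char *buf, size_t len,
                                uint32_t cp)
{
    unsigned char enc[4];
    size_t n = utf8_encode(cp, enc);
    if (n == 0 || n > len)
        return -1;

    const unsigned char last = enc[n - 1];
    const unsigned char *end = buf + len;
    /* Start n-1 bytes in, so a hit always has room for the prefix. */
    const unsigned char *p = buf + (n - 1);

    while (p < end) {
        /* memchr does the bulk scanning for the final byte. */
        const unsigned char *hit = memchr(p, last, (size_t)(end - p));
        if (hit == NULL)
            return -1;
        /* Manually check that the preceding n-1 bytes match too. */
        const unsigned char *start = hit - (n - 1);
        if (memcmp(start, enc, n - 1) == 0)
            return start - buf;
        p = hit + 1;
    }
    return -1;
}
```

Since a UTF-8 lead byte can never be mistaken for a continuation byte, a full match of the encoded sequence in valid input always begins on a code point boundary. How often the manual check rejects a `memchr` hit depends on the data, so this would still need benchmarks.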

Anyways, let's separate concerns. From my perspective the first task is to add breakOnChar / splitOnChar with naive, pure Haskell implementation. Once it is done and merged, we can discuss optimizations in a separate PR.

Jul 02 '22 15:07 Bodigrim

I'll try and find some time to write a Haskell only version, and then we can think about making a faster C one later. I wonder if it's worth having both, and only moving to the C call when there's enough data to justify it.

Jul 16 '22 06:07 axman6
