unicode-transforms
Combine decomposability check and combining class lookup
Currently we need to do three lookups:
- is it decomposable?
- if not decomposable, is it combining?
- combining class when reordering
We can have a single lookup table storing decomposability and combining class. This will get us all the information in one memory access. We may have to store the combining class in the buffer along with the char for later use when reordering is actually done.
It can potentially speed up both NFD and NFC normalization. Whether it actually will, and by how much, remains to be seen by experimenting.
If needed, in the reorder buffer we can store each char together with its combining class in the higher-order bits of a Word32 or Word, so a simple word comparison can be used to sort.
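A minimal sketch of that packing, to make the idea concrete. The names `pack`, `unpackChar`, `unpackClass` and `reorder` are hypothetical and not part of unicode-transforms; the layout (class above bit 21, code point below) is an assumption. The sketch compares only the class bits, because canonical ordering has to keep characters of equal combining class in their original order, and comparing whole words would reorder them by code point.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.List (sortBy)
import Data.Ord (comparing)
import Data.Word (Word32, Word8)

-- | Pack a combining class into the bits above a 21-bit code point.
-- Hypothetical layout: bits 21..28 hold the class, bits 0..20 the char.
pack :: Word8 -> Char -> Word32
pack cc c = (fromIntegral cc `shiftL` 21) .|. fromIntegral (ord c)

-- | Recover the character from the low 21 bits.
unpackChar :: Word32 -> Char
unpackChar w = chr (fromIntegral (w .&. 0x1FFFFF))

-- | Recover the combining class from the high bits.
unpackClass :: Word32 -> Word8
unpackClass w = fromIntegral (w `shiftR` 21)

-- | Reorder a run of combining characters by class only.
-- 'sortBy' is stable, so equal-class characters keep their order.
reorder :: [Word32] -> [Word32]
reorder = sortBy (comparing unpackClass)
```

For example, `map unpackChar (reorder [pack 230 '\x0301', pack 220 '\x0323'])` puts U+0323 (class 220) before U+0301 (class 230). A real reorder buffer would more likely use an in-place insertion step than a list sort.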
Sounds like a good idea. We can store "is decomposable" as combining class 255, so that it all boils down to a single long byte array, ~128 KB. That still fits in the CPU cache, I believe.
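A rough sketch of what such a combined table could look like, assuming 255 is free to use as a sentinel. Everything here (`table`, `classify`, the `Info` type) is hypothetical and only illustrates the shape of the lookup; the toy table covers U+0000..U+03FF with three hand-filled entries, whereas a real one would be generated from the Unicode Character Database.

```haskell
import Data.Array.Unboxed (UArray, accumArray, (!))
import Data.Char (ord)
import Data.Word (Word8)

-- | Sentinel meaning "has a decomposition" (assumes no real class is 255).
decomposable :: Word8
decomposable = 255

-- | Toy table: one byte per code point, default 0 (starter).
table :: UArray Int Word8
table = accumArray (\_ new -> new) 0 (0, 0x3FF)
  [ (0x00C0, decomposable)  -- U+00C0 decomposes to 'A' + U+0300
  , (0x0300, 230)           -- combining grave accent
  , (0x0323, 220)           -- combining dot below
  ]

-- | What a single table read tells us about a character.
data Info = Decomposable | Starter | Combining Word8
  deriving (Eq, Show)

-- | One lookup answers all three questions from the list above.
classify :: Char -> Info
classify c
  | i > 0x3FF         = Starter       -- outside the toy table's range
  | v == decomposable = Decomposable
  | v == 0            = Starter
  | otherwise         = Combining v
  where
    i = ord c
    v = table ! i
```

With this shape, `isDecomposable`, `isCombining` and `getCombiningClass` could each become a thin wrapper over the same read, and the normalizer could later switch to calling the combined lookup once per character.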
Do you want to try this out? It will be exciting to see where we can go with this.
I can probably migrate getCombiningClass, isCombining and isDecomposable to lookup in the same bytearray, but would not have time to go further and switch Data.Unicode.Internal.NormalizeStream to a single lookup.
I can try that. You can push your changes to a branch in this repo, and we can collaborate there.