unicode-transforms

Combine decomposability check and combining class lookup

Open harendra-kumar opened this issue 5 years ago • 5 comments

Currently we need to do three lookups:

  • is it decomposable?
  • if not decomposable:
    • is it combining?
    • combining class when reordering

We can have a single lookup table storing decomposability and combining class. This will get us all the information in one memory access. We may have to store the combining class in the buffer along with the char for later use when reordering is actually done.

It can potentially speed up both NFD and NFC normalization. Whether it actually does, and by how much, has to be determined by experiment.

harendra-kumar, May 08 '20 01:05

If needed, the reorder buffer can store each char together with its combining class in the higher-order bits, packed into a Word32 or Word, so that a simple word comparison can be used to sort.
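A minimal sketch of that packing, assuming a Word32 with the code point in the low 21 bits and the combining class above it. The names `pack`, `unpackChar`, `unpackClass` and `reorder` are hypothetical and not part of unicode-transforms. One caveat: comparing the whole word would break ties between equal-class marks by code point, whereas canonical ordering keeps such marks in their original relative order, so this sketch compares only the class bits and relies on a stable sort.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.List (sortBy)
import Data.Ord (comparing)
import Data.Word (Word8, Word32)

-- Hypothetical helpers for illustration only.
-- A code point needs at most 21 bits, so the 8-bit combining class
-- fits in bits 21..28 of a Word32.
pack :: Word8 -> Char -> Word32
pack cc c = (fromIntegral cc `shiftL` 21) .|. fromIntegral (ord c)

unpackChar :: Word32 -> Char
unpackChar w = chr (fromIntegral (w .&. 0x1FFFFF))

unpackClass :: Word32 -> Word8
unpackClass w = fromIntegral (w `shiftR` 21)

-- Reorder a run of combining marks by class. Data.List.sortBy is stable,
-- so marks with the same class keep their original relative order.
reorder :: [(Word8, Char)] -> String
reorder = map unpackChar . sortBy (comparing unpackClass) . map (uncurry pack)
```

For example, `reorder [(230, '\x0301'), (220, '\x0323'), (230, '\x0300')]` moves U+0323 to the front and leaves the two class-230 marks in their input order.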

harendra-kumar, May 08 '20 01:05

Sounds like a good idea. We can store "is decomposable" as combining class 255, so that it all boils down to a single long byte array, ~128 KB. Still fits in CPU cache, I believe.
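A rough sketch of how one byte per character could be decoded under that scheme, assuming 255 is never used as a real combining class. `CharInfo`, `decode`, `lookupByte` and `charInfo` are hypothetical names, and `lookupByte` is a toy stand-in for the real byte array that covers only three characters.

```haskell
import Data.Word (Word8)

-- Everything a single table lookup tells us about a character.
data CharInfo
  = Decomposable      -- has a canonical decomposition
  | Starter           -- combining class 0, not decomposable
  | Combining Word8   -- non-zero combining class
  deriving (Eq, Show)

-- Sentinel byte meaning "decomposable" (assumed unused as a real class).
decomposableTag :: Word8
decomposableTag = 255

decode :: Word8 -> CharInfo
decode b
  | b == decomposableTag = Decomposable
  | b == 0               = Starter
  | otherwise            = Combining b

-- Toy stand-in for the byte array; the real table would be generated
-- from the Unicode Character Database.
lookupByte :: Char -> Word8
lookupByte c = case c of
  '\x00E9' -> decomposableTag  -- e with acute, decomposes to e + U+0301
  '\x0301' -> 230              -- COMBINING ACUTE ACCENT
  '\x0323' -> 220              -- COMBINING DOT BELOW
  _        -> 0

charInfo :: Char -> CharInfo
charInfo = decode . lookupByte
```

With this encoding a decomposable character's own combining class is not recoverable from the byte. That is presumably acceptable as long as such a character is always replaced by its decomposition before reordering.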

Bodigrim, May 08 '20 21:05

Do you want to try this out? It will be exciting to see where we can go with this.

harendra-kumar, May 10 '20 02:05

I can probably migrate getCombiningClass, isCombining and isDecomposable to look up in the same byte array, but I would not have time to go further and switch Data.Unicode.Internal.NormalizeStream to a single lookup.

Bodigrim, May 10 '20 19:05

I can try that. You can push your changes to a branch in this repo, and we can collaborate there.

harendra-kumar, May 11 '20 04:05