Proper Unicode ToLower/ToUpper

Open dhasenan opened this issue 7 years ago • 4 comments

char.ToLower and char.ToUpper do not handle all characters. Find the raw Unicode tables and turn them into proper implementations.

dhasenan avatar Apr 13 '17 02:04 dhasenan

So...

I've come up with a program that builds my own CharInfo stuff. (System.CharInfo is inaccessible due to protection level. Faugh.)

Two problems:

  1. It brings my IDE to its knees.
  2. It crashes as soon as it tries to load the static data.

It also crashes on startup if I try to run something with a switch statement containing 35,000 cases. I wonder why... It actually crashes so badly that I have to kill the terminal window.

I can resort to native code if I must. Explicitly control struct layout and enum member values, then compile the Unicode data text file into binary. Load in native code, etc.

I might be able to do something effectively identical without native code. The thing I really want, though, is O(1) lookups. How do I make that happen fast? The answer is to waste space and to shift names around.

Two embedded resources. One contains just the names. The other contains serialized CharInfo structs. I calculate how long each CharInfo member needs to be (specifically the decomposition) and pad as necessary. Two of the members are the start offset and the length of the name within the names resource.

Now I have a fixed size for the struct. I can find where in the resource the struct for a given codepoint is located in O(1) time (assuming no gaps, and I can ensure that when I compile).
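
A rough sketch of the fixed-size-record idea, in Python rather than C# just to keep it short. The record layout, the field names, and the `build`/`lookup` helpers are all made up for illustration; the only parts taken from the plan above are the padded decomposition, the start/length pair pointing into a separate names blob, and the "no gaps" assumption that makes the offset a simple multiplication.

```python
import struct

# Hypothetical record layout (not the real CharInfo):
#   upper, lower         simple case mappings (uint32 each)
#   name_start, name_len slice of the names blob (uint32, uint16)
#   decomp               decomposition, zero-padded to 4 code points (assumed maximum)
RECORD = struct.Struct("<IIIH4I")
MAX_DECOMP = 4

def build(entries):
    """Pack (name, upper, lower, decomposition) tuples for code points 0..n-1, no gaps."""
    names = bytearray()
    records = bytearray()
    for name, upper, lower, decomp in entries:
        encoded = name.encode("ascii")
        # Pad the decomposition so every record has the same size.
        padded = list(decomp) + [0] * (MAX_DECOMP - len(decomp))
        records += RECORD.pack(upper, lower, len(names), len(encoded), *padded)
        names += encoded
    return bytes(records), bytes(names)

def lookup(records, names, codepoint):
    # O(1): the record for code point N starts at byte N * RECORD.size.
    upper, lower, start, length, *decomp = RECORD.unpack_from(
        records, codepoint * RECORD.size
    )
    return {
        "name": names[start:start + length].decode("ascii"),
        "upper": upper,
        "lower": lower,
        "decomposition": [c for c in decomp if c],  # drop the zero padding
    }
```

The trade-off is the one already mentioned: every record pays for the longest decomposition, so space is wasted to get constant-time access.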

A little awkward, but it should be manageable.

Now, if I just want the upper/lower mapping, that's easier. I can binary search the relevant entries -- it's about 1300 each. I can compare that to having the full array.

I can also hard-code the common Latin subset.
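
For the mapping-only case, something like the following would probably do. It is a Python sketch, not the eventual C# code: the four table entries are a tiny hand-picked excerpt standing in for the roughly 1300 real ones, `to_upper` is a hypothetical name, and the hard-coded fast path here covers only ASCII rather than a wider Latin subset.

```python
import bisect

# Hypothetical excerpt of a sorted (code point -> uppercase) table; the real
# one would be generated from the Unicode data and hold about 1300 pairs.
UPPER_KEYS = [0x00E9, 0x03B1, 0x0431, 0x0561]
UPPER_VALUES = [0x00C9, 0x0391, 0x0411, 0x0531]

def to_upper(codepoint):
    # Fast path: ASCII is handled with arithmetic, no table access.
    if codepoint < 0x80:
        if 0x61 <= codepoint <= 0x7A:  # 'a'..'z'
            return codepoint - 0x20
        return codepoint
    # Slow path: O(log n) binary search over the sorted keys.
    i = bisect.bisect_left(UPPER_KEYS, codepoint)
    if i < len(UPPER_KEYS) and UPPER_KEYS[i] == codepoint:
        return UPPER_VALUES[i]
    return codepoint  # no entry: the character maps to itself
```

With about 1300 entries the search needs at most 11 comparisons, so the full array would mainly buy predictability rather than a large speedup.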

dhasenan avatar Apr 13 '17 05:04 dhasenan

Hrm, System.CharInfo is specific to Mono and doesn't do anything like what I want. Looking at other options in the CLR.

dhasenan avatar Apr 13 '17 05:04 dhasenan

Importing Unicode data as serialized resource files seems to work.

dhasenan avatar Apr 14 '17 15:04 dhasenan

So, we've got the serialized Unicode data. (ASCII -> binary provided approximately zero compression.) I can look up Unicode codepoint info. This takes a binary search. If I added a distance heuristic, that would probably make it faster... Also, there are probably a lot of contiguous ranges, so I could binary/linear search through a couple dozen ranges and then go with fixed offsets.
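
The contiguous-range idea could look roughly like this. It is a Python sketch under assumptions: the three ranges and their base indices are invented for the example (real ones would be computed from the data), and `record_index` is a hypothetical helper that returns a position in the packed record table.

```python
import bisect

# Hypothetical contiguous ranges: (first code point, last code point,
# index of the first record for that range in the packed table).
RANGES = [
    (0x0000, 0x007F, 0),
    (0x0400, 0x04FF, 128),
    (0x4E00, 0x9FFF, 384),
]
RANGE_STARTS = [first for first, _, _ in RANGES]

def record_index(codepoint):
    """Return the record index for codepoint, or None if it falls in a gap."""
    # Binary search for the last range whose first code point is <= codepoint.
    i = bisect.bisect_right(RANGE_STARTS, codepoint) - 1
    if i < 0:
        return None
    first, last, base = RANGES[i]
    if codepoint > last:
        return None
    # Inside a contiguous range the record sits at a fixed offset.
    return base + (codepoint - first)
```

With only a couple dozen ranges the search is a handful of comparisons, and everything after that is plain arithmetic.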

This isn't just an upper/lower mapping. I can make that alone; it'll be 1300-ish entries, mostly binary search.

dhasenan avatar Apr 14 '17 15:04 dhasenan