generate small/fast table for changes_when_{uppercased,lowercased}

Open thomcc opened this issue 2 years ago • 2 comments

This a rough pr with finished design/impl but the code is just a mess[^1], and I wasn't really going to PR this, but here it is . The output combined all into one playground can be seen here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3568421c224d239f574e1eec7a964381. It has a test which shows answers both queries correclty for every char.

[^1]: Like, it's a huge mess aside from proving the approach (the code has tons of debugging stuff and duplication, etc), I'd clean it up a lot if you were interested.

The basic notes are:

both props are answered w/ a binary search against the same ~1kbish table. Code size impact is minimal.
the logic to search and interpret the table is small and fast, especially for characters that never change under lowercase/uppercase mapping.
There's a special case for ASCII inputs so they don't have to hit a table at all.
The table encoding/format/generator are totally custom for this purpose. Otherwise it'd be bigger.

I could probably get it faster than this but it's already overengineered enough.

The overall approach is to store a list of ranges in a table, and binary search that. The ranges might indicate:

a run chars that all have the same property values for those two properties (range of chars which only change when uppercased, range of chars which change for both, etc). for example U+00c0..=U+00d6 only change under lowercase (but not uppercase)
A run of chars which alternate between upper,lower,upper,lower,upper,lower (or similarly lower,upper,lower,upper,...). This is very common in Unicode, and special casing this is why the is only 200ish ranges (and not over 1000, most of them for a single character)

Runs with lengths that dont fit into 8 bits are split into multiple contiguous smaller ones, and then it's encoded as u32, as MSB[21 bit start_char | 3 bit range type | 8 bit length]LSB.

This seems likely to work indefinitely since 21 bits can fit any char, and we dont use all the values for the 3 bit range type. That said, it probably won't have to change.

The generator uses a greedyish algorithm to categorize every character into a range. It then filters out ASCII and "no changes" ranges, splits them up (with minimal cleanup) and encodes each range into a u32.

I started this a while ago, then forgot, then came back and banged it all out this weekend. It's a problem space near/dear to my hear tho -- ive spent an unreasonable amount of time on making unicode tables better.

Re: https://github.com/RReverser/cow-utils-rs/pull/3#issuecomment-1741358155

Yeah, as I said, as long as it doesn't blow up the size, it should be fine

Well, the size impact of this on generated binary is very small, and it runs fast too. But it's rather high complexity in the table generator. So the size impact in this repo is high. I don't mind owning/maintaining that, and I'm happy to get the code more cleaned up if you want, but I wouldn't be offended if you don't.

FWIW, it only relies on unicode stable promises (like the number of bits in a codepoint), and handles cases that could change (8 len bits not being enough). That means in theory future unicode updates should not come with any drama.

It's unfortunate that essentially every crate (std, regex, us) has to ship its own version of unicode data though.

My general feeling is that with enough elbow grease you can get any of the unicode tables small. It's a compression (and data access) problem, just not a very well-studied one for whatever reason. I had a scratch workspace that could produce all tables regex needed with ~40kb data. It would have required a lot of changes to actually use, so I put it in my own regex engine, which never saw the light of day. So it goes.

Still, it's not ideal that everything has their own tables, but I don't see an alternative really. I don't want load them from the system, and stuff like icu4x feels like the wrong choice for code which cares about footprint.

Oct 11 '23 01:10 thomcc

and stuff like icu4x feels like the wrong choice for code which cares about footprint

Oh yeah so I actually did try to use icu-properties for this couple of days ago, but somehow it made code even slower than it is right now (with manual changes_when_uppercased implementation).

I'm going to try and review this soon-ish, but, as you said, it's a lot of code so might take some time.

Meanwhile, could you post benchmarks for before/after this change? I wonder if it actually speeds things up.

Oct 11 '23 02:10 RReverser