chars Suggestion: Unicode version codepoint was added

I deal with Unicode a fair bit and chars is a handy tool. Sometimes it would be convenient to know which Unicode version assigned a particular codepoint.

E.g., the output from chars might look something like this. The version information could be hidden by default and require a command-line flag if it were deemed too noisy.

$ chars party
U+0001F973, &#129395; 0x0001F973, \0374563, UTF-8: f0 9f a5 b3, UTF-16BE: d83edd73
Width: 2, prints as 🥳
Quotes as \u{1f973}
Unicode name: FACE WITH PARTY HORN AND PARTY HAT
Unicode version: 11.0

U+0001F389, &#127881; 0x0001F389, \0371611, UTF-8: f0 9f 8e 89, UTF-16BE: d83cdf89
Width: 2, prints as 🎉
Quotes as \u{1f389}
Unicode name: PARTY POPPER
Unicode version: 6.0

I think the information is available via the DerivedAge.txt file in the UCD.

Jun 17 '20 04:06 wezm

This is a marvelous idea! Thanks for submitting it! :D

I'm not sure I can take a look at this in the next few weeks, but would love to have this feature. If you want to take a stab at it, I can probably give you enough guidance to get you started, though (:

Jun 17 '20 19:06 antifuchs

I might be able to take a look on the weekend. Did you have any preferences/thoughts regarding whether the version information is output by default?

Jun 17 '20 23:06 wezm

I think showing the version unconditionally would be just fine - chars is somewhat aggressively non-configurable and maximally informative for human users, so just adding it would work well (:

To add this feature, I think it's a two/three step process:

1. You'd add a task to fetch the data file to the chars_data subcrate in the chars workspace here,
2. Update write_name_data in the unicode portion to emit another table giving Unicode versions & the ranges added in them (ideally make it a memory-optimized data structure; I don't extremely mind searching through n*13ish Unicode versions for each character, but would be worried if we added a table mapping each character to a version number... maybe there's something one could do with tries though?)
3. Update the Codepoint Display impl's branch for Unicode here-ish to show the version number.

...and that's about it, I think! The main difficulty will probably be making a parser for that data file (for the ones I made, I got by with regex-based parsers, but feel free to use any other reasonable method, tbqh) and finding a decently space-efficient repr for the version table. Best of luck!
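
For illustration, here is a rough sketch of the parsing and lookup steps. It is not what chars or chars_data actually does: the names `AgeRange`, `parse_line`, `build_table` and `version_of` are made up for this example, and it assumes DerivedAge.txt lines have the shape `XXXX..YYYY ; major.minor # comment` (or a single codepoint instead of a range). It uses plain string splitting rather than a regex, and a sorted range table with binary search rather than a trie.

```rust
use std::cmp::Ordering;

/// A contiguous run of codepoints assigned in the same Unicode version.
/// (Hypothetical type for this sketch, not part of chars.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AgeRange {
    pub first: u32,
    pub last: u32,
    pub version: (u8, u8), // (major, minor)
}

/// Parse one line of the assumed DerivedAge.txt format.
/// Returns None for blank lines, comment lines and anything malformed.
pub fn parse_line(line: &str) -> Option<AgeRange> {
    // Everything after '#' is a comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let (range, version) = data.split_once(';')?;
    let range = range.trim();
    // Either "XXXX..YYYY" or a single codepoint "XXXX".
    let (first, last) = match range.split_once("..") {
        Some((a, b)) => (
            u32::from_str_radix(a, 16).ok()?,
            u32::from_str_radix(b, 16).ok()?,
        ),
        None => {
            let cp = u32::from_str_radix(range, 16).ok()?;
            (cp, cp)
        }
    };
    let (major, minor) = version.trim().split_once('.')?;
    Some(AgeRange {
        first,
        last,
        version: (major.parse().ok()?, minor.parse().ok()?),
    })
}

/// Build a table sorted by starting codepoint so it can be binary-searched.
pub fn build_table(input: &str) -> Vec<AgeRange> {
    let mut table: Vec<AgeRange> = input.lines().filter_map(parse_line).collect();
    table.sort_by_key(|r| r.first);
    table
}

/// Look up the version that assigned `cp`, or None if no range covers it.
pub fn version_of(table: &[AgeRange], cp: u32) -> Option<(u8, u8)> {
    table
        .binary_search_by(|r| {
            if cp < r.first {
                Ordering::Greater // this range lies above the codepoint
            } else if cp > r.last {
                Ordering::Less // this range lies below the codepoint
            } else {
                Ordering::Equal
            }
        })
        .ok()
        .map(|i| table[i].version)
}
```

Storing one entry per range keeps the table proportional to the number of lines in the data file rather than the number of codepoints, which is the space concern raised above. Calling `version_of(&table, 0x1F973)` on a table built from the full file would give the version to print on the "Unicode version:" line.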

Jun 21 '20 13:06 antifuchs

I made a start on this yesterday. I'm 50–75% done. Fortunately I think what you described above matches what I did/planned to do 😃

Jun 21 '20 23:06 wezm

That's fantastic to hear - excited to see what you came up with (:

Jun 22 '20 03:06 antifuchs
