Suggestion: show the Unicode version in which a codepoint was added
I deal with Unicode a fair bit and `chars` is a handy tool. Sometimes it would be convenient to know which Unicode version assigned a particular codepoint.
E.g. the output from `chars` might look something like this. The version information could be hidden by default and require a command-line flag if it were deemed too noisy.
$ chars party
U+0001F973, 🥳 0x0001F973, \0374563, UTF-8: f0 9f a5 b3, UTF-16BE: d83edd73
Width: 2, prints as 🥳
Quotes as \u{1f973}
Unicode name: FACE WITH PARTY HORN AND PARTY HAT
Unicode version: 11.0
U+0001F389, 🎉 0x0001F389, \0371611, UTF-8: f0 9f 8e 89, UTF-16BE: d83cdf89
Width: 2, prints as 🎉
Quotes as \u{1f389}
Unicode name: PARTY POPPER
Unicode version: 6.0
I think the information is available via the `DerivedAge.txt` file in the UCD.
This is a marvelous idea! Thanks for submitting it! :D
I'm not sure I can take a look at this in the next few weeks, but would love to have this feature. If you want to take a stab at it, I can probably give you enough guidance to get you started, though (:
I might be able to take a look on the weekend. Did you have any preferences/thoughts regarding whether the version information should be output by default?
I think showing the version unconditionally would be just fine - `chars` is somewhat aggressively non-configurable and maximally informative for human users, so just adding it would work well (:
To add this feature, I think it's a two/three step process:
- you'd add a task to fetch the data file to the `chars_data` subcrate in the chars workspace here,
- update `write_name_data` in the unicode portion to emit another table giving Unicode versions & the ranges added in them (ideally make it a memory-optimized data structure; I don't extremely mind searching through n*13ish Unicode versions for each character, but would be worried if we added a table mapping each character to a version number... maybe there's something one could do with tries though?)
- update the `Codepoint` `Display` impl's branch for Unicode here-ish to show the version number.
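For the "ranges per version" table in the second step, one possible shape is a sorted slice of inclusive ranges searched with a binary search. This is only a sketch: `AGE_TABLE` and `unicode_age` are hypothetical names, not anything that exists in chars, and the two entries are hand-written single-codepoint ranges taken from the example output above. A generated table would hold the much wider ranges from the data file.

```rust
use std::cmp::Ordering;

/// Hypothetical generated table: (first, last, major, minor), sorted by
/// `first`, with non-overlapping inclusive ranges. The two rows here are
/// illustrative stand-ins, not real generated data.
const AGE_TABLE: &[(u32, u32, u8, u8)] = &[
    (0x1F389, 0x1F389, 6, 0),  // PARTY POPPER
    (0x1F973, 0x1F973, 11, 0), // FACE WITH PARTY HORN AND PARTY HAT
];

/// Returns the (major, minor) version for `cp`, or None if no range covers it.
fn unicode_age(cp: u32) -> Option<(u8, u8)> {
    AGE_TABLE
        .binary_search_by(|&(first, last, _, _)| {
            if last < cp {
                Ordering::Less // range lies entirely below cp
            } else if first > cp {
                Ordering::Greater // range lies entirely above cp
            } else {
                Ordering::Equal // cp falls inside this range
            }
        })
        .ok()
        .map(|i| (AGE_TABLE[i].2, AGE_TABLE[i].3))
}
```

A lookup costs O(log n) in the number of ranges rather than the number of codepoints, which avoids the per-character table mentioned above. A trie might be smaller still, but that is not explored here.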
...and that's about it, I think! The main difficulty will probably be making a parser for that data file (for the ones I made, I got by with a regex-based parser, but feel free to use any other reasonable method, tbqh) and finding a decently space-efficient repr for the version table. Best of luck!
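As a rough idea of what a non-regex parser could look like: UCD data lines in this file have the form `range ; version # comment`, where the range is either a single hex codepoint or `first..last`. The sketch below uses only the standard library. `AgeEntry` and `parse_age_line` are made-up names for illustration, and it has not been checked against the full file.

```rust
/// One `DerivedAge.txt` entry: an inclusive codepoint range plus the
/// Unicode version (major, minor) that first assigned it.
/// `AgeEntry` and `parse_age_line` are hypothetical names, not part of chars.
#[derive(Debug, PartialEq)]
struct AgeEntry {
    first: u32,
    last: u32,
    version: (u8, u8),
}

fn parse_age_line(line: &str) -> Option<AgeEntry> {
    // Everything after '#' is a comment; blank and comment-only lines yield None.
    let data = line.split('#').next()?.trim();
    let (range, version) = data.split_once(';')?;
    let (range, version) = (range.trim(), version.trim());

    // A range is either "XXXX..YYYY" or a single codepoint "XXXX", in hex.
    let (first, last) = match range.split_once("..") {
        Some((lo, hi)) => (
            u32::from_str_radix(lo, 16).ok()?,
            u32::from_str_radix(hi, 16).ok()?,
        ),
        None => {
            let cp = u32::from_str_radix(range, 16).ok()?;
            (cp, cp)
        }
    };

    // The version field looks like "6.0" or "11.0".
    let (major, minor) = version.split_once('.')?;
    Some(AgeEntry {
        first,
        last,
        version: (major.parse().ok()?, minor.parse().ok()?),
    })
}
```

Feeding every line of the file through `parse_age_line` and keeping the `Some` results would give the list of ranges to emit into the generated table.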
I made a start on this yesterday. I'm 50–75% done. Fortunately I think what you described above matches what I did/planned to do 😃
That's fantastic to hear - excited to see what you came up with (: