Add Unicode escaping
Nit should support literal Unicode escape sequences such as \u008B and \U0000080B, added in escape_to_nit and unescape_nit. I assume a change in the lib will automatically be picked up by the Nit compiler and tools.
https://github.com/nitlang/nit/pull/2459#discussion_r118020201
I think this should only be added to unescape_nit; nothing should be done C-wise, except maybe compiling every non-ASCII Unicode character to its byte representation as \x sequences (this could probably become a compatibility option at some point).
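To make the \x idea concrete, here is a rough sketch in Python (Nit is the actual target, but Python keeps the illustration short; the function name is hypothetical, not part of escape_to_nit):

```python
def escape_non_ascii_as_bytes(s: str) -> str:
    """Escape every non-ASCII character as the \\xNN sequences of its
    UTF-8 byte representation; leave plain ASCII untouched."""
    out = []
    for ch in s:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("\\x%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)

print(escape_non_ascii_as_bytes("héllo"))  # -> h\xC3\xA9llo
```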
There is, however, one point that may need some discussion: invalid UTF-8 code points. Should we handle invalid sequences like surrogates? My guess would be yes, but this would mean that valid strings in languages like Java or the .NET languages might not be understood in Nit. Also, adding surrogate support will likely induce some performance hit due to the lookahead, though it will probably be minor since \u escape sequences are not that popular, especially surrogate ones.
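To illustrate what the surrogate lookahead has to do: when the decoder reads a high surrogate, it must peek at the next escape to see whether it is the matching low surrogate, and combine the two. The pair arithmetic itself is standard UTF-16; a Python sketch (hypothetical helper, not Nit code):

```python
def combine_surrogates(hi: int, lo: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    assert 0xD800 <= hi <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= lo <= 0xDFFF, "not a low surrogate"
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# "\uD83D\uDE00" is how Java/.NET spell U+1F600
print(hex(combine_surrogates(0xD83D, 0xDE00)))  # -> 0x1f600
```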
Other than that, since Unicode is limited to U+10FFFF, no \U escape sequence should encode anything above that.
And tool-wise, there will probably be one minor modification to the grammar if we want to support this kind of sequence, but it should not be too much of a hassle to implement.
PR will likely follow in the next couple of days
One problem I think we have to consider is that people are used to \u and \U to always take 4 digits (so they are limited to BMP or UCS-2/UTF-16 wydes). It is especially important for strings like "1\u00A0000\u00A0000" (1 000 000).
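To make the ambiguity concrete: with a fixed 4-digit rule, the \u00A0 in "1\u00A0000\u00A0000" stops after 00A0 (NO-BREAK SPACE) and the following 000 stays literal text, while a greedy 1-to-6-digit rule would swallow \u00A000 (U+A000, a Yi syllable) instead. A quick Python illustration (hypothetical helper; the real parser would not use regexes):

```python
import re

LITERAL = r'1\u00A0000\u00A0000'  # raw string: the backslashes are literal

def unescape(s: str, pattern: str) -> str:
    """Replace each \\u escape matched by `pattern` with its character."""
    return re.sub(pattern, lambda m: chr(int(m.group(1), 16)), s)

fixed  = unescape(LITERAL, r'\\u([0-9A-Fa-f]{4})')    # exactly 4 digits
greedy = unescape(LITERAL, r'\\u([0-9A-Fa-f]{1,6})')  # greedy, up to 6

print(fixed)   # "1 000 000" with NO-BREAK SPACEs, as intended
print(greedy)  # "1" U+A000 "0" U+A000 "0": not what the author meant
```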
This is linked with what I pointed out yesterday; imo the spec should be something along the lines of:
- Allow (u|U)[0-9A-Fa-f]{1,6}
- Disallow characters above the Unicode maximum (0x10FFFF)
The only question remaining is what to do with surrogate pairs, should we allow them?
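A sketch of that spec as a validator, in Python for illustration (hypothetical names; the surrogate policy is left as a flag since it is the open question):

```python
import re

MAX_CODE_POINT = 0x10FFFF
ESCAPE = re.compile(r'\\[uU]([0-9A-Fa-f]{1,6})')  # (u|U), 1 to 6 hex digits

def validate_escape(seq: str, allow_surrogates: bool = False) -> int:
    """Parse a single \\u/\\U escape; reject out-of-range code points."""
    m = ESCAPE.fullmatch(seq)
    if m is None:
        raise ValueError("malformed escape: " + seq)
    cp = int(m.group(1), 16)
    if cp > MAX_CODE_POINT:
        raise ValueError("above the Unicode maximum: %#x" % cp)
    if not allow_surrogates and 0xD800 <= cp <= 0xDFFF:
        raise ValueError("lone surrogate: %#x" % cp)
    return cp

print(hex(validate_escape(r"\U10FFFF")))  # -> 0x10ffff
```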
In some languages (C# being one example, if my memory serves me right), \u and \U have different masks: the capital one expects 8 digits while the other expects 4. This feels like an example of what not to do, if you ask me.
IMO, since you allow non-BMP code points, the behavior should be in sync with Int::code_point, in order to avoid confusion.
What is the behavior of JS and Python on surrogate pairs, and on \u vs. \U?
JS and JSON were designed for UCS-2/UTF-16, so they simply handle surrogate pairs as UTF-16 prescribes. Furthermore, \u must always be written with a lowercase u (U+0075).
Sources:
- ECMAScript 2016, section 11.8.4
- ECMA-404, section 9
- RFC 7159, section 7
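This UTF-16-oriented behavior can be observed through any conforming JSON parser; for instance, with Python's json module:

```python
import json

# JSON (per ECMA-404 / RFC 7159) spells U+1F600 as a surrogate pair,
# exactly as UTF-16 prescribes; the 'u' is lowercase and takes 4 digits.
s = json.loads('"\\uD83D\\uDE00"')
print(s, hex(ord(s)))  # the pair decodes to the single code point 0x1f600
```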