Character unescaping improvements

Open ForNeVeR opened this issue 1 year ago • 0 comments

Some issues with the current code in Cesium.CodeGen.Ir.Expressions.Constants.CharConstant.UnescapeCharacter and Cesium.Parser.TokenExtensions.UnwrapStringLiteral:

[ ] There are two of them, each with a different implementation. There should be only one shared implementation.
[ ] UnescapeCharacter doesn't support \u and \U (universal-character-name in the standard).
[ ] UnescapeCharacter also has a bug in handling octal and hex sequences: both are assumed to be exactly two digits long, with special treatment of \0. The standard, however, defines octal sequences as one, two, or three digits long, while hex escapes can be of arbitrary length.
[ ] \0 should not be a special case in either of the methods; it is just an octal number.
[ ] UnwrapStringLiteral also seems to treat octal sequences weirdly: I only see support for octal numbers starting with 0, which is not correct (UnescapeCharacter handles these better).
[ ] Normal compiler behavior is to report a warning on an invalid sequence (e.g. \m) and treat it as the character itself. We don't do this: we either silently accept or break on such sequences.
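
To make the expected behavior from the checklist concrete, here is a rough sketch of a single unescaping routine. It is written in Python only for brevity and is not Cesium code: the function name `unescape` and the lookup tables are hypothetical. It assumes the surrounding quotes are already stripped, and it picks one possible policy for cases the checklist doesn't cover (raising an error on a dangling backslash, on `\x` with no digits, and on an incomplete `\u`/`\U`). It also does not check that a hex value fits the target character type.

```python
import warnings

# Simple escapes: the character after the backslash maps to one character.
SIMPLE_ESCAPES = {
    "'": "'", '"': '"', "?": "?", "\\": "\\",
    "a": "\a", "b": "\b", "f": "\f", "n": "\n",
    "r": "\r", "t": "\t", "v": "\v",
}
OCTAL_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def unescape(text):
    """Decode escape sequences in a literal body (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(text):
            raise ValueError("dangling backslash at end of literal")
        ch = text[i]
        if ch in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[ch])
            i += 1
        elif ch in OCTAL_DIGITS:
            # One to three octal digits; \0 is just the one-digit case.
            end = i
            while end < len(text) and end - i < 3 and text[end] in OCTAL_DIGITS:
                end += 1
            out.append(chr(int(text[i:end], 8)))
            i = end
        elif ch == "x":
            # Hex escapes consume every hex digit that follows.
            end = i + 1
            while end < len(text) and text[end] in HEX_DIGITS:
                end += 1
            if end == i + 1:
                raise ValueError("\\x with no hex digits")
            out.append(chr(int(text[i + 1:end], 16)))
            i = end
        elif ch in "uU":
            # Universal character names: exactly 4 (\u) or 8 (\U) hex digits.
            width = 4 if ch == "u" else 8
            digits = text[i + 1:i + 1 + width]
            if len(digits) != width or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("incomplete universal character name")
            out.append(chr(int(digits, 16)))
            i += 1 + width
        else:
            # Invalid sequence such as \m: warn and keep the character itself.
            warnings.warn(f"unknown escape sequence '\\{ch}'")
            out.append(ch)
            i += 1
    return "".join(out)
```

With this shape, `\0`, `\7` and `\101` all go through the same octal branch, and both the character-constant and string-literal paths could call the same routine.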