smlfmt: Support MLB option "allowExtendedTextConsts true"

Currently, smlfmt reports a syntax error on non-ASCII input.

Example file:

val a = "🍰"

Error message:

-- SYNTAX ERROR ----------------------------------------------------------------

Invalid character.

test.sml
  | 
1 | val a = "🍰"
  |          ^

Strings can only contain printable (visible or whitespace) ASCII characters.

Expected behavior

String literals should accept non-ASCII characters encoded as UTF-8.

Apr 11 '22 19:04 UltimatePea

Supporting this won't be too bad, but will require changes in a few places.

For the lexer, we'll need to skip over UTF8 characters in the function advance_oneCharOrEscapeSequenceInString. Note that this function already skips over escape sequences; handling UTF8 should be similar. And then we can selectively enable this functionality by adding an additional flag to the lexer functions Lexer.next and Lexer.tokens.
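To make the lexer change concrete, here is a minimal sketch in Python (not smlfmt's actual SML code) of stepping over one character inside a string literal. The names utf8_sequence_length, advance_one_char and allow_utf8 are hypothetical stand-ins for the lexer function and the extra flag mentioned above, and escape sequences are left out.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead_byte:#x}")


def advance_one_char(src: bytes, i: int, allow_utf8: bool) -> int:
    """Return the index just past the character starting at src[i].

    allow_utf8 is a hypothetical flag standing in for the extra lexer option.
    """
    b = src[i]
    if b < 0x80:
        return i + 1
    if not allow_utf8:
        raise ValueError(f"invalid character at byte offset {i}")
    n = utf8_sequence_length(b)
    tail = src[i + 1 : i + n]
    # Every byte after the lead byte must look like 0b10xxxxxx.
    if len(tail) != n - 1 or any((t & 0xC0) != 0x80 for t in tail):
        raise ValueError(f"malformed UTF-8 sequence at byte offset {i}")
    return i + n
```

With the example file from this issue, the cake emoji starts at byte offset 9 and occupies four bytes, so the sketch advances from 9 to 13 when the flag is on and raises an error when it is off.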

We'll need to update the implementation of Source, too, such as Source.absoluteStart which returns the position (line and col) of a source file segment. Currently these are computed via byte offsets, which is no longer correct under UTF8. I believe other functions will need to be updated, too, to ensure that a Source.t never starts or ends in the middle of a UTF8 sequence.
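As a rough illustration of the difference, the Python sketch below derives a line and column from a byte offset while counting code points rather than bytes, and checks whether an offset falls on a character boundary. It is not the Source implementation; line_and_col and is_char_boundary are hypothetical names, and the input is assumed to be well-formed UTF-8.

```python
def is_char_boundary(src: bytes, offset: int) -> bool:
    """True if offset does not fall in the middle of a UTF-8 sequence."""
    if offset == 0 or offset == len(src):
        return True
    # Continuation bytes have the form 0b10xxxxxx.
    return (src[offset] & 0xC0) != 0x80


def line_and_col(src: bytes, offset: int) -> tuple[int, int]:
    """1-based (line, col) of a byte offset, with col counted in code points."""
    line = src.count(b"\n", 0, offset) + 1
    line_start = src.rfind(b"\n", 0, offset) + 1
    # Each non-continuation byte starts exactly one code point.
    col = sum(1 for b in src[line_start:offset] if (b & 0xC0) != 0x80) + 1
    return line, col
```

For the line `val a = "🍰" val b = 1`, the second `val` starts at byte offset 15. A byte-based column would be 16, while the code-point column computed here is 13.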

Apr 11 '22 20:04 shwestrick

By the way, what is the accepted standard practice these days for visually handling "characters" that are made up of more than one Unicode code point? E.g., the flag emoji "🇺🇸" is actually two code points ("🇺" followed by "🇸"), each encoded as four UTF-8 bytes. But of course, it is intended to be visually represented as a single character.

My initial thought is that this is important for smlfmt because we need to know positions to vertically align things correctly. Do we use the UTF8 semantic position, or the intended visual position? I'm inclined to use UTF8 semantic position...
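To put numbers on the two candidate notions of position, here is a small Python sketch that measures a string in UTF-8 bytes and in code points. The helper name position_metrics is made up for illustration. Neither number is the visual width: grouping code points into what a reader sees as one character needs Unicode segmentation rules, which Python's standard library does not provide.

```python
def position_metrics(text: str) -> dict[str, int]:
    """Size of text in UTF-8 bytes and in Unicode code points."""
    return {
        "bytes": len(text.encode("utf-8")),
        "code_points": len(text),
    }


# Regional indicators U and S, which render as a single flag where supported.
flag = "\U0001F1FA\U0001F1F8"
# The cake emoji from the example file in this issue.
cake = "\U0001F370"

flag_metrics = position_metrics(flag)  # 8 bytes, 2 code points
cake_metrics = position_metrics(cake)  # 4 bytes, 1 code point
```

So a column counter based on code points would advance by two for the flag, even though it is meant to be displayed as one glyph.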

Apr 11 '22 20:04 shwestrick

Thanks for the info!

I am not very familiar with UTF8/Unicode, but I would suggest we at least fix the lexer to not produce an error when encountering a UTF8 character.

I am not so familiar with the difference between semantic position and visual position, so I vote for whatever is easier to implement, which is probably UTF8 semantic position.

Apr 11 '22 21:04 UltimatePea

It occurred to me that a simpler way to support this is to allow for UTF-8 bytes but not check for validity of a UTF-8 byte sequence. #74 implements this.
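In Python terms, the idea is roughly the following. This is only a sketch of the approach described here, not the code from the pull request: the names are hypothetical, and treating 0x20 through 0x7E as the accepted ASCII range is an assumption made for the example.

```python
from typing import Optional


def is_allowed_in_string(byte: int, allow_extended: bool) -> bool:
    """Accept printable ASCII, plus any byte >= 0x80 when the option is on."""
    if 0x20 <= byte <= 0x7E:  # assumed printable ASCII range
        return True
    # No attempt is made to check that high bytes form valid UTF-8.
    return allow_extended and byte >= 0x80


def first_invalid_byte(body: bytes, allow_extended: bool) -> Optional[int]:
    """Offset of the first rejected byte in a string body, or None."""
    for i, b in enumerate(body):
        if not is_allowed_in_string(b, allow_extended):
            return i
    return None
```

The trade-off is that a malformed sequence such as a lone 0xFF byte is accepted as well, which is what skipping validation means in practice.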

By default, this is disabled. It can be enabled with -allow-extended-text-consts true at the command-line, or with the "allowExtendedTextConsts true" annotation within an MLB.

Your example above should now be working. Let me know if you have any trouble!

Jan 09 '23 18:01 shwestrick