char-class type reification does something wrong for high surrogate/low surrogate pairs

Open jurgenvinju opened this issue 1 year ago • 3 comments

rascal>charAt("🍝", 0)
int: 127837
rascal>char(127837)
[🍝]: ([🍝]) `🍝`
rascal>#[🍝]
type[[?]]: type(
  \char-class([range(55356,55356)]),...

So the Unicode codepoint 127837 is the right codepoint for 🍝, but type reification of a character class type turns it into codepoint 55356, which does not even have a graphical representation in the current font: ?. Maybe it's not even a codepoint.

jurgenvinju, Jul 21 '24 09:07

This goes wrong in the runtime system of the interpreter which shares a lot of code with the runtime code for the compiler, so probably this breaks everywhere.

jurgenvinju, Jul 21 '24 09:07

@jurgenvinju as it's represented as a surrogate pair, what tends to go wrong is that you only see the high surrogate part of the pair. 🍝 is encoded in UTF-16 as 0xD83C and 0xDF5D, and 55356 in hex is 0xD83C.
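
For reference, the arithmetic behind that pair can be sketched in a few lines of Java. This is only an illustration of the standard UTF-16 encoding rule; the class and method names (`SurrogatePairDemo`, `toSurrogates`) are made up for this example and do not come from the Rascal code base.

```java
final class SurrogatePairDemo {
    // Hypothetical helper: split a supplementary code point (above 0xFFFF)
    // into its UTF-16 high and low surrogate code units.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int pasta = 127837; // U+1F35D, the code point reported by charAt in Rascal
        char[] pair = toSurrogates(pasta);
        System.out.printf("high=0x%04X low=0x%04X%n", (int) pair[0], (int) pair[1]);

        // The JDK offers the same computation directly.
        System.out.println(pair[0] == Character.highSurrogate(pasta));
        System.out.println(pair[1] == Character.lowSurrogate(pasta));
    }
}
```

The high surrogate computed this way is 0xD83C (55356), which is exactly the value that shows up in the reified `range(55356,55356)`.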

So the bug is: somewhere the type generation takes a Java string and gets the first char (charAt(0), I assume) instead of correctly using codePointAt(0).
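
A minimal sketch of the difference, assuming the suspected pattern really is a `charAt(0)` call on the character's string form. This is not the code from `SymbolFactory.java`; `CodePointDemo`, `firstCodeUnit` and `firstCodePoint` are hypothetical names used only to contrast the two `java.lang.String` methods.

```java
final class CodePointDemo {
    // Suspected buggy pattern: charAt returns a single UTF-16 code unit,
    // so for a surrogate pair it yields only the high surrogate.
    static int firstCodeUnit(String s) {
        return s.charAt(0);
    }

    // Surrogate-aware alternative: codePointAt combines a high surrogate
    // with the low surrogate that follows it into one code point.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String pasta = "\uD83C\uDF5D"; // U+1F35D written as its surrogate pair

        System.out.println(firstCodeUnit(pasta));  // high surrogate only
        System.out.println(firstCodePoint(pasta)); // full code point

        // length() counts code units, codePointCount counts code points.
        System.out.println(pasta.length());
        System.out.println(pasta.codePointCount(0, pasta.length()));
    }
}
```

With the pasta character, `firstCodeUnit` gives 55356 and `firstCodePoint` gives 127837, matching the two numbers in the report above.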

DavyLandman, Jul 21 '24 09:07

Then this is the cause: https://github.com/usethesource/rascal/blob/a23db0e94de06ecdd82963757c10ce087ff44a23/src/org/rascalmpl/values/parsetrees/SymbolFactory.java#L363

jurgenvinju, Jul 21 '24 10:07