Char Unicode Data and Conversions

Open peblair opened this issue 3 years ago • 9 comments

It would be useful for Char to support a variety of Unicode-aware query functions and conversion functions (toUpper, isPunctuation, etc). For example, these are the ones supported by Racket here and here.

We should try to add as many of these as possible, as one never knows what might be useful for libraries.

peblair avatar May 17 '21 20:05 peblair

I'm currently working on emitting JSON in Grain, and for escaping I need to generate a UTF-16 surrogate pair from a Unicode code point. And of course vice versa, but parsing is still a long way off.

cician avatar May 29 '21 23:05 cician

@cician My question is a tad unrelated to this issue, but I'm not sure I understand: for what you're trying to accomplish, why do you need to make surrogate pairs? Grain strings are UTF-8.

ospencer avatar May 29 '21 23:05 ospencer

Actually I don't strictly need it, because for conforming JSON output in UTF-8 only the control characters 0-31 (plus the double quote and backslash) need to be escaped, but I've tentatively added an option to escape all non-ASCII characters.

The ECMA-404 spec (https://www.ecma-international.org/publications-and-standards/standards/ecma-404/) says the escaping should be done in UTF-16 pairs, unless I misunderstand something. I'm learning about both Unicode and Grain in the process. I think it's a consequence of the fact that JSON inherits some properties from JavaScript, which doesn't use UTF-8 internally. That carries over to how escaping is done in JavaScript strings and thus in JSON.

PS: I'm working on it here.

cician avatar May 30 '21 00:05 cician

Ah I see, it's the specification for unicode character escapes that appear within JSON object strings. Got it. That's interesting! So you'd want a utility like Char.escapeSurrogatePair : Char -> String that would take a char and return its unicode escape as a surrogate pair, e.g. assert Char.escapeSurrogatePair('𝄞') == "\\uD834\\uDD1E"? That'd differ from Char.escape which would just produce "\\u{1D11E}" for regular Grain strings, yeah?
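
A minimal sketch of what such a helper might compute, written in Python rather than Grain purely for illustration. The name escape_utf16 is hypothetical and only mirrors the names floated in this thread; it leans on Python's built-in UTF-16 codec to obtain the code units instead of doing the surrogate arithmetic by hand.

```python
import struct


def escape_utf16(ch: str) -> str:
    # Hypothetical helper: returns the JSON-style escape(s) for one character,
    # one escape per UTF-16 code unit.
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Big-endian UTF-16 without a BOM: 2 bytes for a BMP character,
    # 4 bytes (a surrogate pair) for anything above U+FFFF.
    data = ch.encode("utf-16-be")
    units = struct.unpack(f">{len(data) // 2}H", data)
    return "".join(f"\\u{unit:04X}" for unit in units)


print(escape_utf16("\U0001D11E"))  # prints \uD834\uDD1E
```

A character inside the Basic Multilingual Plane yields a single escape, so the same function covers both cases.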

ospencer avatar May 30 '21 01:05 ospencer

Or I guess it could just be called escapeUtf16.

ospencer avatar May 30 '21 01:05 ospencer

For now I've just copied a few lines from OpenJDK's source to do the job, but I should probably remove it to avoid copyright/licensing issues.

I don't think escapeUtf16 makes much sense as a standalone function, as opposed to being part of the JSON-specific code, unless we want to build a library like this: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.

In Java's standard library there are simply two functions for this:

char highSurrogate(int codePoint);
char lowSurrogate(int codePoint);

In Grain it wouldn't make sense to return Char though. These would rather be plain numbers with their own specific meaning in Unicode terminology.
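
For reference, the arithmetic behind those two functions is small. The following is a sketch in Python (not Grain, and not the OpenJDK code) that returns plain integers, as suggested above. The function names are hypothetical, and to_code_point is added only to show the reverse direction.

```python
def _check_supplementary(code_point: int) -> None:
    # Only code points above the BMP are encoded as surrogate pairs.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")


def high_surrogate(code_point: int) -> int:
    # Top 10 bits of the 20-bit offset, placed in the D800..DBFF range.
    _check_supplementary(code_point)
    return 0xD800 + ((code_point - 0x10000) >> 10)


def low_surrogate(code_point: int) -> int:
    # Bottom 10 bits of the 20-bit offset, placed in the DC00..DFFF range.
    _check_supplementary(code_point)
    return 0xDC00 + ((code_point - 0x10000) & 0x3FF)


def to_code_point(high: int, low: int) -> int:
    # Inverse operation: rebuild the code point from a surrogate pair.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = 0x1D11E  # MUSICAL SYMBOL G CLEF
print(hex(high_surrogate(cp)), hex(low_surrogate(cp)))  # 0xd834 0xdd1e
```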

cician avatar May 30 '21 10:05 cician

@peblair I am currently trying to implement the Unicode-aware functions by generating code based on the Unicode data files (for example https://unicode.org/Public/UNIDATA/UnicodeData.txt). This results in several thousand lines of Map.set code, and I read in the Contributing instructions that it should all be contained in a single file. Can I extract the code to another file for readability purposes, or should I just put it all in the char file?

FinnRG avatar Jun 18 '22 11:06 FinnRG

@FinnRG Thanks for doing some work on this! I think it would make sense to have the data in a separate file, but we may want to hold off on the effort briefly. Once #1330 lands, we will have a more coherent way of working with WASM data sections in Grain, which I think can give us a much more efficient way of storing the data in UnicodeData.txt (that way we avoid having thousands of Map.set calls on startup).

peblair avatar Jun 18 '22 11:06 peblair

Rust has this little tool for generating efficient bitsets and functions from the spec: https://github.com/rust-lang/rust/tree/master/src/tools/unicode-table-generator I think with a minimal amount of work we could have this generate Grain code instead.
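
To illustrate why a generated table can be far smaller than one Map.set call per code point, here is a rough sketch in Python (not Grain, and not what the Rust tool actually emits, which uses bitsets). It collapses a general category into sorted inclusive ranges and answers queries with a binary search. It assumes Python's unicodedata module as a stand-in for parsing UnicodeData.txt, and the helper names are hypothetical.

```python
import bisect
import sys
import unicodedata


def build_ranges(category: str) -> list[tuple[int, int]]:
    # Collapse every code point in the given general category
    # into sorted, inclusive (start, end) ranges.
    ranges: list[tuple[int, int]] = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == category:
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def in_ranges(ranges: list[tuple[int, int]], cp: int) -> bool:
    # Binary search for the last range whose start is <= cp.
    starts = [s for s, _ in ranges]
    i = bisect.bisect_right(starts, cp) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]


uppercase = build_ranges("Lu")  # Lu = uppercase letters
print(in_ranges(uppercase, ord("A")), in_ranges(uppercase, ord("a")))  # True False
```

A generator would run the equivalent of build_ranges offline and emit only the resulting range table, so nothing has to be populated at startup.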

spotandjake avatar Dec 29 '23 20:12 spotandjake