Implement full Unicode support
Assess the current state of Unicode support and fix any limitations.
I'm thinking the <string> class should support Unicode (UTF-8) and <unicode-string> should be removed. No <byte-string> class, just <byte-vector>, with easy conversion between <byte-vector> and <string> when needed. No doubt some existing code uses <byte-string> for non-ASCII data, and that would need to be switched to use <byte-vector> instead.
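For comparison, this is roughly the relationship Go has between string and []byte: two distinct types with cheap, explicit conversions in both directions. A small Go illustration, not a proposal for the exact Dylan API:

```go
package main

import "fmt"

func main() {
	// string is an immutable run of bytes (conventionally UTF-8);
	// []byte is the mutable byte-vector analogue. Conversion copies.
	s := "héllo"
	b := []byte(s)              // string -> byte vector
	t := string(b)              // byte vector -> string
	fmt.Println(len(b), t == s) // 6 true
}
```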
Makes sense.
One reason for keeping <unicode-string>s would be to make it easier to work with Win32 API.
Keeping them wouldn't help for dealing with Win32 at all. When calling out to any native API, you potentially need to convert to the right encoding: UTF-16 for Windows, I guess, although I've heard that newer Windows may start doing UTF-8. And anytime you're accessing file systems, you probably need to do some normalization.
This is partly why Rust has an OsStr type apart from its regular UTF-8-encoded strings.
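To make the Win32 point concrete, here's a rough Go sketch of the kind of conversion any UTF-8-based string would need before calling a wide ("W") Windows API; the helper name is made up for illustration:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16zFromString converts a UTF-8 string into a NUL-terminated
// slice of UTF-16 code units, the form expected by Win32 "W" APIs.
func utf16zFromString(s string) []uint16 {
	u := utf16.Encode([]rune(s)) // emits surrogate pairs where needed
	return append(u, 0)          // trailing NUL for the C side
}

func main() {
	fmt.Println(utf16zFromString("héllo 🙂"))
}
```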
Doesn't the DRM say anything about the encoding of <unicode-string>?
It does, but that is dumb. The definition in the DRM makes <unicode-string> assume UTF-16 encoding, and further implies that a <character> is either an encoded code point or simply an arbitrary word in some binary data that is supposed to be a string.
I feel that <byte-string> is intended for encoded legacy text data like Mac OS Roman encoding or Windows Latin-1, and should carry encoding information as part of it.
I have no problem getting rid of it, however, we do need some way of interoperating with C code. I suggest making <byte-string> a subclass of <string> with a bit of encoding information. Conceptually, it would function as a string with a limited character repertoire (that being the characters allowed by the encoding), as opposed to <unicode-string> which has an unlimited character repertoire, but both would still be a sequence of <character>. In the case of <byte-string>, only characters that fit the repertoire would be allowed—this would be similar to how limited integers work.
Both could use the same internal representation and use the same method implementations on <string>. The <byte-string> characters would have to be translated to or from a specific encoded form to interoperate with C code (which may be trivial if the internal representation of <string> already happens to be in that encoded form, such as the UTF-16 case).
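As a rough illustration of the "limited repertoire" idea, here's a Go sketch using Latin-1: decoding is a trivial per-byte translation (each Latin-1 byte value is its own code point), and re-encoding rejects characters outside the repertoire. The function names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// latin1ToString decodes ISO 8859-1 bytes: each byte value is exactly
// the Unicode code point it represents, so translation is trivial.
func latin1ToString(b []byte) string {
	var sb strings.Builder
	for _, c := range b {
		sb.WriteRune(rune(c))
	}
	return sb.String()
}

// stringToLatin1 re-encodes, rejecting characters outside the Latin-1
// repertoire (the "limited character repertoire" idea from above).
func stringToLatin1(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("U+%04X not representable in Latin-1", r)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	s := latin1ToString([]byte{0x63, 0x61, 0x66, 0xE9}) // "café"
	fmt.Println(s)
	b, err := stringToLatin1(s + " ✓")
	fmt.Println(b, err) // fails: U+2713 is outside the repertoire
}
```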
https://github.com/dylan-lang/opendylan/wiki/Unicode has some of Bruce's thoughts on Unicode support.
Also relevant, in my opinion, is Swift's recent change to String implementation: https://swift.org/blog/utf8-string/
- Peter S. Housel @housel May 01 11:00
I think it was Bruce Hoult who pointed out that having a single <string> class would save a lot on dispatch. So I'm thinking that a single, most likely immutable, UTF-8-based <string> class would be the best choice.
- Carl Gay @cgay May 01 11:04
+1
- Peter S. Housel @housel May 01 11:05
The default view should be <character> = Unicode scalar; extended grapheme clusters could be provided as an iteration protocol in a unicode library, but I wouldn't go as far as Swift did in making them the default view.
If <string> is UTF-8-based, we don't gain much by making <character> a code point instead of a grapheme cluster.
In UTF-8, an individual code point can still cover a variable number of bytes, just as a grapheme cluster can, so iteration would have to account for that anyway. And basic equality testing between characters and strings would require more care and attention to detail on the part of the developer (to ensure canonical encoding between alternatives), which is always to be avoided if possible.
Code points can be represented by a 2-bit tag + 21-bit code point value. There is no standard canonical encoding of extended grapheme clusters, and not all of them can be represented as single-word values without allocating storage. From an implementation standpoint, it's a can of worms I'd rather not open. Rust, Go, and Python 3 all use code points as the default view, as does the XML Information Set standard.
There is a big gap in complexity between just decoding multi-byte UTF-8 sequences into unicode scalars and doing full UAX #29 segmentation into extended grapheme clusters.
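A quick sketch of that tagged-immediate representation, in Go for illustration; the tag value 0b11 is an assumption here, not a settled choice:

```go
package main

import "fmt"

const charTag = 0b11 // assumed 2-bit immediate tag for characters

// tagChar packs a Unicode scalar value (at most 21 bits, max U+10FFFF)
// into a tagged machine word: code point in the upper bits, tag in the
// low 2 bits.
func tagChar(r rune) uintptr { return uintptr(r)<<2 | charTag }

// untagChar recovers the scalar value by dropping the tag bits.
func untagChar(w uintptr) rune { return rune(w >> 2) }

func main() {
	w := tagChar('λ') // U+03BB
	fmt.Printf("%#x -> %c\n", w, untagChar(w))
}
```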
Is it correct to say that segmentation of UTF-8 into code points can be done just by looking at the bit-patterns? Certainly, segmenting into extended grapheme clusters looks a lot more complicated than that. I suppose the clusters would have to be returned as a sequence/iteration of small strings?
Yes, UTF-8 can be interpreted quite efficiently using a simple DFA https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Rust and Go libraries I've seen that do cluster segmentation do indeed iterate through a sequence of small strings.
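To make the bit-pattern point concrete, here's a rough Go sketch that segments UTF-8 into code-point-sized chunks by inspecting only the leading byte; a production decoder (like the DFA linked above) would also validate continuation bytes and reject overlong or otherwise invalid sequences:

```go
package main

import "fmt"

// utf8SeqLen returns how many bytes the UTF-8 sequence starting with
// lead occupies, based only on the leading byte's bit pattern.
func utf8SeqLen(lead byte) int {
	switch {
	case lead&0x80 == 0x00: // 0xxxxxxx: ASCII
		return 1
	case lead&0xE0 == 0xC0: // 110xxxxx
		return 2
	case lead&0xF0 == 0xE0: // 1110xxxx
		return 3
	case lead&0xF8 == 0xF0: // 11110xxx
		return 4
	default: // continuation byte or invalid lead
		return 0
	}
}

func main() {
	s := "aé€🙂" // 1-, 2-, 3-, and 4-byte sequences
	for i := 0; i < len(s); {
		n := utf8SeqLen(s[i])
		fmt.Printf("%q is %d byte(s)\n", s[i:i+n], n)
		i += n
	}
}
```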
@promovicz did some related work here: https://github.com/dylan-lang/opendylan/tree/master/sources/app/unicode-data-generator
It is true that grapheme clusters are basically small strings rather than simple words, but is that really a problem? Yeah, code points are simpler, but they are also basically useless. Any practical work involving string inspection is going to need grapheme cluster boundaries anyway, unless you are dealing with straight-up ASCII. Even basics like ü could be represented by two code points or one.
Also, I'm sure you've seen, but they give us a regex to figure out grapheme cluster boundaries: https://unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters
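Here's a small Go demonstration of the ü point: the precomposed and decomposed forms render the same but differ at the code-point level, so naive counting and equality disagree unless you normalize first (which needs Unicode tables, e.g. golang.org/x/text, not shown here):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	precomposed := "\u00FC" // ü as a single code point
	decomposed := "u\u0308" // 'u' followed by COMBINING DIAERESIS
	fmt.Println(precomposed, decomposed)             // both render as ü
	fmt.Println(utf8.RuneCountInString(precomposed)) // 1
	fmt.Println(utf8.RuneCountInString(decomposed))  // 2
	fmt.Println(precomposed == decomposed)           // false without normalization
}
```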
Hey everyone, nice to see some Dylan discussions.
Back when I wrote that UCD generator, the intention was for Dylan to use code points as its primary character representation. The intention for strings was to use either UTF-16 or UCS-4, with the former providing a size advantage (UTF-16 is smaller) and the latter providing simpler manipulation (UCS-4 needs no surrogate-pair decoding; see the short example after this message). IIRC the Unicode standard recommends UTF-8 for I/O only, and the intention was to follow that. Most languages seem to use UTF-16.
Most practical 'string inspecting' happens in 'logical character order' - ignoring things such as bidirectional text and grapheme clusters. This is sufficient for string-splitting, (up/down/title)-casing of text, tokenization as well as basic search and comparison including regular expressions. All of these things should be done on strings, not characters or codepoints. Note however that normalization has to be considered whenever strings are compared as text. This is one of the reasons for having a database generator. Much software uses libICU for that.
Proper editing of Unicode text requires more than just grapheme clusters. Bidirectional text needs to be considered for that, and supporting non-western writing systems tends to add extra complexities. Some sort of glyph representation is useful here. Text rendering libraries like Pango do things like that.
Internationalizing programming languages is a whole different story. See for example the Arabic Scheme 'Qualb': https://github.com/nasser/---. There tend to be practicality issues with such things, even on the character level (which is why we write 'lambda' not 'λ').
Greetings and Best Wishes prom
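The UTF-16 vs. UCS-4 trade-off mentioned above, in a short Go example: characters outside the Basic Multilingual Plane need a surrogate pair in UTF-16, which is exactly the decoding complexity UCS-4 avoids at the cost of four bytes per character:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	// BMP characters need one UTF-16 code unit; anything above U+FFFF
	// needs two (a surrogate pair).
	for _, r := range []rune{'é', '€', '🙂'} {
		fmt.Printf("U+%04X -> %d UTF-16 code unit(s)\n", r, len(utf16.Encode([]rune{r})))
	}
}
```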
There is of course also the argument that the code and tables for interpreting UTF-8 will easily fit in L1 cache these days, and it can therefore be used instead of UTF-16. I guess this indeed works best in combination with immutable strings, which are a sensible strategy considering that in-place modification of strings is just not something one should do in Unicode. One might consider this the more novel and pragmatic approach. Not sure I have a strong opinion here.
Apologies for my lack of sophistication on this topic, but I just wanted to drop a couple more thoughts here and learn from any responses they generate. :) In no particular order...
- It seems possible to me that strings might change their position in this diagram, or possibly not be considered [edit] ~~collections~~ sequences at all: https://opendylan.org/books/drm/Collection_Classes#XREF-1400 I.e., if a string is a sequence of (UTF-8) bytes, does iterating over it give you a Unicode code point at a time, as in Go? What do copy-sequence, size, et al. do? I find the Go way (len counts bytes and range iterates over code points) to be a real potential "gotcha" (see the sketch after this list): https://play.golang.org/p/MjhCHc7lkhh
- To aid in transitioning to Unicode we could create a <text> class with which to incrementally replace <string> and at the same time drop some of the mental baggage we all (?) have related to the term "string". A lot of work could be done to make <text> and its API work as we want before we have to start implementation of run-time and compiler support. Similar for <char> or <rune> instead of <character>. (Plus, shorter names than <string> and especially <character> are a win in my book.)
- We have a tag allocated to characters and a tag allocated to Unicode characters: https://opendylan.org/documentation/hacker-guide/runtime/object-representation.html Assuming we no longer have a distinction between the two, I assume we can free up one of these tags. What are the implications of this? Is there a big win in using a tag for some other type? Could we reduce to a single tag bit?
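Here's the Go sketch referred to in the first bullet, essentially what the linked playground shows: len counts bytes, range decodes code points but still reports byte offsets, and counting code points takes an explicit call. Whatever we choose for size, element, and iteration should avoid this kind of surprise:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "héllo"
	fmt.Println(len(s))                    // 6: len counts bytes
	fmt.Println(utf8.RuneCountInString(s)) // 5: code points need an explicit call
	for i, r := range s {                  // range decodes code points,
		fmt.Printf("%d: %c\n", i, r) // but i is still a *byte* offset
	}
}
```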
- Strings would change their position in that they would still be in <sequence> (though not <mutable-sequence>). For ASCII-only strings, the representation size would equal the string size, so that element, returning a Unicode scalar (code point), would be constant time. (The time complexity of forward iteration with an iterator, the usual case, would also be unaffected.) Otherwise element could not be guaranteed to be constant time, so <string> would not be a <vector>. (See the sketch after this list.)
- Not sure I think this is needed. To do real human "text" you really need some sort of markup representation like the HTML5 or XML DOM. We should provide facilities to make working with this sort of rich text more convenient if we can.
- We can get rid of the #b11 tag currently used for <unicode-character>, but I'm not sure yet what to use it for. One possibility is to use #b01 for even <integer> and #b11 for odd <integer>.
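And the sketch referred to in the first bullet above, in Go for illustration (the flag and helper are hypothetical): for an all-ASCII string the representation size equals the character count, so element is a direct byte index; otherwise the UTF-8 data has to be decoded from the start:

```go
package main

import "fmt"

// element returns the i-th Unicode scalar of s. If the string is known
// to be ASCII-only (e.g. via a flag set at construction), indexing is a
// constant-time byte access; otherwise the UTF-8 data must be decoded
// from the start, so element is not constant time in general.
func element(s string, asciiOnly bool, i int) rune {
	if asciiOnly {
		return rune(s[i]) // byte index == character index
	}
	for _, r := range s { // decode forward until the i-th scalar
		if i == 0 {
			return r
		}
		i--
	}
	panic("element index out of range")
}

func main() {
	fmt.Printf("%c %c\n", element("hello", true, 1), element("héllo", false, 1))
}
```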
> One possibility is to use #b01 for even <integer> and #b11 for odd <integer>.
That's intriguing; would that just be to squeeze an extra useful bit into the integer representation? As an alternative, on 64-bit platforms you could use the freed tag to store (single) floats.
(edited for formatting only -cgay)
Yes, doing <single-float> would probably be more useful than just adding a single bit to <integer>. Moving integer register values to and from FP registers might stall the pipeline pretty badly, though. Experimentation needed.
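For illustration, a Go sketch of both ideas; the tag assignments are assumptions, not a settled design. Treating #b01/#b11 as even/odd <integer> amounts to a 1-bit integer tag (one extra bit of precision), while the alternative stores a <single-float>'s 32 bits in the upper half of a 64-bit tagged word:

```go
package main

import (
	"fmt"
	"math"
)

// With #b01 = even <integer> and #b11 = odd <integer>, the integer's own
// low bit becomes part of the tag, i.e. integers effectively carry a
// 1-bit tag (low bit 1) and gain one bit of precision.
func tagInt(n int64) uint64   { return uint64(n)<<1 | 1 }
func untagInt(w uint64) int64 { return int64(w) >> 1 } // arithmetic shift keeps the sign

// Alternatively, the freed pattern could mark an immediate <single-float>:
// stash the float32 bits in the upper 32 bits of a 64-bit word.
// The tag value 0b11 here is just an assumption for illustration.
const floatTag = 0b11

func tagFloat(f float32) uint64   { return uint64(math.Float32bits(f))<<32 | floatTag }
func untagFloat(w uint64) float32 { return math.Float32frombits(uint32(w >> 32)) }

func main() {
	fmt.Println(untagInt(tagInt(-42)), untagFloat(tagFloat(3.25))) // -42 3.25
}
```

Whether the float variant pays off depends on the cost of the integer-to-FP register moves mentioned above.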