core String.slice is broken on strings with multi-byte Unicode characters

String.slice is broken on strings with multi-byte Unicode characters

Open billstclair opened this issue 5 years ago • 8 comments

The JavaScript implementing String.slice is not Unicode-enabled. It uses .slice, which is a byte-array operator.

SSCCE: https://ellie-app.com/3bs5VffDGP2a1

module Main exposing (main)

import Browser
import Html exposing (Html, text)

view : () -> Html ()
view _ =
    text <| String.dropLeft 1 "🙈🙉🙊"

main : Program () () ()
main =
    Browser.sandbox
        { init = ()
        , view = view
        , update = \_ m -> m
        }

This displays "�🙉🙊", not "🙉🙊", as expected.

A fix that works at https://github.com/elm/core/blob/1.0.0/src/Elm/Kernel/String.js#L151 is:

var _String_slice = F3(function(start, end, str) {
    return Array.from(str).slice(start, end).join('');
});

Note that I haven't actually tested that against an Elm program, but I did test the returned expression in a JS debugger. It has the problem that it makes a full copy of the string into an array. There is probably a way to do this without allocating any memory except the returned string.

String.left, String.right, String.dropLeft, and String.dropRight all call String.slice, so they're all broken.

This bug was also in 0.18 (https://github.com/elm-lang/core/blob/5.1.1/src/Native/String.js#L89)

Thanks to @stephenreddek in Elm Slack for finding this.

Aug 28 '18 22:08 billstclair

This affects slice, left, right, dropLeft, dropRight.

While 0.18 also used slice in the same manner, I believe that it is important to treat surrogate pairs consistently between all String functions. I ran into this issue when upgrading teepark/elmoji. It uses uncons to move through a list and look for matches to known emojis. Later, it calls dropLeft with the number of characters that it counted in the emoji match. This now causes an issue because uncons will see "🖖" as 1 character and dropLeft will treat it like 2.

For now, there are multiple Elm-only fixes such as using uncons n times or using List.drop after calling String.toList and finally using String.fromList.

EDIT: Regex.Match indexes are also unaware of surrogate pairs. It matches String.slice, but would not match uncons.

Aug 29 '18 14:08 stephenreddek

I am no longer convinced that this is a bug. Nothing forces a String to contain well-formed UTF-8. It's just a sequence of bytes. I use it that way in CustomElement.FileListener in billstclair/elm-custom-element, and am glad for that feature.

The String documentation says that "Strings are not lists of characters." This is pretty obvious from the String.fromList example (from an elm repl transcript):

> String.length <| String.fromList ['🙈','🙉','🙊']
6 : Int

Oct 15 '18 11:10 billstclair

I think the choice here is up to the language design. I think one of the important mentions is that operations on strings is not consistent. Unfortunately, the behavior as it is seems to bind Elm to the underlying JavaScript nuances with string handling.

It’s unfortunate that one might need to use a string type to represent bytes. Perhaps the work going on with elm in the binary space can allow strings to represent Unicode characters rather than 8 bit chunks.

Oct 19 '18 03:10 dynajoe

I think it's worthwhile to look at the Elixir implementation of strings. This conference talk is pretty enlightening: https://youtu.be/zZxBL-lV9uA

Oct 19 '18 05:10 norpan

Or, for those who prefer text: https://elixir-lang.org/getting-started/binaries-strings-and-char-lists.html

In Lisp, a string is a sequence of characters. A character is one unicode code point. Representations vary, but the lisp I use stores 64 bits per character by default. Lots of wasted space, but in these days of multi-gigabyte RAM, I haven't noticed a problem.

I'm now thinking that this is best fixed by documentation, just letting people know that a string is a sequence of bytes, and that those bytes are interpreted as UTF-8 when read or printed, but that other operations work just on the bytes.

Oct 19 '18 09:10 billstclair

One important thing to note is that often it's not enough to just consider each Unicode character as a separate character, but you need to include any combining characters following the base character and consider those as a single unit.

For example a string with "a" (U+0061) followed by "`" (U+0300) is shown as single character "à" while it is actually two Unicode characters and three bytes in UTF-8 format.

EDIT: That article about Elixir strings doesn't seem to consider this at all.

Oct 19 '18 11:10 malaire

More information here: http://unicode.org/faq/char_combmark.html

Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.

So since this is such a complex topic, I personally prefer the solution used e.g. in PHP where string is just a sequence of bytes and nothing more. (Now PHP does have additional functions for handling multibyte characters, and Elm could also add either new type or new functions for some use cases, but I think that base string should be just bytes.)

Oct 19 '18 12:10 malaire

Just a note: Elm/JavaScript strings are not byte arrays as said in some previous comments but arrays of UTF-16 code units.

Oct 09 '19 08:10 malaire

core core copied to clipboard

String.slice is broken on strings with multi-byte Unicode characters

core
core copied to clipboard