rfcs Unicode String

RFC for changes to String class to better support unicode.

Aug 03 '20 02:08 rowland66

Hi Joe,

Thanks for providing feedback so quickly. I have make some comments below.

Rowland E. Smith +1-201-396-3842 On 8/3/20 11:29 AM, Joe Eli McIlvain wrote:

@jemc commented on this pull request.

Thanks for putting this together. I have left some comments about things I would prefer to see changed.

Also, I suspect that this RFC may not be totally complete yet, in terms of tracking what all of the implications would be. I think it would be valuable for someone to start implementing this change (particularly the method signatures, even if the code for encoding/decoding is not fully robust yet) so that we could examine in more detail all of the things that will be affected by this change.

I also suspect there could be certain implications within this RFC that may be implicitly understood by the author, but may be not clear to the reader, resulting in confusion about exactly how these changes will look. The step I mentioned above of putting together a proof of concept level implementation could also help with this problem of potential misunderstandings of this very complex change.

I agree that a sample implementation would help. I have 70% of an implementation done. How I go about sharing that with the group? Create a pull request?

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r464472815:

+Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32). + +class val String is (Seq[U32] & Comparable[String box] & Stringable) + +Where String functions took parameters of type U8 or returned values of U8, they will take or return values of type U32. All indexes will be interpreted as codepoints. Only functions that manage String allocated memory will continue to refer to bytes. + +The following additional changes will be made to the String class: +1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes. +1. The truncate() function will only support len parameter values less than the string size. +1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. +1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes(). +1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes. + +Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate.

Adding these type names to the builtin package means that no Pony program will be able to use those type names for user-defined types. The words |Encoder| and |Decoder| are incredibly general, and I can imagine many Pony programs or libraries using them to refer to things other than this string encoding use case.

I suggest using the names |StringEncoder| and |StringDecoder| instead.

I agree, I will make the change.

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r464477233:

+1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes(). +1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes.

+Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. +``` +trait val Encoder

fun encode(codepoint: U32): (USize, U8, U8, U8, U8)

+trait val Decoder

fun decode(bytes: U32): (U32, U8) +```

+The ByteSeq type defined in std_stream.pony will be changed to remove String. +``` +type ByteSeq is (Array[U8] val)

There's not much point in having this type alias if it is not a union.

The purpose of this type is to define the types for which it is safe to call |cpointer| and |size| to pass to FFI calls. In the standard library it is only safe to do those for standard library types, since we can't trust any user-defined types that implement the |cpointer| and |size| interface to give us a pointer that is valid for the given number of bytes.

Given that |size| is no longer the right method to call (it needs to call |byte_size| instead, I recommend just removing this union altogether and using |(String | Array[U8] val)| in places where it is currently used, and in each such place we would need to introduce a |match| statement that checks which type it is and uses |size| for Array and |byte_size| for String.

I agree. I left the type in to improve backward compatibility.

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r464478365:

+Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter.

+A new StringIter interface will be added to std_stream.pony +``` +interface val StringIter

"""

An iterable collection of String box.

"""

fun values(): Iterator[this->String box] +```

+Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed. + +Change Writer in buffered/writer.pony to accept Encoder parameters in the write() and writev() functions. + +Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.

I think you meant to put this in the |files| package (where it can live alongside |FileLines|) rather than the |buffered| package.

⬇️ Suggested change -Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class. +Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class. Yes, that is a mistake. I will fix.

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r464479525:

+Unicode is an accepted standard for representing text in all languages used on planet earth. Modern programming languages should have first class support for unicode.

+# Detailed design + +Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32). + +class val String is (Seq[U32] & Comparable[String box] & Stringable) + +Where String functions took parameters of type U8 or returned values of U8, they will take or return values of type U32. All indexes will be interpreted as codepoints. Only functions that manage String allocated memory will continue to refer to bytes. + +The following additional changes will be made to the String class: +1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes. +1. The truncate() function will only support len parameter values less than the string size. +1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. +1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes().

I suggest that we rename |values| to |bytes| instead of removing it entirely, so that the user can still iterate over bytes when they need to, without introducing an allocation to convert to |Array[U8] val|

||A bytes function would need to take an Encoder as a parameter, but I am OK with adding it.

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r464486203:

+This can be presented as a continuation of existing Pony patterns.

+The Pony tutorial will need to be updated to reflect these changes. + +# How We Test This + +Extend CI coverage to cover these changes and new features. + +# Drawbacks + +This is a change to the String API, and as such will break existing programs. String functions returning or taking as parameters U8 values now returning or taking U32 values will probably be the most common source of program breakage. Also, programs that used the existing unicode friendly functions in String will need to change, but they should be greatly simplified. + +# Alternatives + +1. Leave the String class as it is. This is likely to result in hidden errors in many programs that work with Strings as they will not work correctly with unicode data if they encounter it. It will also make it difficult to use Pony in settings where ASCII is not sufficient for local language (most non English speaking countries). +1. A more complicated implementation with multiple String types capable of storing text data internally using different byte encodings. This approach would improve performance in situations where strings needed to be created from bytes of various encodings. The String type could be matched to the native byte encoding to eliminate any need for conversion. Such an implementation would add complexity to the String API as it would require use of a String trait with multiple implementations. It would also add considerable complexity to the String class implementation.

As I mentioned in our discussion in Zulip https://ponylang.zulipchat.com/#narrow/stream/192795-contribute-to.20Pony/topic/Changes.20to.20String.20class/near/202574862, I would prefer to see a version of this RFC that takes the encoding as a type parameter to all encoding-aware operations within the |String| class, using UTF-8 as the default (so that everything in this RFC works for the user as you describe by default), but so that non-UTF-8 use cases can also be supported.

This does not necessarily mean you need multiple String classes or a String trait - just multiple encoding classes and an encoding trait (I believe you already have an encoder/decoder traits as part of this RFC, so that could potentially be used).

I am not sure what you mean by the "encoding-aware operations". I think that what would work would be adding an "internal encoding" parameter to all of the String constructors. This would require storing the internal encoding in every String instance, so some impact on memory footprint. Adding an "internal encoding" parameter to every encoding-aware operation would cover many operations. For example, the eq function would need to be encoding aware because it takes a String as a parameter. It would not be possible to compare 2 strings without knowing how the codepoints were encoded in each of them. It would also require that any class that takes a String as a parameter would need to know the strings are internal encoding in order to call its functions. That does not sound workable.

Multiple encodings would also complicate FFI. Even in the standard lib, strings encoded with UTF-16 would require some changes in the pony runtime.

I would prefer to defer this enhancement as it could be implemented later with minimal disruption if needed, and it would significantly increase the complexity and work required to make this String change.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ponylang/rfcs/pull/179#pullrequestreview-460103401, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGEQLV4FJR2Z4KZP66XU3WDR63JXNANCNFSM4PS4AMOQ.

Aug 04 '20 14:08 rowland66

Hi @mfelsche,

Thanks for the feedback. Comments below.

Rowland E. Smith +1-201-396-3842 On 8/3/20 4:25 PM, Matthias Wahl wrote:

@mfelsche commented on this pull request.

Thanks for creating this RFC!

I left a few comments.

The main concern here is that this very substantial change to the language is missing some motivational comments. Especially: What will be improved? What will this change make possible, which isnt possible yet?

I wanna see the costs of implementing this RFC be justified.

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r464634959:

@@ -0,0 +1,97 @@ +- Feature Name: unicode-string +- Start Date: 2020-07-30 +- RFC PR: (leave this empty) +- Pony Issue: (leave this empty)

+# Summary + +Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value.

It is not 100% clear to me what the internal representation of String will be. Will it remain a pointer and a size only that the bytes pointed to will be actual |U32| codepoints without encoding? Or will the actual underlying data remain encoded? This RFC should also state why this approach was chosen and the other ones dicarded.

The internal representation of String will remain as it is now except that the pointer will always reference a UTF-8 encoding of the strings codepoints. Today, there is no such certainty.

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r464639710:

+The following additional changes will be made to the String class: +1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes. +1. The truncate() function will only support len parameter values less than the string size. +1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. +1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes(). +1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes. + +Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. +``` +trait val Encoder

fun encode(codepoint: U32): (USize, U8, U8, U8, U8)

+trait val Decoder

fun decode(bytes: U32): (U32, U8)

What is the meaning of the elements of the returned tuple? Should this function be partial considering that not all byte-combinations are valid in every encoding?

The tuple returned by the encode function contains the number of bytes required to encode the codepoint (1-4), and the byte values of the encoded codepoint. Perhaps it would be more efficient to compact the 4 bytes in a single U32. That is what is done with decode. In the case of bytes that do not correctly decode into codepoints, the unicode replacement character (0xFFFD) is inserted for each invalid byte consumed. This is how the existing UTF-8 decoding function worked, and I think it is the best approach. Without it, many String constructors become partial, and that is a big backward compatibility issue.

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r464641600:

+The following additional changes will be made to the String class: +1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes. +1. The truncate() function will only support len parameter values less than the string size. +1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. +1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes(). +1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes. + +Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. +``` +trait val Encoder

fun encode(codepoint: U32): (USize, U8, U8, U8, U8)

+trait val Decoder

fun decode(bytes: U32): (U32, U8)

Maybe a better signature/approach for encoding/decoding would be to encode a whole String into an Array[U8] and vice versa for decoding. This might seem more work on the implementor of |Encoder|/|Decoder|, but might allow to employ more tricks and be slightly more performant/convenient for some encodings. I also suspect, that implementing a special |Encoder|/|Decoder| is fairly rare.

The Encoder/Decoder classes are used in a variety of situations. When reading from a file or a socket, the bytes read might not decode to a characters completely. An additional few bytes from the next read might be required. This approach deals with that situation with reasonable ease. Yes, I don't expect that there will be very many Encoder/Decoder's.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ponylang/rfcs/pull/179#pullrequestreview-460306998, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGEQLV644N6VX2AEHQSKKHTR64MK3ANCNFSM4PS4AMOQ.

Aug 04 '20 15:08 rowland66

I agree that a sample implementation would help. I have 70% of an implementation done. How I go about sharing that with the group? Create a pull request?

You don't necessarily need to make a pull request - you can just make sure your branch is public on GitHub in your fork of the ponyc repo, then share the link here so we can go inspect your branch as we read the RFC.

Aug 04 '20 17:08 jemc

Regarding some of your other responses - I'm a bit confused currently, but I'll wait and look at your sample implementation so I can be sure I understand your comments correctly before I respond.

Aug 04 '20 17:08 jemc

Thanks Joe

I will distribute the link as soon as I can. My power is out due to the storm on the east coast, so it may take me a while. Also it looks like am going to miss the weekly sync call.

On Tue, Aug 4, 2020, 1:02 PM Joe Eli McIlvain [email protected] wrote:

I agree that a sample implementation would help. I have 70% of an implementation done. How I go about sharing that with the group? Create a pull request?

You don't necessarily need to make a pull request - you can just make sure your branch is public on GitHub in your fork of the ponyc repo, then share the link here so we can go inspect your branch as we read the RFC.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ponylang/rfcs/pull/179#issuecomment-668714186, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGEQLV3E73QK6M4F5E3KZRDR7A5MFANCNFSM4PS4AMOQ .I

Aug 04 '20 17:08 rowland66

I'm worried about the encoding, and it wasn't clear to me if the proposal would leave that unchanged and merely implement the Seq trait as though it were a sequence of U32 elements, or make the representation more pervasive.

With a representation as a subset of byte sequences we can have Array[U8] val -> String? and fun bytes(): this->Array[U8] box which would remain safe and allocation-less, enabling more interop and smoother interaction with existing code.

Aug 05 '20 03:08 jasoncarr0

Hi All,

My house has finally gotten electric power back, and I have been able to fork the ponyc repo and push a branch with my string class changes. The URL for the branch is: https://github.com/rowland66/ponyc/tree/string

I hope that this make some of the items that we have been discussing clearer. I am new to GitHub, so if you have any trouble with access, please let me know.

Rowland E. Smith +1-201-396-3842 On 8/4/20 1:02 PM, Joe Eli McIlvain wrote:

I agree that a sample implementation would help. I have 70% of an
implementation done. How I go about sharing that with the group?
Create
a pull request?
You don't necessarily need to make a pull request - you can just make sure your branch is public on GitHub in your fork of the ponyc repo, then share the link here so we can go inspect your branch as we read the RFC.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ponylang/rfcs/pull/179#issuecomment-668714186, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGEQLV3E73QK6M4F5E3KZRDR7A5MFANCNFSM4PS4AMOQ.

Aug 07 '20 00:08 rowland66

Rowland has updated the RFC in response to comments.

@jemc @jasoncarr0 @mfelsche can you have another look?

Aug 11 '20 18:08 SeanTAllen

Some notes from the post-sync-call on suggestions to improve the performance and flexibility of various use cases, based on some limited review of Rowland's branch so far:

add embed _array: Array[U8] as the field within String, instead of the three fields we have now that happen to be the same as Array[U8]
- this makes it so that string to array conversions can be zero allocation - every string already contains an array that it can share as a box/val. See the next bullet point for more info;
fun val array: Array[U8] val become fun array: this->Array[U8] box
- it is a superset of what we have now, and allows a readable "byte string" reference to a String ref, rather than requiring val.
add whatever methods we need to add Array[U8] that are restricted by A:U8 (e.g. read_u8) to make Array[U8] have everything that a "byte string" class needs
- this will mitigate the usability issues that will result from bytewise operations disappearing from String.
encoder as a type parameter rather than a runtime parameter
- also make the method use iftype E <: UTF8StringEncoder instead of if encoder is UTF8StringEncoder
- these together make it so that there is zero cost for detecting the encoder at runtime - it is known at compile time.

Aug 18 '20 19:08 jemc

* encoder as a type parameter rather than a runtime parameter
  
  * also make the method use `iftype E <: UTF8StringEncoder` instead of `if encoder is UTF8StringEncoder`
  * these together make it so that there is zero cost for detecting the encoder at runtime - it is known at compile time.

Though, we should perhaps verify that there is a runtime cost being paid here without the generics. I suspect given the size and structure of the method that LLVM would have an easy time eliminating it.

Aug 18 '20 20:08 jasoncarr0

Hi Joe and all,

I have added comments below.

On Tue, Aug 18, 2020 at 3:40 PM Joe Eli McIlvain [email protected] wrote:

Some notes from the post-sync-call on suggestions to improve the performance and flexibility of various use cases, based on some limited review of Rowland's branch so far:

add embed _array: Array[U8] as the field within String, instead of the three fields we have now that happen to be the same as Array[U8]

this makes it so that string to array conversions can be zero allocation - every string already contains an array that it can share as a box/val. See the next bullet point for more info;

[RS] I have no objection to doing this in principal. It would be a lot of work, and is not related to the changes that I am proposing. Does anyone know why String is not implemented this way currently? I am just concerned that there might be some subtle reason why it won't work.

fun val array: Array[U8] val become fun array: this->Array[U8] box

it is a superset of what we have now, and allows a readable "byte string" reference to a String ref, rather than requiring val.

[RS] OK. Again, this is just an improvement to String that could be implemented independently of this RFC, right?

add whatever methods we need to add Array[U8] that are restricted by A:U8 (e.g. read_u8) to make Array[U8] have everything that a "byte string" class needs

this will mitigate the usability issues that will result from bytewise operations disappearing from String.

[RS] I like this idea. This way, it would not be necessary to convert a byte array to a String just to get access to some String functions. I will look at both classes and figure out what is missing.

encoder as a type parameter rather than a runtime parameter

also make the method use iftype E <: UTF8StringEncoder instead of if encoder is UTF8StringEncoder

these together make it so that there is zero cost for detecting the encoder at runtime - it is known at compile time.

[RS] I don't understand this. Can you provide a more complete example?

—

You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ponylang/rfcs/pull/179#issuecomment-675674614, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGEQLVZFBZXHVKO3JOFVYZTSBLKKBANCNFSM4PS4AMOQ .

-- Rowland E. Smith P: (862) 260-4163 M: (201) 396-3842

Aug 20 '20 01:08 rowland66

@rowland66 The reasons that these changes are being mentioned now is because of the change to byte-wise API in the String. Efficient access to the underlying array and byte array methods is one way to allow us to access these lost APIs more effectively, without compromising on performance. It is true that I certainly don't know why the existing String has not been implemented that way.

Hence why these suggestions are intended to be part of this RFC (but note that implementation details such as the embed could be implemented in multiple PRs). Anyone else feel free to chime in if I'm not understanding correctly.

Aug 20 '20 01:08 jasoncarr0

Hi Agares and all,

Thank you for the feedback. I have added comments below, and will make changes to the RFC document.

On Tue, Oct 20, 2020 at 8:25 AM agares [email protected] wrote:

@Agares commented on this pull request.

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r508448096:

+- Feature Name: unicode-string +- Start Date: 2020-07-30 +- RFC PR: (leave this empty) +- Pony Issue: (leave this empty)

+# Summary + +Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value. + +# Motivation + +Unicode is an accepted standard for representing text in all languages used on planet earth. Modern programming languages should have first class support for unicode. + +# Detailed design + +Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32).

I think it would make sense to mention endianness here (which I assume will be the one of the machine for which the application is compiled)

For the provided encoders, I have UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE. As I understand it, UTF-8 has no concept of endianness. I will update the text above to indicate the additional provided encoders. Does this cover your concern?

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r508454400:

+The ByteSeq type defined in std_stream.pony will be changed to remove String. + +type ByteSeq is (Array[U8] val) + +Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter.

+A new StringIter interface will be added to std_stream.pony +``` +interface val StringIter

"""

An iterable collection of String box.

"""

fun values(): Iterator[this->String box] +```

+Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed.

I think it's worth clarifying here if only \r\n and \n will be considered line breaks or if support for unicode line breaks (like U+2028, U+2029) will be added

I will add this clarification. Only \r\n and \n will be considered line breaks. No change to what is done today. As per wikipedia https://en.wikipedia.org/wiki/Newline "Recognizing and using the newline codes greater than 0x7F (NEL, LS and PS) is not often done."

In text/0000-unicode-string.md https://github.com/ponylang/rfcs/pull/179#discussion_r508455329:

+Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter.

+A new StringIter interface will be added to std_stream.pony +``` +interface val StringIter

"""

An iterable collection of String box.

"""

fun values(): Iterator[this->String box] +```

+Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed. + +Change Writer in buffered/writer.pony to accept StringEncoder parameters in the write() and writev() functions. + +Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class in files/file_lines.pony.

Since the Unicode standard does not define what a character is, I think it's worth claryfing if it means codepoints, grapheme clusters or

something else

Agreed. I will change the FileCharacters class to FileCodepoints.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ponylang/rfcs/pull/179#pullrequestreview-512637773, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGEQLVYAKMNI5CLAKNDENVLSLV6UTANCNFSM4PS4AMOQ .

-- Rowland E. Smith P: (862) 260-4163 M: (201) 396-3842

Oct 21 '20 02:10 rowland66

I'd like to leave a note here that, after looking over the PR (which has not yet implemented this portion), I believe my suggested change for the capability of the array() method causes very confusing behavior ~~and unsafety in some edge cases~~ with the expected implementation.

This can be fixed by using an embedded array, or two separate methods with restricted types and necessary warnings about confusing behavior.

Dec 04 '20 20:12 jasoncarr0

We talked about this over Sync. The plan we went forward with is keeping the array method as val only. This will continue the behavior of not copying for UTF8Encoder

The low-level access can be provided by a method with the suggested type signature: fun current_byte_buffer(): this->Array[U8] box which always constructs the underlying byte array without regard to encoding. This will share the buffer with the String, but construct a new read-only Array object pointing to it.

The documentation for the method will indicate that the array will track writes until the String's buffer is reallocated due to length changes.

Dec 08 '20 19:12 jasoncarr0

Hi Jason,

I have implemented this with this function.

fun current_byte_buffer(): Array[U8] => Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc)

I did not understand what "this->Array[U8] box" would mean. As I understand it, "this->" mean return the reference capability of the receiver, but then why say box. I think that we want Array[U8] box, but since the default reference capability for a function is box, so the function will return a box also. This code seems to work. I tried it with a val, trn and ref String. If this makes sense, I will add some tests and commit the change.

Also, how does Pony garbage collection work for this type of thing. I seems that we now have 2 references to the same memory. How does the Pony runtime know than and not collect the underlying array bytes while they are still referenced?

On Tue, Dec 8, 2020 at 2:26 PM Jason Carr [email protected] wrote:

We talked about this over Sync. The plan we went forward with is keeping the array method as val only. This will continue the behavior of not copying for UTF8Encoder

The low-level access can be provided by a method with the suggested type signature: fun current_byte_buffer(): this->Array[U8] box which always constructs the underlying byte array without regard to encoding.

The documentation for the method will indicate that the array will track updates until the String's buffer is reallocated.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ponylang/rfcs/pull/179#issuecomment-740896954, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGEQLVZMTE2BHJRQRVFE5ETSTZ4VRANCNFSM4PS4AMOQ .

-- Rowland E. Smith P: (862) 260-4163 M: (201) 396-3842

Dec 09 '20 04:12 rowland66

I did not understand what "this->Array[U8] box" would mean

this->Array[U8] box is valid syntax for the return type. What that means can be factored into this->(Array[U8] box) i.e. it's the type that the current receiver views box as. Specifically this means that if the receiver is box or ref, it is indeed just box. But if the receiver is val then it means val. This is still useful given the other array method as it doesn't copy for any encoding.

In this case we want to ensure regardless of the receiver that it is read-only (so as to not break the string invariants), but we can be a bit stronger for val than just using box

Also, how does Pony garbage collection work for this type of thing. I seems that we now have 2 references to the same memory. How does the Pony runtime know than and not collect the underlying array bytes while they are still referenced?

As I understand it, the allocation as a whole is still garbage collected (and it needs to be read by the garbage collector for general arrays) and deallocation doesn't occur until neither the String nor Array are pointing to it.

Dec 09 '20 04:12 jasoncarr0

Hi Jason,

I have been struggling with this with no success. I can write three separate functions and it compiles:

fun ref current_byte_buffer_ref(): Array[U8] ref => Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc)

fun box current_byte_buffer_box(): Array[U8] box => Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc)

fun val current_byte_buffer_val(): Array[U8] val => recover val Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) end

I can see how current_byte_buffer_ref() and current_byte_buffer_box() could be combined as a single function that returned this->Array[U8] box. But I can't figure out how to handle the val case, and since a box function works with a ref, val and box receiver I need to handle all three. What am I missing?

On Tue, Dec 8, 2020 at 11:44 PM Jason Carr [email protected] wrote:

I did not understand what "this->Array[U8] box" would mean

this->Array[U8] box is valid syntax for the return type. What that means can be factored into this->(Array[U8] box) i.e. it's the type that the current receiver views box as. Specifically this means that if the receiver is box or ref, it is indeed just box. But if the receiver is val then it means val. This is still useful given the other array method as it doesn't copy for any encoding.

Also, how does Pony garbage collection work for this type of thing. I seems that we now have 2 references to the same memory. How does the Pony runtime know than and not collect the underlying array bytes while they are still referenced?

As I understand it, the allocation as a whole is still garbage collected (and it needs to be read by the garbage collector for general arrays) and deallocation doesn't occur until neither the String nor Array are pointing to it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ponylang/rfcs/pull/179#issuecomment-741523322, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGEQLV2T5RF2IYXVHNMWFUDST36ENANCNFSM4PS4AMOQ .

-- Rowland E. Smith P: (862) 260-4163 M: (201) 396-3842

Dec 11 '20 04:12 rowland66

In this case you can use a bare recover without specifying a capability. I think this is a bit of a bug that complex capabilities cannot be written as the destination cap of a recover block (although it's only needed because of pointers here rather than actual fields). That is:

  fun current_byte_buffer(): this->Array[U8] box =>
    recover Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) end

Strictly speaking you should also be able to recover to val and cover it, but it's not actually "correct" for most of the cases

Dec 11 '20 04:12 jasoncarr0

Finally got my ponyc env working again. The field access checks for recover blocks are still a tiny bit too strict, they should drop down to tag instead of yelling in this case, so you actually need to do:

  fun current_byte_buffer(): this->Array[U8] box =>
    let ptr = _ptr
    recover Array[U8].from_cpointer(ptr._unsafe(), _size, _alloc) end

But this code builds for me

Dec 11 '20 08:12 jasoncarr0

Hello, just realized this PR existed and am going to carry my feedback over from the issue I originally commented on. I provided a quick overview of some differences between what was suggested there and what is done in Rust and the benefits of those design choices.

My understanding is that this RFC encodes all strings as UTF-8 byte arrays and only allows users to operate on them using their character indices. Assuming that is correct and all reads and mutations of strings are performed on the byte representation, then I am concerned there will be issues with offering efficient string operations.

Specifically because on an arbitrary UTF-8 byte sequence computing the byte index from the character index is O(n). To my knowledge the string representations used by existing languages tend to avoid this issue by either not presenting the user with character indexing (e.g. Rust, see my prior response) or by encoding strings so that indexing by character can be done in constant time (e.g. Python, see this).

What is the intended approach of this RFC and how does it avoid or mitigate this issue?

Feb 28 '21 00:02 esoterra

Hi Kyle,

Thank you for taking the time to look at this. I am the author of the RFC, and have worked on the implementation. As you mention, the biggest down side to using a variable size encoding for unicode is that index operations are slow, and the implementation of this RFC has no particularly good solution. Pony does have an Array class that has many of the same functions as String, so for performance critical code, one option is to use Array's to hold String bytes. Also, some of the String functions take positive and negative index values. Negative indexes are relative to the end of the String, so in some cases index access using a negative index would be much faster.

In how different programming languages deal with Strings seems to be a tradeoff between speed, efficiency and correctness. As I understand it, Rust strings are UTF-8 encoded internally, but there are functions to access the underlying bytes, but they are prone to panics if used incorrectly. So Rust seems to be leaning toward correctness, but provides a loaded gun in case you want to take it and shoot yourself. Phython seems to be favoring efficiency and speed with the ability to choose between 1, 2 and 4 bytes per character strings. I assume that adding certain unicode characters to a 1 or 2 byte per character string results in some type of runtime error. So it is easier to write programs that fail at runtime unless you use 4 character byte strings, but that is pretty inefficient in most cases. The Pony implemenation in this RFC favors correctness (number 1 on the list in the Pony Philosophy).

In the RFC design, we have tried to leave room for alternate String implementations that might be better suited for some situations. The new String design also tries to remain as comaptible as possible with the existing Pony String class. The existing Pony String has a mix of functions some of which assume 1 byte per character and others that support a UTF-8 encoding. This was a confusing API that left a lot of room for programming error.

Hope this makes sense, and I am happy to discuss further if you have some ideas for improvement.

Mar 01 '21 00:03 rowland66

Hi Rowland,

Thanks for getting back to me quickly, I have a few specific comments below about things you mentioned and have tried to better outline which Rust features would best inform a Pony Unicode String implementation. Hopefully this will help clarify why I think Rust and Python act as archetypes for their chosen string strategy and why I think they are a good reference for designing a new string API.

RE: Negative Indices

That is an interesting point about negative indices. If reverse unicode segmentation is used, you can potentially traverse fewer bytes.

RE: Rust

they are prone to panics if used incorrectly

This is technically true of the Rust design, but there are panic-free options available and the ones that can panic are meant to be used with something like char_indices which ensure that you are at a valid boundary. The idiomatic ways of using Rust's String API don't meaningfully sacrifice correctness.

Additionally, the fact that it does allow panic using &a[i..j] is not a feature I am recommending Pony replicate. The rest of the API and its tradeoffs can exist entirely independent of it because a.get(i..j) provides the same functionality.

RE: Python

I assume that adding certain unicode characters to a 1 or 2 byte per character string results in some type of runtime error

This is not correct because strings in python are immutable and you cannot create such a scenario. Python can more correctly be said to emphasize correctness over speed/efficiency in the same way that concatenating Java, JavaScript, or C# strings directly does.

i.e. Python is in the "shared immutable strings" family and is optimized for that case

What would it look like to "follow Rust's example"?

Where Python offers shared immutable strings and is a good example for how to do so, Rust offers owned mutable append-only strings with an efficient and clean (ignoring the aforementioned) API.

In Pony terms, as I understand them, implementing an approach similar to Rust's would involve the following characteristics.

Internally represent strings as UTF-8
Represent characters using a 4-byte wide type
Make strings character iterable
Make strings (character, byte-index) iterable
Support byte-index based reading operations using partial functions which check whether the indices are valid (this check is quite inexpensive)
Treat the string length as its byte length
Require users to use the char and char_indices-like iterators for cases where they want to think in terms of characters, explicitly taking on the cost of iterating over the string.

Mar 01 '21 17:03 esoterra

@Kylebrown9 - if I understand your proposal correctly, it sounds like you are recommending something like: Keep byte-based operations as the first-class operations, but also have character-based operations.

This seems similar to what we have in Pony today (perhaps with the differences of: preference for UTF8 rather than UTF32 and perhaps with support for more character operations), so I'd be curious to hear you compare your suggested approach to Pony's current approach, rather than the approach proposed in this PR.

Mar 02 '21 19:03 jemc

In Rust, making a string from bytes is effectively partial because not all byte sequences are valid UTF-8 strings. Additionally, operations on strings preserve validity of the string as UTF-8, if an operation would violate this it instead fails. This means that all Rust strings can be assumed to be valid UTF-8 and as I understand it is not something Pony guarantees. Python actually makes a similar guarantee in a different way by keeping bytes and str distinct, converting between them can fail and operations that make new strings guarantee unicode validity.

As a change to Pony today

Making a string from a sequence of U8 would become partial, only succeeding when the resulting string would be valid utf-8.
Operations that use byte indices (e.g. apply, trim, chop, delete, substring) would become partial, failing if they index somewhere that is not a valid codepoint boundary.
Operations that read a U8 as a character (e.g. apply, pop, shift) could either
- be made explicitly byte-specific where possible (e.g. pop_byte),
- be converted to versions that produce U32 or a specific character type,
- or be removed.
Operations that provide a U8 as a character (e.g. update, unshift, push) could either
- be made explicitly byte-specific where possible (e.g. ushift_byte) although I wouldn't recommend this most of the time,
- be converted to versions that take in a U32 or special character type,
- or be removed.
The runes method would be kept, although I'd be tempted to call it chars, and another iterator would be added that also tells you the byte-index of the current character.

Edit: If this was going to follow the Rust approach, then a different string type(s) could/would be added to handle platform specific strings (see OsString) which may not satisfy these invariants or have other requirements.

Mar 02 '21 20:03 esoterra

Also, I realize I've ended up proposing and defending a specific prescription for how to do strings in Pony. That wasn't really my intention on arrival, I just ended up doing so in clarifying that I think it's important to offer a string API with always-valid unicode strings that offers indexing which matches the stored representation.

The language comparisons and proposal outlines I've made are just intended to demonstrate what I mean by those statements and ways they can be achieved.

Mar 02 '21 20:03 esoterra

For byte reading in this RFC, we have the current_byte_buffer method (EDIT: actually it still seems to be called array here). Do you think that's clear enough for you to find? And does it make sense to you to have byte-reading but not byte-writing?

Mar 02 '21 20:03 jasoncarr0

The only two things I would ask of a byte-based method on a string is that

it has byte or bytes in the name
and it cannot cause the string to become invalid.

There is an argument to be made that some methods might confuse users or encourage them to ignore the fact that a string is made of characters and not bytes, and in those cases it may make sense to not include them. Each method and name should be treated on a case-by-case basis and I don't necessarily have strong views on what the answer is because I don't know all of the existing idioms and use cases out there that they may impact.

Note: I don't consider a method to be "byte-based" if it does not take in or return U8 values, even if it uses byte-indices. i.e. substring and chop would not be byte-based as long as they return strings and not bytes

Edit: Forgot to directly answer your question. What you described meets the requirements I would set for it and sounds fine but others may have opinions on those choices.

Mar 02 '21 20:03 esoterra

Hi Kyle,

There is a lot to digest here. Let me start with your list from above of String features in Rust and how the compare to Pony post adoption of this RFC.

Internally represent strings as UTF-8

Same

Represent characters using a 4-byte wide type

Same. Pony has no Character type, but String functions return unicode codepoints as U32

Make strings character iterable

Same

Make strings (character, byte-index) iterable

No. As mentioned above, String has an array function that is very efficient assuming you want the String bytes UTF-8 encoded. However, the resulting Array is only safe to use if you know that your String contains only ASCII characters. The resulting Array is also a val, so it cannot be modified. Since the Arrays shares the memory used by the String, modifying the Array might corrupt the String.

Support byte-index based reading operations using partial functions which check whether the indices are valid (this check is quite inexpensive)

This would add a bunch of partial functions to String which in undesireable. Byte index values are never safe to change (increment or decrement), so their value is limited to remembing a position in a String. They could be useful if you know that your String is ASCII, but then you can use the array function and work with the String as an array of bytes. There is no right or wrong answer here. Its a decision on whether adding functions that are likely to be misused resulting in runtime errors is worth some added utility provided by correct use of the functions.

Treat the string length as its byte length

There is already a byte_size() function that serves this purpose. Knowing a String size in codepoints is more often what is needed, so I think that the size() behavior is correct.

Require users to use the char and char_indices-like iterators for cases where they want to think in terms of characters, explicitly taking on the cost of iterating over the string.

I think that it makes sense that String be a sequence of Characters, and Characters are no longer single bytes. For applications that want to work with bytes, Pony has the Array class. I think that keeping String's and Arrays of bytes separate and distinct results in less error prone code and I value that highly.

Something worth considering is whether Pony should support alternate internal presentations of String data. The intent is to leave that possibility open for the future.

Another thing to consider is optimizations to the current String class. For example, a String could maintain a boolean indicator of whether it contains any multi-byte encoded characters. In cases where it does not, character index based operations could be much faster. This does add additional data to every String instance, so such a feature would need to be carefully considered.

Mar 03 '21 04:03 rowland66

Hi Rowland,

I think so far it sounds like we're both in agreement about my priority A that Pony should "offer a string API with always-valid unicode strings".

It seems like we disagree about priority B, that it should use "indexing which matches the stored representation". My motivation for priority B could be rephrased as "indexing operations should be constant time" and I think it is born out by (1) the principle of least surprise and (2) the goal of performance.

1. Principle of Least Surprise

Popular programming languages usually offer constant time indexing for strings and I believe users now expect this. When users write for-loops that iterate over the index range or attempt to perform a series of random accesses because they don't realize this choice was made, they will encounter a performance cost that vastly exceeds their expectation, ergo surprise.

2. Performance

I think there are also other use cases where code that could have worked well on the existing string design or other options we have now will be meaningfully less performant. e.g. a compiler which tokenizes a string (which may not be all ascii) and then refers to spans of the string using character indices throughout the rest of the compiler. Whenever they want to read that span, they will take on an extra linear time cost and "remembering" the byte-index they're actually referring to is actually quite useful.

Mar 03 '21 15:03 esoterra

rfcs
rfcs copied to clipboard

Unicode String

@Agares commented on this pull request.

RE: Negative Indices

RE: Rust

RE: Python

What would it look like to "follow Rust's example"?

As a change to Pony today

1. Principle of Least Surprise

2. Performance

rfcs rfcs copied to clipboard

Unicode String

@Agares commented on this pull request.

RE: Negative Indices

RE: Rust

RE: Python

What would it look like to "follow Rust's example"?

As a change to Pony today

1. Principle of Least Surprise

2. Performance

rfcs
rfcs copied to clipboard