kotlinx-io
kotlinx-io copied to clipboard
Make `String.utf8Size()` public
I am trying to migrate Krossbow, my STOMP-over-websocket multiplatform library, and I stumbled upon this function becoming private/internal in kotlinx-io
.
I relied on it because I need to truncate a string based on its length in UTF-8 bytes (not characters). The use case is the "close reason" (the text payload) of the web socket Close
frame. As per the specification:
All control frames MUST have a payload length of 125 bytes or less and MUST NOT be fragmented.
With the utf8Size
function, I could optimize the happy path by not copying bytes around if the string was short enough. Without this function, I need to encode the string into a byte array uncondtionally, then check if it's too long, in which case I need to truncate.
I think there are probably more use cases where this is useful information to have without actually copying the bytes.
Alternatively, this method could be directly provided by kotlinx-io
instead:
/**
* Returns the biggest prefix of this string that can be represented with at most [maxLength] UTF-8 bytes.
*
* If the truncation occurs in the middle of a multibyte-character, the whole character is removed.
*/
fun String.truncateUtf8BytesLengthTo(maxLength: Int): String {
if (utf8Size() <= maxLength) {
return this
}
return encodeToByteArray().copyOf(maxLength).decodeToString().trimEnd { it == '\uFFFD' }
}
But that might be a bit too specific to be in the library (I will leave that decision to you of couse).
On non-JVM targets a value returned by utf8Size
may not be the same as encodeToByteArray().size
:
println("🌞".substring(1).utf8Size()) // 1 on all targets
println("🌞".substring(1).encodeToByteArray().size) // 1 on JVM, 3 on all other targets
For invalid UTF-16 characters (like in the example above, where a string consist of only the low surrogate), kotlinx-io
always uses ?
character as a replacement (the same way JVM does).
However, the proper replacement should be U+FFFD
(it requires 3 code units to encode) and Kotlin Stdlib uses it on all platforms other than JVM, and on JVM for compatibility reasons it also uses ?
.
There are a few options here:
- (somehow) confine
utf8Size
to theSink
s scope, to make everyone know that the returned value makes sense only forSink.writeString
; (probably) mark it asDelicateIoApi
, write a comprehensive doc describing what's wrong with it; - align
writeString
andutf8Size
behavior with Kotlin Stdlib.
Ugh! I didn't realize encodeToByteArray()
worked this way for incorrect standalone surrogates on JVM 🤯
Little experiments for those interested: https://pl.kotl.in/Yg3GwVJEc.
Kotlin Stdlib uses it
Am I right that we're talking specifically about String.encodeToByteArray()
here?
and on JVM for compatibility reasons it also uses ?.
For compatibility with what exactly? This is a Kotlin API, there is no such function in Java's strings. I find it quite unfortunate that we can't have proper multiplatform tests for this kind of things due to a deliberate inconsistency.
Or do you mean because encoreToByteArray
was JVM-only for a time, and it was wrong at that time, and now we can't go back?
For invalid UTF-16 characters (like in the example above, where a string consist of only the low surrogate), kotlinx-io always uses ? character as a replacement (the same way JVM does).
What do you mean by "always uses ? character as a replacement"? Do you mean in writeString
, regardless of the platform? It seems strange to use the ?
character when writing to sinks to be frank, because there is no way to distinguish this replacement from an actual '?'
character in the source string.
I don't think there is any compatibility guarantees kotlinx-io
needs to respect for now, given that it's a 0.x
, right? Even more so because the string itself is invalid in this state in the first place (heck we could even crash here! 😄).
So IMO the proper way to go would be to change writeString
itself to use replacement characters for invalid surrogates like this on all platforms, and that utf8Size()
takes this into account and returns 3 on all platforms. Or just make both functions fail on invalid UTF-16 char sequences 🤷
align writeString and utf8Size behavior with Kotlin Stdlib
If we change writeString
and utf8Size
, I believe we may as well make them behave more correctly than encodeToByteArray
. I don't think kotlinx-io
needs to be consistent with the stdlib's inconsistency.
For compatibility with what exactly?
With how Java encodes strings are being encoded to UTF-8.
What do you mean by "always uses ? character as a replacement"? Do you mean in
writeString
, regardless of the platform?
Exactly, writeString
unconditionally emits ?
as a replacement character: https://github.com/Kotlin/kotlinx-io/blob/c2d52202984f610054f21fd77def42e6f554bb70/core/common/src/Utf8.kt#L476
I don't think there is any compatibility guarantees
kotlinx-io
needs to respect for now, given that it's a0.x
, right?
0.x
gives us the freedom to a certain extent, but it would be nice to behave consistently with Kotlin Stdlib as some parts of kotlinx-io
may end up there eventually.
So IMO the proper way to go would be to change
writeString
itself to use replacement characters for invalid surrogates like this on all platforms, and thatutf8Size()
takes this into account and returns 3 on all platforms.
Your example showcased that even though utf8Size()
is kotlinx-io
-specific, it's quite tempting to use it in conjunction with String::encodeToByteArray
. So, it's better to return a value that'll be correct for both Stdlib and kotlinx-io
string conversions.
Or just make both functions fail on invalid UTF-16 char sequences 🤷
Then writeString
should fail too (which will be quite unfortunate), as being able to write an ill-formed string but not being able to query its size beforehand seems a bit off.
With how Java encodes strings are being encoded to UTF-8.
Ah I see, so rather consistency with Java APIs, but not compatibility in the sense that it would break things if the stdlib had done it differently. To be frank I don't think similar behavior with Java APIs is necessarily something to aim for in cases where the behavior is debatable, but that's just a personal opinion.
Your example showcased that even though utf8Size() is kotlinx-io-specific, it's quite tempting to use it in conjunction with String::encodeToByteArray.
Fair enough. Though this function (like most) assumes a well-formed input string. In that case utf8Size()
and encodeToByteArray()
behave consistently. The fun part happens when truncating, and decodeToString
actually does decode invalid bytes as a replacement char. This is why this implementation is valid for valid strings.
it's better to return a value that'll be correct for both Stdlib and kotlinx-io string conversions
I think the best would be to fix the stdlib's function then :) But even without this, I still believe it's worth being inconsistent with the stdlib if it's more correct to behave this way. Especially in the context of KMP, having different behavior on different platforms is really undesirable.
Or just make both functions fail on invalid UTF-16 char sequences
Then writeString should fail too
Well.. I did write "both functions" 😄 I would be quite ok with that option actually. Not being able to use writeString()
with an invalid string is probably acceptable. You can still write any byte you want using other functions.