kotlinx-io

Provide String to ByteString conversion using ASCII encoding

Open SPC-code opened this issue 1 year ago • 6 comments

In protocol parsing and writing, we frequently need to work with single-byte encoded strings that are expected to consist only of ASCII characters. Please add the ability to convert a string literal to a ByteString using a character-to-byte transformation that checks for non-ASCII characters.

SPC-code avatar Jul 10 '23 06:07 SPC-code

@SPC-code would something like these work for you?

// Option 1: encode as UTF-8, then verify that every character took exactly one byte.
fun String.encodeToAsciiByteString(): ByteString {
    val bstr = this.encodeToByteString()
    if (bstr.size != length) throw IllegalArgumentException("String is not an ASCII string: $this")
    return bstr
}

// Option 2: check each character and append it as a single byte.
fun String.encodeToAsciiByteString(): ByteString {
    return buildByteString(length) {
        this@encodeToAsciiByteString.forEach {
            if (it.code > Byte.MAX_VALUE || it.code < Byte.MIN_VALUE) {
                throw IllegalArgumentException("Character could not be encoded using ASCII: $it")
            }
            append(it.code.toByte())
        }
    }
}

fzhinkin avatar Jul 10 '23 09:07 fzhinkin
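
For illustration, the second suggestion above could be called as follows. This is only a sketch: it assumes the kotlinx-io-bytestring artifact is on the classpath, and `encodeToAsciiByteString` is the helper name proposed in this thread, not an existing kotlinx-io function.

```kotlin
import kotlinx.io.bytestring.ByteString
import kotlinx.io.bytestring.buildByteString

// Hypothetical helper (name taken from the suggestion in this thread; not part of kotlinx-io).
// Rejects any character outside the 7-bit ASCII range before appending it as one byte.
fun String.encodeToAsciiByteString(): ByteString =
    buildByteString(length) {
        for (ch in this@encodeToAsciiByteString) {
            require(ch.code <= 0x7F) { "Character could not be encoded using ASCII: $ch" }
            append(ch.code.toByte())
        }
    }

fun main() {
    // A protocol keyword made only of ASCII characters encodes to one byte per character.
    val method = "GET".encodeToAsciiByteString()
    println(method.size) // 3

    // A string containing U+00E9 is rejected with IllegalArgumentException.
    val rejected = runCatching { "caf\u00e9".encodeToAsciiByteString() }.isFailure
    println(rejected) // true
}
```

Failing loudly on the first non-ASCII character means a malformed protocol token is caught where it is built, not after it has been written to the wire.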

I've done it differently: https://github.com/SciProgCentre/dataforge-core/blob/2aba1b48dce011906231ba5ab67353f9901cadfa/dataforge-io/src/commonMain/kotlin/space/kscience/dataforge/io/ioMisc.kt#L12-L19

But the important thing is to have this API. The implementation could change in the future.

SPC-code avatar Jul 10 '23 09:07 SPC-code

Plus an option for extended ASCII would be good to have.

lppedd avatar Oct 13 '23 10:10 lppedd

@lppedd could you please elaborate on what you mean by "extended ASCII"?

fzhinkin avatar Oct 13 '23 15:10 fzhinkin

@fzhinkin I meant the standard ASCII + the other 128 code points.
But I forgot that the extended part (the additional 128) is not standard, although maybe the general consensus is on the Windows-1252 or ISO 8859-1 charsets.

lppedd avatar Oct 13 '23 15:10 lppedd

I believe such scenarios require an explicit encoding routine that uses Windows-1252 or some other 8-bit encoding. Silently falling back to some default charset is not a great option, as it allows potentially incorrect data to be encoded without anyone noticing the problem. At the moment, there are no particular plans to support charset encodings other than UTF-8.

fzhinkin avatar Oct 13 '23 16:10 fzhinkin
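
As a rough sketch of what an explicit 8-bit encoding routine might look like, the example below encodes a string as ISO 8859-1, where code points 0 to 255 map directly to byte values. The name `encodeToLatin1Bytes` is hypothetical and not part of kotlinx-io. To keep the sketch free of dependencies it returns a plain `ByteArray`; wrapping the result in a `ByteString` is left out.

```kotlin
// Hypothetical helper, not part of kotlinx-io: explicit ISO 8859-1 (Latin-1) encoding.
// Each character must have a code point in 0..255, which becomes the byte value directly.
fun String.encodeToLatin1Bytes(): ByteArray =
    ByteArray(length) { index ->
        val code = this[index].code
        require(code <= 0xFF) {
            "Character at index $index cannot be encoded as ISO 8859-1: ${this[index]}"
        }
        code.toByte()
    }

fun main() {
    // U+00E9 fits in a single ISO 8859-1 byte (0xE9 = 233).
    val bytes = "caf\u00e9".encodeToLatin1Bytes()
    println(bytes.size)                // 4
    println(bytes[3].toInt() and 0xFF) // 233

    // U+20AC (euro sign) is outside ISO 8859-1, so the call fails instead of guessing.
    println(runCatching { "\u20ac".encodeToLatin1Bytes() }.isFailure) // true
}
```

The euro sign shows why the charset has to be named explicitly: Windows-1252 assigns it the byte 0x80, while ISO 8859-1 cannot represent it at all, so the two "extended ASCII" candidates disagree on the same input.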