Add bytes.to_lower and bytes.to_upper

Open laytan opened this issue 2 years ago • 5 comments

laytan, Mar 29 '23 17:03

the strings package may already have this implemented

jon-lipstate, Apr 07 '23 20:04

Any particular motivation, @laytan?

  1. I don't know whether it makes sense for the bytes package to have case conversion. The package doesn't assume any particular encoding, and the strings package already provides this:

     ```odin
     foo := []u8{65, 66, 67} // ABC
     bar := strings.to_lower(string(foo))
     ```

  2. Does Unicode guarantee that the upper- and lowercase versions of a glyph encode to the same length? If it can't guarantee that (including for future glyphs), then converting in place like this isn't safe, as the result may expand.
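
The length concern in the second point can be checked mechanically for a given buffer. The following is a minimal sketch, not part of the proposed change: `upper_preserves_size` is a hypothetical helper name, and it only covers single-rune mappings, since `unicode.to_upper` takes one rune and returns one rune. Bytes that do not decode as valid UTF-8 are treated as unsafe, which is an assumption of this sketch.

```odin
package main

import "core:fmt"
import "core:unicode"
import "core:unicode/utf8"

// Hypothetical helper: reports whether every rune in buf upper-cases to a
// rune with the same UTF-8 encoded size. If it returns true, an in-place
// conversion with unicode.to_upper would neither grow nor shrink the buffer.
upper_preserves_size :: proc(buf: []u8) -> bool {
	rest := buf
	for len(rest) > 0 {
		// Decode the next rune and the number of bytes it occupies.
		r, size := utf8.decode_rune(rest)
		// Invalid input decodes to the replacement rune with size 1,
		// so the sizes differ and the buffer is reported as unsafe.
		if utf8.rune_size(unicode.to_upper(r)) != size {
			return false
		}
		rest = rest[size:]
	}
	return true
}

main :: proc() {
	foo := []u8{65, 66, 67} // ABC
	fmt.println(upper_preserves_size(foo))
}
```

A check like this only describes the mapping tables as they exist today; it cannot answer the question about future glyphs.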

I'm curious what @gingerBill thinks, but I'm reluctant to add this change considering point 1 especially.

Kelimion, Apr 08 '23 07:04

Most of core:bytes is a 1:1 to core:strings. Maybe we should remove all the duplicated procedures?

Lperlind, Apr 08 '23 07:04

> Most of core:bytes is a 1:1 to core:strings. Maybe we should remove all the duplicated procedures?

There is a case to be made to have both, and they could have subtly different behaviour based on strings having an encoding and bytes not assuming one.

Kelimion, Apr 08 '23 07:04

The motivation was that this avoids converting to a string and back when you want bytes in and bytes out. It also works in place, whereas the strings package allocates.
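
To illustrate the "bytes in, bytes out, no allocation" idea, here is a minimal sketch restricted to ASCII, where the length can never change. `ascii_to_lower_in_place` is a hypothetical name used only for illustration; it is not an existing core:bytes procedure and is not the implementation proposed here.

```odin
package main

import "core:fmt"

// Hypothetical helper, not an existing core:bytes procedure.
// Lower-cases ASCII letters in place. Every other byte is left untouched,
// so the buffer length never changes and nothing is allocated.
ascii_to_lower_in_place :: proc(buf: []u8) {
	for i in 0..<len(buf) {
		b := buf[i]
		if b >= 'A' && b <= 'Z' {
			buf[i] = b + ('a' - 'A')
		}
	}
}

main :: proc() {
	foo := []u8{65, 66, 67} // ABC
	ascii_to_lower_in_place(foo)
	fmt.println(string(foo)) // prints "abc"
}
```

Because it ignores everything outside `A` to `Z`, this sketch sidesteps the encoding question entirely. A Unicode-aware version would have to deal with mappings that change the encoded length.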

The bytes package already has a couple of procedures that use runes/unicode, so I thought it fit. As for it already being in the strings package, that is the case for most procedures in the bytes package.

As for your 2nd point @Kelimion, upon further investigation, Unicode has a couple of cases where the byte length changes between cases (reference).

I don't think the unicode package implements these cases though (unimplemented or a bug, I don't know). I wrote this small script to verify:

```odin
package main

import "core:fmt"
import "core:unicode"
import "core:unicode/utf8"

chars :: []string{
        "ǰ", // Latin Small Letter J with Caron
        "ﬀ", // Latin Small Ligature Ff
        "ῗ", // Greek Small Letter Iota with Dialytika and Perispomeni
}

main :: proc() {
        for tc in chars {
                // Decode the single rune in each test string and re-encode it.
                ch, ch_size := utf8.decode_rune(tc)
                ch_bytes, _ := utf8.encode_rune(ch)
                fmt.printf("ch: %v\n", ch)
                fmt.printf("ch_bytes: %v\n", ch_bytes)
                fmt.printf("ch_size: %v\n", ch_size)

                // Upper-case the rune and compare its encoded size.
                upper_ch := unicode.to_upper(ch)
                fmt.printf("upper_ch: %v\n", upper_ch)
                upper_bytes, upper_size := utf8.encode_rune(upper_ch)
                fmt.printf("upper_bytes: %v\n", upper_bytes)
                fmt.printf("upper_size: %v\n", upper_size)
                fmt.println()
        }
}
```

All of these characters come out exactly as they went in, while the document referenced above says otherwise.

laytan, Apr 09 '23 15:04

Closing stale PR.

laytan, Oct 09 '23 15:10