coreutils

`expr` is failing with multibyte chars

Open sylvestre opened this issue 3 years ago • 6 comments

It causes https://github.com/coreutils/coreutils/blob/master/tests/misc/expr-multibyte.pl to fail

$ ./target/debug/coreutils expr length αbcdef
7

GNU:

$ expr length αbcdef
6

The test needs a non-default UTF-8 locale to be generated, for example:

sudo locale-gen fr_FR.UTF-8

sylvestre avatar Feb 13 '22 09:02 sylvestre

Of course, this comes down to Rust: `String::len()` returns the length in bytes, not in characters. See https://doc.rust-lang.org/book/ch08-02-strings.html#internal-representation

Simple testcase:

fn main() {
    let s = String::from("αbcdef");
    assert_eq!(s.len(), 6);
}

=>

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `7`,
 right: `6`', src/main.rs:3:5
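
For comparison, here is a minimal sketch of the same test case that counts Unicode scalar values instead of bytes. The helper names `byte_length` and `char_length` are hypothetical, made up for illustration, and are not taken from the uutils code:

```rust
/// Length in bytes of the UTF-8 encoding (what `str::len` reports).
fn byte_length(s: &str) -> usize {
    s.len()
}

/// Length in Unicode scalar values (`char`s).
fn char_length(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let s = "αbcdef";
    // 'α' (U+03B1) takes two bytes in UTF-8; the other five letters take one each.
    assert_eq!(byte_length(s), 7);
    assert_eq!(char_length(s), 6);
    println!("bytes = {}, chars = {}", byte_length(s), char_length(s));
}
```

With `chars().count()` the result matches the 6 that GNU `expr` prints in a UTF-8 locale.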

sylvestre avatar Feb 13 '22 12:02 sylvestre

I did some extra testing to check whether we need Unicode segmentation here, and we don't: GNU expr outputs a length of 2 for this emoji, which matches the `chars()` count rather than the grapheme count:

[src/main.rs:4] "🇳🇱".len() = 8
[src/main.rs:5] "🇳🇱".chars().count() = 2
[src/main.rs:6] UnicodeSegmentation::graphemes("🇳🇱", true).count() = 1

tertsdiepraam avatar Feb 13 '22 12:02 tertsdiepraam

Yeah, I am working on a fix :)

sylvestre avatar Feb 13 '22 12:02 sylvestre

To reproduce: bash util/run-gnu-test.sh tests/misc/expr-multibyte

sylvestre avatar Feb 13 '22 13:02 sylvestre

Actually, my patch was wrong: it should take the locale into account.

$ LANG=C expr length αbcdef
7
$ LANG=fr_FR.UTF-8 expr length αbcdef
6

It seems that we should use MB_CUR_MAX to find out how many bytes a character can take in the current locale.
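
A rough sketch of what locale-dependent counting could look like, using only the Rust standard library. This is not the actual fix and not how GNU does it. Since std does not expose `MB_CUR_MAX`, the sketch assumes it is enough to look at the `LC_ALL`, `LC_CTYPE` and `LANG` environment variables (in POSIX precedence order) and to check the codeset part of the locale name. It also ignores whether the locale is actually installed. The function names `effective_ctype`, `is_utf8_locale` and `expr_length` are hypothetical:

```rust
use std::env;

/// Pick the effective LC_CTYPE locale name using the POSIX precedence
/// LC_ALL > LC_CTYPE > LANG, skipping unset or empty values.
/// Falls back to "C" when none of them is set.
fn effective_ctype<'a>(
    lc_all: Option<&'a str>,
    lc_ctype: Option<&'a str>,
    lang: Option<&'a str>,
) -> &'a str {
    [lc_all, lc_ctype, lang]
        .iter()
        .copied()
        .flatten()
        .find(|v| !v.is_empty())
        .unwrap_or("C")
}

/// Heuristic (an assumption of this sketch): treat the locale as UTF-8
/// when its codeset part says so, as in "fr_FR.UTF-8" or "en_US.utf8".
fn is_utf8_locale(name: &str) -> bool {
    // Locale names look like language[_territory][.codeset][@modifier].
    let without_modifier = name.split('@').next().unwrap_or(name);
    match without_modifier.split_once('.') {
        Some((_, codeset)) => {
            let c = codeset.to_ascii_lowercase();
            c == "utf-8" || c == "utf8"
        }
        None => false,
    }
}

/// Count characters in a UTF-8 locale and bytes otherwise.
fn expr_length(s: &str, utf8: bool) -> usize {
    if utf8 {
        s.chars().count()
    } else {
        s.len()
    }
}

fn main() {
    let lc_all = env::var("LC_ALL").ok();
    let lc_ctype = env::var("LC_CTYPE").ok();
    let lang = env::var("LANG").ok();
    let name = effective_ctype(lc_all.as_deref(), lc_ctype.as_deref(), lang.as_deref());
    println!("{}", expr_length("αbcdef", is_utf8_locale(name)));
}
```

Under these assumptions, the two cases above behave the same way: the string counts as 7 when the effective locale is `C` and as 6 when it is `fr_FR.UTF-8`.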

sylvestre avatar Feb 14 '22 10:02 sylvestre

Hi all, is this still an issue?

Chris

chrisdebian avatar Feb 12 '25 19:02 chrisdebian

Closing this issue, it looks like the issue has been fixed in the meantime and the GNU test tests/expr/expr-multibyte.pl passes.

cakebaker avatar Nov 12 '25 10:11 cakebaker