Suggestion: Section 4.3 `first_word` example could be updated to use simpler, UTF-8-friendly iteration.

Open JennDahm opened this issue 3 years ago • 2 comments

The first_word() example in Section 4.3 currently iterates over the String's individual bytes, rather than Unicode characters:

fn first_word(s: &String) -> &str {
    let bytes = s.as_bytes();

    for (i, &item) in bytes.iter().enumerate() {
        if item == b' ' {
            return &s[0..i];
        }
    }

    &s[..]
}

This happens to work because every byte of a multi-byte UTF-8 sequence has its high bit set, so the ASCII space (or any other ASCII character, for that matter) can never appear in the middle of one. Even so, it seems like bad practice, and it's one of a new Rust developer's first exposures to String handling in the book. The example could be made both simpler and more UTF-8-friendly by using String::char_indices():

fn first_word(s: &String) -> &str {
    // String slices are based on byte offsets, but require you to slice at valid UTF-8
    // character boundaries. We can solve this with String::char_indices(), which gives
    // us the real UTF-8 byte offset with each unicode character, allowing us to slice
    // naturally while still handling UTF-8 character boundaries safely.
    for (idx, c) in s.char_indices() {
        if c == ' ' {
            return &s[..idx];
        }
    }
    return &s[..];
}

This also works with the "return an index" version, since idx has the same semantics as i in the original example.
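As a rough sketch of that index-returning variant: the name `first_word_end` is hypothetical, and the `&str` parameter is an assumption made so the function accepts both `String` references and literals; the book's version takes `&String`.

```rust
/// Returns the byte offset of the first space in `s`,
/// or `s.len()` if the string contains no space.
/// `first_word_end` is a hypothetical name used only for this sketch.
fn first_word_end(s: &str) -> usize {
    // char_indices() yields (byte_offset, char) pairs, so `idx` is always
    // a valid UTF-8 character boundary and is safe to slice at.
    for (idx, c) in s.char_indices() {
        if c == ' ' {
            return idx;
        }
    }
    s.len()
}
```

For "héllo wörld" this returns 6 rather than 5, because 'é' occupies two bytes; slicing with `&s[..6]` then yields "héllo".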

I understand that this isn't the chapter to get deep into Unicode handling, and there is certainly more to Unicode than individual "characters". Still, I don't think changing the example in this way would distract from the lessons about slicing, or require much (if any) explanation beyond what the existing solution already needs. It has the benefit of encouraging new developers to use UTF-8-friendly functions from the start, and it guides them toward useful char methods like is_whitespace().
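To illustrate the is_whitespace() point, here is a minimal sketch, not a proposal for the book's exact wording. The name `first_word_ws` is hypothetical, and taking `&str` instead of `&String` is an assumption for convenience.

```rust
/// Returns the text before the first Unicode whitespace character in `s`,
/// or all of `s` if it contains no whitespace.
/// `first_word_ws` is a hypothetical name used only for this sketch.
fn first_word_ws(s: &str) -> &str {
    for (idx, c) in s.char_indices() {
        // char::is_whitespace() covers tabs, newlines, and non-ASCII
        // spaces such as U+3000, not just the ASCII space.
        if c.is_whitespace() {
            return &s[..idx];
        }
    }
    s
}
```

Like the original example, this returns an empty slice when the input starts with whitespace; it only broadens what counts as a word separator.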

Is this a reasonable update to slip into the next edition of the book, or do you think this cracks open the Pandora's box of Unicode a little too much?

JennDahm, Sep 13 '21 07:09

I like @JennDahm's proposed solution: it does not draw attention away from the slice concept, and it does not spark a bunch of questions about why one would iterate over an abstract collection of characters byte by byte.

skwasniak, Sep 22 '21 07:09

I wanted to add my 👍. This code snippet was actually disorienting: as soon as I saw as_bytes, my mind immediately segfaulted with a "But you can't do that!" error. It disrupted my ability to process the rest of the chapter until I got to the note discussing UTF-8 boundaries.

iloving, Nov 26 '22 18:11