rev_lines icon indicating copy to clipboard operation
rev_lines copied to clipboard

Issues with non-ASCII characters?

Open mgsloan opened this issue 6 years ago • 5 comments

Hey, I was in the process of copy+modifying this code to instead iterate backwards on all whitespace. However, I realized that this doesn't handle UTF-8 properly. Since UTF-8 is a variable length encoding, seeking by blocks of bytes can lead to getting a chunk that starts in the middle of a UTF-8 codepoint.

here, the code will panic if it encounters invalid UTF-8 (due to use of unwrap). Due to splitting by blocks, even valid UTF-8 could cause a panic. Generally it's not good to panic like that in a library.

I'm betting that \n and \r are valid bytes for parts of multi-byte UTF-8 codepoints, so newlines might also be incorrectly identified.

I think it would be good to either have it decode UTF-8 in reverse properly, or change it to only deal with ASCII and document that. Or, at least, add a note to the docs about this problem.

mgsloan avatar Aug 15 '19 03:08 mgsloan

Definitely think it would be good to properly handle UTF-8 and not cause panics. Feel free to open a PR with changes!

mjc-gh avatar Aug 15 '19 11:08 mjc-gh

I might end up implementing this, but not anytime very soon. It would probably be a new crate, because my usecase is breaking on something other than lines.

It might be good to note the issue in the crate's docs? Perhaps it's no big deal, seems like there are no crates that depend on it. However, I did find some GitHub projects that use this crate https://github.com/search?utf8=✓&q=rev_lines+filename%3ACargo.toml&type=Code&ref=advsearch&l=&l= https://github.com/search?utf8=%E2%9C%93&q=rev_lines+filename%3ACargo.toml&type=Code&ref=advsearch&l=&l= , that are probably not aware of these problems.

On Thu, Aug 15, 2019 at 5:32 AM Michael Coyne [email protected] wrote:

Definitely think it would be good to properly handle UTF-8 and not cause panics. Feel free to open a PR with changes!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mjc-gh/rev_lines/issues/3?email_source=notifications&email_token=AAAFQKRIZQTGJENNMLNMBILQEU5DPA5CNFSM4IL2XB7KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4LSL6A#issuecomment-521610744, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAFQKQKSEEXODMKZKGRGFLQEU5DPANCNFSM4IL2XB7A .

mgsloan avatar Aug 16 '19 04:08 mgsloan

The unwrap is definitely an issue, because the code will crash on invalid utf8. However I'm not sure what's the proper way in rust to for an iterator to indicate failure when next() is invoked. One option would be stop reading that file.. Meaning,

- Some(String::from_utf8(result).unwrap())
+ String::from_utf8(result).ok()

Which will return a Some if the utf-8 decoding went well, and a None (stop iterating) otherwise.

But right now I think that this still cannot crash on valid utf-8. If I understand correctly the code searches for \r and \n boundaries. these are valid utf-8 code points by themselves, since they are valid ascii. And they can not appear as individual bytes of another utf-8 code point, because in utf-8, if a byte has 0 as the first bit (which is the case for all of ASCII), then it's a single byte code point. In all other cases (individual bytes of a multibyte code point) the first bit is 1.

Still, the invalid utf-8 issue is worth thinking I think. What do you think about returning None in these cases to stop the iteration?

emmanueltouzery avatar Feb 03 '20 18:02 emmanueltouzery

Returning None in this case and stopping iteration sounds reasonable to me. Happy to merge a PR if you want to make this change.

mjc-gh avatar Feb 06 '20 13:02 mjc-gh

Newlines wont be incorrectly identified inside multi byte characters as UTF-8 is backwards compatible with ASCII in such a way that multi byte codes always start with MSB set (all multi byte bytes are 128 - 255).

As for the error handling on invalid UTF-8, as long as the RevLines iterator is of type String, I think it would be reasonable to stop iterating if invalid UTF-8 is encountered. RevLines could be more reusable for other character encodings if it was implemented as an iterator over &[u8] instead, but then the newline character could possibly have a different LF_BYTE or CR_BYTE, and it would be ambiguous what to iterate over. It would be nice to include this in the documentation though.

magnusdr avatar Sep 17 '21 20:09 magnusdr

I don't like the idea of breaking early on non-utf8 content. The solution I saw in https://github.com/mjc-gh/rev_lines/pull/6 doesn't differentiate between file being over and some non-utf8 char in the middle of the file.

For my use case, I just want to skip such lines, but proceed with the rest of the file.

Here's my jab at a solution: https://github.com/mjc-gh/rev_lines/pull/8. For now, it leaves the current behavior of RevLines untouched, but introduces RevByteLines which yields Vec<u8> instead (which is basically free, as it needs to be created in next() either way) and lets the user decide what to do with it.

For example, in my case, I'll do something like:

let file = BufReader::new(Cursor::new(vec![
    b'G', b'H', b'I', b'J', b'K', b'\n',
    b'X', 252, 253, 254, b'Y', b'\n',
    b'A', b'B', b'C', b'D', b'E', b'F']));
let rev_byte_lines = RevByteLines::new(file)?;

for line in rev_byte_lines {
    // String::from_utf8_lossy could also work for some use cases
    match String::from_utf8(line) {
        Ok(line) => println!("{}", line),
        Err(e) => println!("Error: {}", e),
    }
}

tomkarw avatar Apr 27 '23 21:04 tomkarw

Better support for UTF8 has been merged in. I am going to fmt and update the library version and do a release. Closing this issue for now.

mjc-gh avatar May 10 '23 17:05 mjc-gh