coreutils icon indicating copy to clipboard operation
coreutils copied to clipboard

cksum: Accept non UTF-8 inputs

Open RenjiSann opened this issue 1 year ago • 3 comments

Fixes #6573

RenjiSann avatar Aug 01 '24 22:08 RenjiSann

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Aug 02 '24 09:08 github-actions[bot]

@BenWiederhake Could you take a look ? :)

RenjiSann avatar Aug 17 '24 09:08 RenjiSann

@sylvestre all your comments are addressed in #6654. If that's still an issue, please tell me, I will refactor the code accordingly.

RenjiSann avatar Aug 26 '24 13:08 RenjiSann

The other PR doesn't address sylvestre's points either (e.g. the linux-only issue)

The test is linux-only because os_str_as_bytes will automatically fail if given non-UTF-8 characters on windows. This is due to the fact that there is no safe provided way to convert a OsStr to &[u8] on windows, and we are required to go through an intermediary &str conversion.

Improving the handling of windows filesystem exceptions this might be a further enhancement, but I don't think this should be the main point of this PR.

and seems to be a collection of many different PRs, which makes it unnecessarily difficult to review,

True, #6654 is a really big refactor, which solves many problems at once. I think about re-organizing all the changes in smaller PRs, while trying to keep the reasoning clear.

RenjiSann avatar Sep 17 '24 12:09 RenjiSann

So which PR would you like us to review?

Let's review and merge this one first, I eventually added the changes requested which I originally put in #6654. Next, I will work on splitting #6654 into reasonably-sized changes, and try to document things correctly

RenjiSann avatar Sep 17 '24 13:09 RenjiSann

As requested, I changed the code to handle non-UTF8 filenames, and I added a test for it.

Previously, I used String::from_utf8_lossy to display the filename on the terminal when needed. I realized this was wrong, because it actually inserts a REPLACEMENT CHARACTER U+FFFD for any non-UTF-8 sequence, when we actually want to omit them completely. At first, I tried to copy-paste the implementation of String::from_utf8_lossy and adapt it, but I figured out, because of unstabilized internals, that this could not be done without copy-pasting even more code from the stdlib.

The compromise I went with is to just remove the U+FFFD chars from the output of String::from_utf8_lossy. It might produce unwanted behavior in case the character appears in the original filename, but I consider this to be unlikely enough to wait to fix it when the involved String's API is stable (MSRV 1.79 iirc).

RenjiSann avatar Sep 17 '24 16:09 RenjiSann

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Sep 17 '24 16:09 github-actions[bot]

GNU testsuite comparison:

Skipping an intermittent issue tests/rm/rm1 (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Sep 18 '24 13:09 github-actions[bot]

@BenWiederhake If you want to take a look, this should finally be mergeable

RenjiSann avatar Oct 11 '24 12:10 RenjiSann

GNU testsuite comparison:

GNU test failed: tests/timeout/timeout. tests/timeout/timeout is passing on 'main'. Maybe you have to rebase?

github-actions[bot] avatar Oct 11 '24 12:10 github-actions[bot]

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
GNU test failed: tests/timeout/timeout. tests/timeout/timeout is passing on 'main'. Maybe you have to rebase?

github-actions[bot] avatar Oct 15 '24 09:10 github-actions[bot]

I closing this PR in favor of these 2 new ones :

  • comments handling (#6782)
  • non-UTF-8 handling (#6793)

RenjiSann avatar Oct 18 '24 23:10 RenjiSann