
b3sum can't read hashes from file with non-unix endings

Open megapro17 opened this issue 3 years ago • 5 comments

I'm using PowerShell:

b3sum * > hash.txt
b3sum -c hash.txt

Output for each file is: : FAILED (Syntax error in file name, folder name or volume label. (Os error 123))

But after running busybox dos2unix hash.txt, it works correctly.

Similar issue: https://github.com/BLAKE3-team/BLAKE3/issues/108

megapro17 avatar Jan 19 '22 15:01 megapro17

I'm not very familiar with PS, but I can repro this bug on my Windows box. It looks like it's not (only) a line endings issue, but actually a Unicode encoding issue, UTF-8 vs UTF-16. Here's how you can see it:

# I've prepared a "test" directory with two files, "a" and "b"
PS C:\Users\oconn\tmp> b3sum test\*
5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/a
5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/b

# We can see that b3sum's output is UTF-8, using Python.
PS C:\Users\oconn\tmp> python
Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> subprocess.run("b3sum test\*", shell=True, stdout=subprocess.PIPE).stdout
b'5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/a\n5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/b\n'
>>> _.decode("utf-8")
'5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/a\n5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/b\n'

# However, that's not what we get if we redirect the output in PowerShell.
# It looks like PowerShell reencodes the output as UTF-16.
PS C:\Users\oconn\tmp> b3sum test\* > out.txt
PS C:\Users\oconn\tmp> python
Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("out.txt", "rb").read()
b'\xff\xfe5\x00d\x003\x00b\x004\x001\x001\x004\x003\x00c\x00f\x007\x003\x00b\x005\x005\x00e\x00c\x00f\x004\x00d\x009\x000\x00f\x002\x00c\x00e\x000\x00b\x00f\x001\x00d\x003\x00f\x008\x00e\x004\x00b\x003\x000\x005\x00d\x004\x005\x00d\x009\x00f\x009\x003\x006\x001\x00a\x007\x004\x006\x00e\x00b\x000\x009\x000\x004\x000\x00f\x000\x00 \x00 \x00t\x00e\x00s\x00t\x00/\x00a\x00\r\x00\n\x005\x00d\x003\x00b\x004\x001\x001\x004\x003\x00c\x00f\x007\x003\x00b\x005\x005\x00e\x00c\x00f\x004\x00d\x009\x000\x00f\x002\x00c\x00e\x000\x00b\x00f\x001\x00d\x003\x00f\x008\x00e\x004\x00b\x003\x000\x005\x00d\x004\x005\x00d\x009\x00f\x009\x003\x006\x001\x00a\x007\x004\x006\x00e\x00b\x000\x009\x000\x004\x000\x00f\x000\x00 \x00 \x00t\x00e\x00s\x00t\x00/\x00b\x00\r\x00\n\x00'
>>> _.decode("utf-16")
'5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/a\r\n5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/b\r\n'

We can see at the end there that there are Windows-style newlines, and I expect those would cause a problem in b3sum. But we never get to that problem, because we try to decode the checkfile as UTF-8 and fail immediately.

I'm not sure what the best workaround is for this. Is there a way to tell PowerShell to redirect raw bytes?
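One possible workaround outside PowerShell is to normalize the checkfile after the fact. The following is a minimal sketch, not part of b3sum: the function name `normalize_checkfile_bytes` is hypothetical, and it assumes the file is either UTF-16 with a BOM (as in the redirect above) or UTF-8.

```python
import codecs


def normalize_checkfile_bytes(raw: bytes) -> bytes:
    """Return UTF-8, LF-terminated bytes for a checkfile that may be UTF-16 and/or CRLF.

    Hypothetical helper for illustration; not part of b3sum.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        text = raw.decode("utf-16")
    else:
        # "utf-8-sig" also tolerates an optional UTF-8 BOM.
        text = raw.decode("utf-8-sig")
    # Caveat: this also rewrites a file name that genuinely ends in "\r".
    return text.replace("\r\n", "\n").encode("utf-8")


line = "5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/a"
# Rebuild the kind of bytes PowerShell wrote above: BOM + UTF-16-LE + CRLF.
utf16_crlf = codecs.BOM_UTF16_LE + (line + "\r\n").encode("utf-16-le")
normalized = normalize_checkfile_bytes(utf16_crlf)
print(normalized)
```

Writing the returned bytes back to the checkfile in binary mode should give b3sum -c the same content that a plain byte redirect would have produced.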

oconnor663 avatar Jan 19 '22 20:01 oconnor663

Which version of PowerShell are you using? I'm running the latest PowerShell 7.2.1, and it seems to output UTF-8 correctly:

b3sum *.txt > sum.hash
b3sum -c sum.hash
: FAILED (Syntax error in file name, folder name or volume label. (Os error 123))
: FAILED (Syntax error in file name, folder name or volume label. (Os error 123))
python
Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("sum.hash", "rb").read()
b'd63bd9a826af91c1fea371965a64e11ee20f13e46b5f52c59901136605b3a487  1.txt\r\n813e9b729141e7f385afa0a2d0df3e6c3789e427ffe4aeef566a565bc8f2fe3d  2.txt\r\n'
>>> _.decode("utf-16")
'㙤戳㥤㡡㘲晡ㄹㅣ敦㍡ㄷ㘹愵㐶ㅥ攱㉥昰㌱㑥戶昵㈵㕣㤹\u3130㌱㘶㔰㍢㑡㜸†⸱硴൴㠊㌱㥥㝢㤲㐱攱昷㠳愵慦愰搲搰㍦㙥㍣㠷改㈴昷敦愴敥㕦㘶㕡㔶换昸昲㍥\u2064㈠琮瑸\u0a0d'
...
>>> _.decode("utf-8")
'd63bd9a826af91c1fea371965a64e11ee20f13e46b5f52c59901136605b3a487  1.txt\r\n813e9b729141e7f385afa0a2d0df3e6c3789e427ffe4aeef566a565bc8f2fe3d  2.txt\r\n'
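A plausible reading of this result, which is my assumption and not something confirmed here: if the checkfile is split on "\n" only, the "\r" from each CRLF stays attached to the file name, and Windows then rejects that name with OS error 123. The sketch below uses a hypothetical `split_lf_only` helper and an assumed "digest, two spaces, name" layout to show the leftover character.

```python
def split_lf_only(data: bytes) -> list[str]:
    """Split checkfile bytes on LF only and return the name field of each line.

    Hypothetical helper; it only imitates a parser that ignores CR.
    """
    names = []
    for line in data.decode("utf-8").split("\n"):
        if not line:
            continue  # skip the empty piece after the final newline
        # Assumed layout: hex digest, two spaces, then the file name.
        _digest, name = line.split("  ", 1)
        names.append(name)
    return names


crlf = b"d63bd9a826af91c1fea371965a64e11ee20f13e46b5f52c59901136605b3a487  1.txt\r\n"
lf = crlf.replace(b"\r\n", b"\n")

print([repr(n) for n in split_lf_only(crlf)])  # the name keeps a trailing CR
print([repr(n) for n in split_lf_only(lf)])    # the name is clean
```

A stray "\r" printed to the console would also move the cursor back to the start of the line, which might be why the output above begins with ": FAILED" and shows no file name.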

But with cmd it outputs a correct file:

 ERROR megapro17@megapro17-pc  R:  test  cmd
Microsoft Windows [Version 10.0.22000.466]
(c) Microsoft Corporation. All rights reserved.

R:\test>b3sum *.txt > sum.hash

R:\test>b3sum -c sum.hash
1.txt: OK
2.txt: OK

R:\test>python
Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("sum.hash", "rb").read()
b'd63bd9a826af91c1fea371965a64e11ee20f13e46b5f52c59901136605b3a487  1.txt\n813e9b729141e7f385afa0a2d0df3e6c3789e427ffe4aeef566a565bc8f2fe3d  2.txt\n'
>>>

> I'm not sure what the best workaround is for this. Is there a way to tell PowerShell to redirect raw bytes?

Just adding the ability to read any line endings would do, because it's possible to create a txt file in Notepad, paste hashes into it, and it will still not work.

megapro17 avatar Jan 23 '22 22:01 megapro17

> Just adding ability to read any file endings.

This is a reasonable idea, but I want to clarify that it's backwards-incompatible with what b3sum currently does. File names are allowed to contain the \r carriage return character, and b3sum (like md5sum) prints that character without escaping. Because of that, stripping a trailing \r\n from each line could change the meaning of a currently valid checkfile. To do this properly, we'd probably need to start escaping \r just like we currently escape \n.
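To make the ambiguity concrete, here is a minimal sketch. The `escape_name` and `unescape_name` functions are hypothetical and are not b3sum's actual format (any line-level marker for escaped names is omitted), and the digest is a placeholder. Without escaping, a file named "a\r" on an LF line is byte-for-byte identical to a file named "a" on a CRLF line. With "\r" escaped, the two stay distinct.

```python
def escape_name(name: str) -> str:
    # Escape the backslash first so the escapes added below are not escaped again.
    return name.replace("\\", "\\\\").replace("\n", "\\n").replace("\r", "\\r")


def unescape_name(escaped: str) -> str:
    mapping = {"\\": "\\", "n": "\n", "r": "\r"}
    out, i = [], 0
    while i < len(escaped):
        ch = escaped[i]
        if ch == "\\" and i + 1 < len(escaped) and escaped[i + 1] in mapping:
            out.append(mapping[escaped[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


digest = "0" * 64  # placeholder, not a real BLAKE3 hash

# Unescaped: the two different situations produce the same line.
unescaped_cr_name = digest + "  " + "a\r" + "\n"   # file "a\r", LF ending
unescaped_crlf = digest + "  " + "a" + "\r\n"      # file "a", CRLF ending
print(unescaped_cr_name == unescaped_crlf)

# Escaped: the lines differ, so a trailing CRLF could be stripped safely.
escaped_cr_name = digest + "  " + escape_name("a\r") + "\n"
escaped_crlf = digest + "  " + escape_name("a") + "\r\n"
print(escaped_cr_name == escaped_crlf)
```

Old checkfiles written with a raw "\r" in a name would still be ambiguous under the new rule, which is the backwards-compatibility cost described above.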

oconnor663 avatar Feb 26 '22 03:02 oconnor663

@oconnor663 ,

> [...] the \r carriage return character, and b3sum (like md5sum) prints that character without escaping

On Debian 12, I find that GNU coreutils (md5sum/sha*sum/b2sum) do escape CR to "\r", though I also find that busybox *sum and perl shasum cannot verify those coreutils checksums.

I found a relevant commit here, with some reasoning about it, also involving Windows.

Anyway, it seemed relevant to mention here.

n8w8 avatar Sep 27 '23 12:09 n8w8

I would like to cross reference:

(1) blake3 / incompatibilities with cli (original) b3sum implementation - Total Commander https://ghisler.ch/board/viewtopic.php?t=80593

AnselmD avatar Nov 26 '23 12:11 AnselmD