Add an option in the zstd CLI to verify that a given .zst file matches an uncompressed file

Open FRex opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe.
I'd like a switch or two in the CLI to verify that a given .zst file is the compressed version of a given uncompressed file, by comparing the hash and/or the content.

Describe the solution you'd like
A new switch or two that would look and work like this:

$ zstd --verify somefile.txt.zst somefile.txt
somefile.txt.zst and somefile.txt have matching xxhash.

$ zstd --verify-data somefile.txt.zst somefile.txt
somefile.txt.zst and somefile.txt have matching data.

It'd first check the filesize, and then hash/data (no point doing the latter if filesize doesn't match).

Describe alternatives you've considered
I've considered writing my own C or Python program to do this, but I think it would fit as part of the zstd CLI and be useful in general. The zstd CLI also already has all the functionality: file IO, parsing zstd frames, xxhash, etc. Also, zstd -l does display the frame count, the sizes (human-readable, not down to bytes), and that xxhash was used, but it does not tell me the 4 low bytes of the 64-bit xxhash, so I can't use that with xxhash myself either.

Additional context
My use case is that I often work with big text files that I get as .zst, and sometimes I modify them. When I need to free up some space I delete some of the unmodified files, but I wouldn't want to delete a modified one. This option would let me check whether a given file and its .zst counterpart are the 'same', and so whether it's safe to delete the uncompressed one.

Another use case could be someone who is paranoid and wants to verify the compressed file. Maybe it could also be part of some extra --rm option for very careful people (I don't know whether --rm currently verifies that the written file is correct).

FRex, Oct 12 '22 18:10

What about:

zstd -d -c FILE.zst | cmp FILE -
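For use from a script, the same comparison can be sketched in Python. This is only an illustration: stream_equals_file is a hypothetical helper, and it assumes the caller supplies the decompressed bytes as a binary stream (for example the stdout pipe of a zstd -d -c FILE.zst subprocess). It stops at the first difference, as cmp does.

```python
def _read_exact(stream, n):
    """Read up to n bytes, looping because pipes may return short reads."""
    parts = []
    while n > 0:
        chunk = stream.read(n)
        if not chunk:  # end of stream
            break
        parts.append(chunk)
        n -= len(chunk)
    return b"".join(parts)


def stream_equals_file(stream, path, chunk_size=1 << 16):
    """Return True if the bytes read from `stream` equal the contents of `path`.

    `stream` is any binary file-like object (hypothetical input: the
    decompressed output of zstd). Comparison stops at the first mismatch.
    """
    with open(path, "rb") as f:
        while True:
            expected = f.read(chunk_size)
            if expected:
                actual = _read_exact(stream, len(expected))
            else:
                # File is exhausted: the stream must be exhausted too.
                actual = stream.read(1)
            if actual != expected:
                return False
            if not expected:
                return True
```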

Cyan4973, Oct 17 '22 21:10

If FILE's size changed (very common when editing text), it will do needless work (potentially a lot, if the change is deep into the file) instead of just checking the size of FILE on disk against the size stored in FILE.zst.

Right now -v -l reports the original file size but not the hash (it only says Check: XXH64). If it printed the hash, that would enable writing a script that does everything I described and more.
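The size check on its own needs no decompression, because the frame header can carry the original size. A minimal Python sketch follows, assuming the frame header layout from the Zstandard format specification (RFC 8878). read_frame_header and size_matches are hypothetical helpers, not part of zstd, and only the first frame is inspected, so the result is meaningful for single-frame files only.

```python
import os

ZSTD_MAGIC = 0xFD2FB528
MAX_HEADER = 18  # magic 4 + descriptor 1 + window 1 + dict ID 4 + content size 8


def read_frame_header(data):
    """Parse the header of the first zstd frame in `data`.

    Returns (content_size or None, has_checksum, header_length).
    """
    if len(data) < 5 or int.from_bytes(data[:4], "little") != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    fhd = data[4]                      # Frame_Header_Descriptor
    fcs_flag = fhd >> 6                # bits 7-6: Frame_Content_Size_flag
    single_segment = (fhd >> 5) & 1    # bit 5
    has_checksum = bool((fhd >> 2) & 1)  # bit 2: Content_Checksum_flag
    dict_id_len = (0, 1, 2, 4)[fhd & 3]  # bits 1-0: Dictionary_ID_flag
    pos = 5
    if not single_segment:
        pos += 1                       # Window_Descriptor is present
    pos += dict_id_len
    fcs_len = (1 if single_segment else 0, 2, 4, 8)[fcs_flag]
    if len(data) < pos + fcs_len:
        raise ValueError("truncated frame header")
    if fcs_len == 0:
        size = None                    # content size was not stored
    else:
        size = int.from_bytes(data[pos:pos + fcs_len], "little")
        if fcs_len == 2:
            size += 256                # the 2-byte form is stored with an offset
    return size, has_checksum, pos + fcs_len


def size_matches(zst_path, plain_path):
    """True/False if the stored size equals the plain file's size, None if unknown."""
    with open(zst_path, "rb") as f:
        size, _, _ = read_frame_header(f.read(MAX_HEADER))
    if size is None:
        return None
    return size == os.path.getsize(plain_path)
```

A script could call size_matches first and fall back to a hash or byte comparison only when it returns True or None.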

FRex, Oct 17 '22 21:10

OK, so you are looking for zstd -lv to report the actual value of the content hash, not just the fact that it exists. This is likely achievable.

Cyan4973, Oct 17 '22 21:10

That would enable scripts to use the zstd CLI to do what I mentioned. The only problem I can imagine is a multi-frame file with a hash per frame.

FRex, Oct 17 '22 21:10

@Cyan4973

I've written the code to add printing checksum for single frame files with -v -l here: https://github.com/facebook/zstd/compare/dev...FRex:zstd:feature-zstd-cli-print-xxhash

If you find it acceptable, I can create a PR, or you can just copy-paste it. If it's not acceptable, let me know what I can fix to make it fit the requirements.

I skimmed https://github.com/facebook/zstd/blob/dev/CONTRIBUTING.md and tried to run make test but it contains (unrelated) errors in ../lib/dictBuilder/divsufsort.c that also happen in current dev branch.

I don't have any idea how to print this hash per frame (and I personally don't think I need it; even for multi-gig files zstd seems to produce a single-frame file?).

Here's an example of running it (notice 618814bc in the output):

$ xxhsum.exe /e/one.txt ; echo ;  ./programs/zstd.exe -v -l /e/one.txt.zst /e/two.txt.zst
695a3bd7618814bc  E:/one.txt

*** zstd command line interface 64-bits v1.5.3, by Yann Collet ***
E:/one.txt.zst
# Zstandard Frames: 1
DictID: 0
Window Size: 194 KiB (198246 B)
Compressed Size: 38 B (38 B)
Decompressed Size: 194 KiB (198246 B)
Ratio: 5217.0000
Check: XXH64 618814bc

E:/two.txt.zst
# Zstandard Frames: 2
DictID: 0
Window Size: 194 KiB (198246 B)
Compressed Size: 76 B (76 B)
Decompressed Size: 387 KiB (396492 B)
Ratio: 5217.0000
Check: XXH64
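The printed value can be matched against xxhsum in a script. The Python sketch below assumes what the run above shows and what the format specification describes: the frame stores the low 4 bytes of the XXH64 digest in little-endian order. It also assumes a single-frame file with the checksum enabled and nothing appended after the frame. Both function names are hypothetical.

```python
def low32_of_xxh64(xxh64_hex):
    """Low 32 bits of a 64-bit xxhash given as a hex string (as printed by xxhsum)."""
    return int(xxh64_hex, 16) & 0xFFFFFFFF


def stored_checksum(zst_tail):
    """Decode the last 4 bytes of a single-frame .zst file (little-endian)."""
    if len(zst_tail) != 4:
        raise ValueError("expected exactly 4 bytes")
    return int.from_bytes(zst_tail, "little")


full_hash = "695a3bd7618814bc"  # xxhsum output for one.txt in the run above
print(f"{low32_of_xxh64(full_hash):08x}")  # prints 618814bc
```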

FRex, Oct 20 '22 19:10

> I've written the code to add printing checksum for single frame files with -v -l here: https://github.com/facebook/zstd/compare/dev...FRex:zstd:feature-zstd-cli-print-xxhash

I like it, this is a good PR

> I skimmed https://github.com/facebook/zstd/blob/dev/CONTRIBUTING.md and tried to run make test but it contains (unrelated) errors in ../lib/dictBuilder/divsufsort.c that also happen in current dev branch.

It's obviously unrelated to your work. And it's surprising. divsufsort.c is a hosted third-party library. We generally don't touch it, except for some minor edits to pass some stringent compilation warnings. It's tested as part of our CI, so we would have expected to catch these issues before they reached your side. If necessary, we could take a look in order to fix the issues you experienced, but that's a separate effort.

> I don't have any idea for how to print this hash per frame (and I personally don't think I need it, even for multi-gig files zstd seems to produce single frame file?).

The normal scenario is one single frame, whatever the size of the input. Multi-frame files are a more advanced scenario. Typically, they appear when the content is produced in multiple sessions, or watermarks are added, or random access capabilities are added, etc.

It's fine if your PR only solves the "1-frame" scenario, it's the more important one, and the one that solves your issue.

Cyan4973, Dec 02 '22 23:12

I'm happy to hear the PR is good. Would you like to merge it? Do I need to reassign the copyright to you or Facebook for this purpose?

I noticed that concatenating two zst files with cat creates a multi-frame zst file that decompresses to the two original files, concatenated. I guess it can be useful for concatenating files without recompression in between.
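A multi-frame file like that can be walked frame by frame to list each frame's stored checksum, without decompressing anything. This is a minimal Python sketch, assuming the frame and block header layout from the Zstandard format specification (RFC 8878). frame_checksums is a hypothetical helper, not part of zstd, and it does not verify the checksums it reports.

```python
ZSTD_MAGIC = 0xFD2FB528
SKIPPABLE_MAGIC = 0x184D2A50  # skippable frames use 0x184D2A50..0x184D2A5F
SKIPPABLE_MASK = 0xFFFFFFF0


def frame_checksums(data):
    """Return one entry per zstd frame in `data`: 8 hex digits, or None if absent."""
    out = []
    pos = 0
    while pos < len(data):
        magic = int.from_bytes(data[pos:pos + 4], "little")
        if len(data) - pos >= 8 and (magic & SKIPPABLE_MASK) == SKIPPABLE_MAGIC:
            # Skippable frame: magic, 4-byte size, then that many bytes of user data.
            size = int.from_bytes(data[pos + 4:pos + 8], "little")
            pos += 8 + size
            continue
        if len(data) - pos < 5 or magic != ZSTD_MAGIC:
            raise ValueError(f"no zstd frame at offset {pos}")
        fhd = data[pos + 4]  # Frame_Header_Descriptor
        single_segment = (fhd >> 5) & 1
        pos += 5
        pos += 0 if single_segment else 1                      # Window_Descriptor
        pos += (0, 1, 2, 4)[fhd & 3]                           # Dictionary_ID
        pos += (1 if single_segment else 0, 2, 4, 8)[fhd >> 6]  # Frame_Content_Size
        while True:
            if pos + 3 > len(data):
                raise ValueError("truncated frame")
            header = int.from_bytes(data[pos:pos + 3], "little")
            last = header & 1
            block_type = (header >> 1) & 3
            block_size = header >> 3
            if block_type == 3:
                raise ValueError("reserved block type")
            # An RLE block (type 1) stores a single byte regardless of block_size.
            pos += 3 + (1 if block_type == 1 else block_size)
            if last:
                break
        if fhd & 4:  # Content_Checksum_flag
            # Stored little-endian; reverse the bytes to print it as a number.
            out.append(data[pos:pos + 4][::-1].hex())
            pos += 4
        else:
            out.append(None)
        if pos > len(data):
            raise ValueError("truncated frame")
    return out
```

For a file made by concatenating two single-frame files, the returned list would have two entries, one per original file.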

FRex, Dec 06 '22 16:12

> I noticed concatenating two zst files with cat creates a multi-frame zst file that uncompresses to original two files, concatenated. I guess it can be useful for concatenating files without recompression in between.

Yes, a classic scenario would be an append-only database, such as a logging system, where new content is added every day or every hour. It's generally easier to simply append a new frame to the same file.

Cyan4973, Dec 06 '22 17:12

> I'm happy to hear the PR is good. Would you like to merge it? Do I need to reassign (c) to you or Facebook for this purpose?

Generally, the authors of the patches push the PR themselves. In this case, though, I created https://github.com/facebook/zstd/pull/3332, which tracks your patch from your fork.

Cyan4973, Dec 06 '22 18:12

Thank you. Is there anything else I have to do?

FRex, Dec 06 '22 20:12

A number of CI tests have been failing on #3332, but they don't seem related to the patch itself. Just give us some time to sort that out.

Cyan4973, Dec 06 '22 20:12

Patch merged

Cyan4973, Dec 20 '22 01:12