Incomplete index for .tar.zst

Open powellnorma opened this issue 6 months ago • 11 comments

I created the .tar.zst with https://github.com/SaveTheRbtz/zstd-seekable-format-go - ratarmount logs:

[Info] Try to open with rarfile
[Info] Try to open with tarfile
[Info] Detected compression zst for file object: <_io.BufferedReader name='archive.tar.zst'>
[Info] Undid zst file compression by using: IndexedZstdFile
Creating offset dictionary for archive.tar.zst ...
Position 902802605 of 112968843995 (0.80%). Remaining time: 8 min 8 s (current rate), 8 min 8 s (average rate). Spent time: 0 min 3 s
Position 1358774736 of 112968843995 (1.20%). Remaining time: 8 min 40 s (current rate), 8 min 17 s (average rate). Spent time: 0 min 6 s
Position 3794647863 of 112968843995 (3.36%). Remaining time: 7 min 50 s (current rate), 7 min 56 s (average rate). Spent time: 0 min 16 s
Position 5429518921 of 112968843995 (4.81%). Remaining time: 10 min 47 s (current rate), 8 min 42 s (average rate). Spent time: 0 min 26 s
Position 5916237504 of 112968843995 (5.24%). Remaining time: 7 min 19 s (current rate), 8 min 33 s (average rate). Spent time: 0 min 28 s
Position 6294844661 of 112968843995 (5.57%). Remaining time: 9 min 25 s (current rate), 8 min 35 s (average rate). Spent time: 0 min 30 s
Position 6720720693 of 112968843995 (5.95%). Remaining time: 8 min 32 s (current rate), 8 min 33 s (average rate). Spent time: 0 min 32 s
Position 7219378249 of 112968843995 (6.39%). Remaining time: 8 min 3 s (current rate), 8 min 28 s (average rate). Spent time: 0 min 34 s
Position 7689523853 of 112968843995 (6.81%). Remaining time: 7 min 42 s (current rate), 8 min 23 s (average rate). Spent time: 0 min 36 s
Position 7854005199 of 112968843995 (6.95%). Remaining time: 21 min 19 s (current rate), 8 min 39 s (average rate). Spent time: 0 min 38 s
Position 8305422074 of 112968843995 (7.35%). Remaining time: 7 min 44 s (current rate), 8 min 34 s (average rate). Spent time: 0 min 40 s
Position 9315929959 of 112968843995 (8.25%). Remaining time: 7 min 49 s (current rate), 8 min 24 s (average rate). Spent time: 0 min 45 s
Position 10249960142 of 112968843995 (9.07%). Remaining time: 7 min 42 s (current rate), 8 min 16 s (average rate). Spent time: 0 min 49 s
Position 10966962580 of 112968843995 (9.71%). Remaining time: 7 min 15 s (current rate), 8 min 9 s (average rate). Spent time: 0 min 52 s
Position 11644614155 of 112968843995 (10.31%). Remaining time: 7 min 34 s (current rate), 8 min 4 s (average rate). Spent time: 0 min 55 s
Position 12330155953 of 112968843995 (10.91%). Remaining time: 7 min 19 s (current rate), 7 min 58 s (average rate). Spent time: 0 min 58 s
Position 13223965556 of 112968843995 (11.71%). Remaining time: 7 min 25 s (current rate), 7 min 52 s (average rate). Spent time: 1 min 2 s
Position 15000695226 of 112968843995 (13.28%). Remaining time: 7 min 18 s (current rate), 7 min 41 s (average rate). Spent time: 1 min 10 s
Position 15229189612 of 112968843995 (13.48%). Remaining time: 14 min 15 s (current rate), 7 min 46 s (average rate). Spent time: 1 min 12 s
Resorting files by path ...
Creating offset dictionary for archive.tar.zst took 79.20s
[Info] The index does not yet contain zst block offset data. Will write it out.
Writing out TAR index to archive.tar.zst took 0s and is sized 31584256 B
[Info] Opened archive with tarfile backend.

It looks like it only indexes ~13%. When I use tar -I zstd -tvf archive.tar.zst all files get listed normally. Am I doing something wrong?

powellnorma avatar Jun 11 '25 15:06 powellnorma

Could you try calling ratarmount with --ignore-zeros?

mxmlnkn avatar Jun 11 '25 16:06 mxmlnkn

Theoretically, it could also happen (and I probably should fix that to avoid confusion), that the last file in the TAR is a very large one, so that it skips over it to the end of the TAR archive without updating the progress bar.

mxmlnkn avatar Jun 11 '25 16:06 mxmlnkn

Could you try calling ratarmount with --ignore-zeros?

That helped, thank you!

Does that mean this is a bug in https://github.com/SaveTheRbtz/zstd-seekable-format-go ?

powellnorma avatar Jun 11 '25 17:06 powellnorma

Could you try calling ratarmount with --ignore-zeros?

That helped, thank you!

Does that mean this is a bug in https://github.com/SaveTheRbtz/zstd-seekable-format-go ?

Isn't that tool only for zstd compression? If so, then the answer is probably no. --ignore-zeros ignores zero-blocks in the TAR file itself, which normally mark the end of the TAR file as per the file format. But sometimes, e.g., when concatenating two TARs, you don't want to stop at those zero blocks. That's what this option is for. How did you create the uncompressed TAR itself?
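
A minimal sketch of that effect with Python's tarfile module, which ratarmount's tarfile backend builds on. The file names and helper functions are made up for illustration: two small TARs are concatenated in memory, and the result is listed with and without ignore_zeros.

```python
import io
import tarfile


def make_tar(name: str, payload: bytes) -> bytes:
    """Create an uncompressed TAR with a single member, fully in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_names(data: bytes, ignore_zeros: bool) -> list:
    """List member names, optionally skipping over zero blocks."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:", ignore_zeros=ignore_zeros) as archive:
        return archive.getnames()


# Each TAR ends with zero blocks, so the second one starts after an end-of-archive marker.
concatenated = make_tar("first.txt", b"one") + make_tar("second.txt", b"two")

print(list_names(concatenated, ignore_zeros=False))  # ['first.txt']
print(list_names(concatenated, ignore_zeros=True))   # ['first.txt', 'second.txt']
```

Without ignore_zeros, reading stops at the zero blocks that end the first TAR, so the second member is never seen.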

mxmlnkn avatar Jun 11 '25 17:06 mxmlnkn

Hm, I created it with GNU tar, the command was probably either: sudo tar -C "$src" --xattrs --xattrs-include='*' -S -cvf - . | zstdseek -f - -q 19 -o "$dst"

or: sudo tar --xattrs --xattrs-include='*' -S --zstd -cvf "$dst" .
zstd -dc archive.tar.zst | zstdseek -f - -q 19 -o archive-seekable.tar.zst

powellnorma avatar Jun 11 '25 17:06 powellnorma

Weird, that wouldn't explain zero-blocks. Did you by chance try with -r?
-r, --append: Append files to the end of an archive.
I think this option would lead to such a problem.

mxmlnkn avatar Jun 11 '25 18:06 mxmlnkn

No, after I have created the .tar.zst, I don't touch it - I use it as read only snapshot/archive. The only thing I potentially did was to recompress it via the above command so that the zst archive is seekable.
I wonder why tar -I zstd -tvf archive.tar.zst does not stumble over the zero blocks, since it lists files that ratarmount does not find without --ignore-zeros.

powellnorma avatar Jun 11 '25 19:06 powellnorma

That is indeed weird. Maybe there is some other problem after all. I doubt you can share those archives for testing. Does the same problem occur when mounting the uncompressed TAR with ratarmount? If so, what does this Python script print:

python3 -c 'import sys, tarfile;
[print(tarInfo.sparse, tarInfo.offset, tarInfo.offset_data, tarInfo.size, tarInfo.name)
for tarInfo in tarfile.open(sys.argv[1])]' archive.tar

This script prints the sparsity, offset, data offset, size, and name of all files in the TAR. E.g. for tests/nested-tar.tar from this repository:

None 0 512 0 foo
None 512 1024 0 foo/fighter
None 1024 1536 6 foo/fighter/ufo
None 2048 2560 10240 foo/lighter.tar

If the last offset is much smaller than the TAR size and some names are missing, then the actual data after the last listed file would be interesting. Something like this prints up to the next four 512 B blocks after the foo/lighter.tar file; the last two arguments are the two numbers right before foo/lighter.tar, i.e., its data offset and size:

python3 -c 'import sys; from pprint import pprint;
f=open(sys.argv[1], "rb"); f.seek(int(sys.argv[2]) + int(sys.argv[3]));
pprint(f.read(4 * 512))' tests/nested-tar.tar 2560 10240

In this case, it is as expected only zeros:

(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
[...]
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
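
The same check can be sketched as a small helper. It assumes that the last listed member is not sparse, because for sparse members tarfile reports the logical size, which need not match the length of the stored data. The function name and return value are made up for illustration. It rounds the end of the member data up to the next 512 B boundary and reports which of the following blocks contain only zeros.

```python
import io
import tarfile

BLOCK_SIZE = 512  # TAR headers and member data are aligned to 512 B blocks


def zero_blocks_after_last_member(fileobj, block_count=4):
    """Return (name, end offset, per-block all-zero flags) for the last listed member."""
    fileobj.seek(0)
    with tarfile.open(fileobj=fileobj, mode="r:") as archive:
        last = archive.getmembers()[-1]
    # Member data is padded to a multiple of 512 B, so round the size up.
    padded_size = -(-last.size // BLOCK_SIZE) * BLOCK_SIZE
    end = last.offset_data + padded_size
    fileobj.seek(end)
    data = fileobj.read(block_count * BLOCK_SIZE)
    flags = [not any(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]
    return last.name, end, flags


# Small in-memory demo archive with a single 6 B member.
demo = io.BytesIO()
with tarfile.open(fileobj=demo, mode="w") as archive:
    info = tarfile.TarInfo("ufo")
    info.size = 6
    archive.addfile(info, io.BytesIO(b"iriya\n"))

print(zero_blocks_after_last_member(demo))  # ('ufo', 1024, [True, True, True, True])
```

For a real archive, open it with open("archive.tar", "rb") and pass the file object. A False entry in the flags means there is non-zero data where the end-of-archive marker would be expected.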

I saw that you used sparse files (-S). This is a somewhat rarer and nonstandard feature, which is why I added it to the debug output: it might be the cause of the problem.

mxmlnkn avatar Jun 11 '25 19:06 mxmlnkn

Does the same problem occur when mounting the uncompressed TAR with ratarmount?

Yes it does. It seems to indeed have to do with sparse data, here are the last 2k bytes:

https://send.mni.li/download/d2cf47d4476cb829/#_EIMruDccGiPbO4M_OkyDg

powellnorma avatar Jun 11 '25 21:06 powellnorma

I have downloaded the file and I see the sparse information in there, but from this alone, I am currently not able to reproduce it. It also seems to be cut off. I'll have to see whether I can create a TAR from scratch, e.g., with fallocate and calling tar -S on it, to reproduce the problem.

If the problem also occurs with the Python test script above for listing the files, then it would become interesting how it behaves for different Python versions if you have access to other versions. And, at that point, the issue probably should be filed at the CPython project, which tarfile is a part of.

mxmlnkn avatar Jun 11 '25 22:06 mxmlnkn

If the problem also occurs with the Python test script above for listing the files, then it would become interesting how it behaves for different Python versions if you have access to other versions

It does: files are missing from that list created by:

python3 -c 'import sys, tarfile; [print(tarInfo.sparse, tarInfo.offset, tarInfo.offset_data, tarInfo.size, tarInfo.name) for tarInfo in tarfile.open(sys.argv[1])]' archive.tar

I tested with Python 3.13.3, 3.12.9, and 3.7.9, same result for all.

powellnorma avatar Jun 11 '25 23:06 powellnorma

I tried to create a reproducer with:

#!/bin/bash

folder="sparse_files"
mkdir -p -- "$folder"

filesCount=1000

for i in $( seq $filesCount ); do
    fileName="$folder/sparse-$i"
    echo "Create $fileName"
    fileSize=$(( RANDOM % 16 ))
    base64 /dev/urandom | head -c $(( fileSize * 1024 * 1024 )) > "$fileName"

    if [[ $fileSize -eq 0 ]]; then continue; fi

    for j in $( seq 16 ); do
        holeOffset=$(( RANDOM % fileSize ))
        fallocate --punch-hole --offset "${holeOffset}MiB" --length "1MiB" "$fileName"
    done
done

tar -C "$folder" --xattrs --xattrs-include='*' -S -cf - . | zstdseek -f - -q 19 -o "$folder".tar.zst

but failed to reproduce the problem.

mxmlnkn avatar Jul 03 '25 19:07 mxmlnkn

base64 /dev/urandom | head -c $(( fileSize * 1024 * 1024 )) > "$fileName"

Maybe the files also need "genuine" zero blocks? What if instead of base64, one stores just zeros (with additional sparse areas in them)?

I can still reproduce the issue with just the .tar file, if you want me to share the output of any further commands, let me know.

powellnorma avatar Jul 03 '25 21:07 powellnorma

Using /dev/zero instead does not help. Note that the fallocate --punch-hole option zeros out the specified ranges to make them sparse.

It would be ideal if you could create a shareable reproducer. Maybe run the script:

python3 -c 'import sys, tarfile;
[print(tarInfo.sparse, tarInfo.offset, tarInfo.offset_data, tarInfo.size, tarInfo.name)
for tarInfo in tarfile.open(sys.argv[1])]' archive.tar

to determine the last working file. Then run the script with the ignore zeros option:

python3 -c 'import sys, tarfile;
[print(tarInfo.sparse, tarInfo.offset, tarInfo.offset_data, tarInfo.size, tarInfo.name)
for tarInfo in tarfile.open(sys.argv[1], ignore_zeros=True)]' archive.tar

Then search in that output for the last working file of the broken run above to find the presumably problematic file right after it.
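
This comparison could also be scripted. A rough sketch, assuming that the normal listing is a prefix of the ignore_zeros listing (the function names are hypothetical):

```python
import tarfile


def member_names(path, ignore_zeros):
    """List all member names that tarfile finds in an uncompressed TAR."""
    with tarfile.open(path, mode="r:", ignore_zeros=ignore_zeros) as archive:
        return [member.name for member in archive]


def find_first_missing(path):
    """Return (last file listed normally, first file only found with ignore_zeros)."""
    strict = member_names(path, ignore_zeros=False)
    lenient = member_names(path, ignore_zeros=True)
    if len(lenient) <= len(strict):
        return None  # Nothing is missing from the normal listing.
    last_listed = strict[-1] if strict else None
    # Assumption: both listings agree up to the point where the normal one stops.
    return last_listed, lenient[len(strict)]
```

Calling find_first_missing("archive.tar") would then return the last working file together with the presumably problematic file right after it, or None if both listings have the same length.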

Then try and create a tar with these two files and check if it reproduces the problem. If not, then I'm completely lost as to how to reproduce this. If these two files are by chance something that can be shared without privacy issues, it would be super helpful.

Or if not, maybe you could check whether there are some weird properties for the presumably problematic file, e.g., is it actually sparse, or is there some other issue with it, e.g., extended file attributes, which you are also trying to store.

mxmlnkn avatar Jul 03 '25 21:07 mxmlnkn

I did a quick test, basically copying from the ratarmounted .tar.zst the last few files that are not missing and the first files that are missing if not using ignore_zeros=True, like you suggested. Unfortunately, it did not reproduce the issue.

But I noticed (du -sh):

19M	./userdata

After I copy it with cp -ra --sparse=always:

200M	./userdata

Maybe it's just that the sparse blocks are not as granular with cp.

The last file, btw, (that is included in both lists) is this userdata file, which is a (sparse) image file of some Android userdata partition.

I will try to unpack the tar, and then recreate it. If it is still reproducible, I can try to trim it down further
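
For reference, the du difference above can also be checked per file by comparing the apparent size with the allocated size. A small sketch, assuming a Unix-like system where os.stat reports st_blocks in units of 512 B (the function names are made up, and filesystem compression could also make the allocated size smaller):

```python
import os


def sizes(path):
    """Return (apparent size, allocated size) in bytes for the given file."""
    status = os.stat(path)
    # st_blocks counts allocated 512 B blocks (Unix only).
    return status.st_size, status.st_blocks * 512


def looks_sparse(path):
    """Heuristic: fewer bytes are allocated than the file claims to contain."""
    apparent, allocated = sizes(path)
    return allocated < apparent
```

Running sizes("userdata") before and after the copy would show whether holes were lost or only became coarser.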

powellnorma avatar Jul 03 '25 22:07 powellnorma

Wow, I can reproduce it with this. Thank you. The second, smaller file is indeed missing with Python tarfile.

mxmlnkn avatar Jul 12 '25 19:07 mxmlnkn

I have opened two issues upstream. Let's see whether this is it. Unfortunately, this means that it could take a while for fixes to become usable in a new Python release. Maybe I could hack some kind of workaround in ratarmount. In the worst case, I could copy-paste the tarfile.py module and patch it, but that seems a bit excessive.

mxmlnkn avatar Jul 12 '25 22:07 mxmlnkn

I have found a duplicate issue for this bug on CPython from over 5 years ago, and there are lots of months-old tarfile PRs still awaiting review. That's why I decided to hotpatch this in ratarmount in addition to opening PRs on CPython. It is one of the advantages of programming in Python that this is possible, even if it can become rather ugly.

You can try out the fixed unreleased version with this:

python3 -m pip install --user --force-reinstall \
    'git+https://github.com/mxmlnkn/ratarmount.git@fix-sparse#egginfo=ratarmountcore&subdirectory=core' \
    'git+https://github.com/mxmlnkn/ratarmount.git@fix-sparse#egginfo=ratarmount'

It would be nice to know whether it fixes your issue.

mxmlnkn avatar Jul 14 '25 21:07 mxmlnkn

Tested it, and it solves my issue. Thanks! 👍

powellnorma avatar Jul 15 '25 11:07 powellnorma