
Recursive mount resource usage

Open jendap opened this issue 3 weeks ago • 7 comments

Let's grab a deb package and mount it with "-r"

wget http://security.ubuntu.com/ubuntu/pool/main/l/linux/linux-source-6.17.0_6.17.0-7.7_all.deb

ratarmount -r linux-source-6.17.0_6.17.0-7.7_all.deb

That takes 2+ minutes and 3+ GB of RAM on my notebook. The non-recursive approach does better at 10 seconds and 0.7 GB of RAM:

ratarmount linux-source-6.17.0_6.17.0-7.7_all.deb manual
ratarmount manual/control.tar.zst control
ratarmount manual/data.tar.zst data
ratarmount data/usr/src/linux-source-6.17.0.tar.bz2 source

Running time (ls -R linux-source-6.17.0_6.17.0-7.7_all/ | wc) was 3x faster with the recursive mount.

It would be great if ratarmount took far fewer resources on mount :) But that is a different story. The recursive mode is sometimes borderline unusable.

jendap avatar Dec 05 '25 20:12 jendap

Honestly, this is an extreme case, even though it is a real file and therefore a real use case. I am not sure I can do much about this.

Layer 1: DEB

First, you have the deb archive linux-source-6.17.0_6.17.0-7.7_all.deb containing:

total 180M
-rw-r--r-- 1 root root  785 Oct 18 08:01 control.tar.zst
-rw-r--r-- 1 root root 180M Oct 18 08:01 data.tar.zst
-rw-r--r-- 1 root root    4 Oct 18 08:01 debian-binary

DEB archives are implemented with the libarchive backend. Unfortunately, these have the same runtime scaling complexity problems that archivemount suffers from because libarchive does not yet support constant-time seeking, which the recursive analysis might require. Adding the libarchive backend was nice to have some support for most formats, but unfortunately, it completely sacrificed all performance optimisations I implemented for TAR in particular.
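
To illustrate what the lack of constant-time seeking means in practice, here is a minimal sketch using the libarchive-c Python bindings (an illustration of the access pattern, not ratarmount's actual backend code): every lookup has to iterate over entries from the beginning of the archive, so reading each of N members on demand costs O(N) scans, i.e. O(N²) overall.

import libarchive  # pip install libarchive-c

def read_member(archive_path, wanted_path):
    # libarchive only offers sequential iteration over entries; there is no
    # constant-time seek to an arbitrary member, so every read scans from the start.
    with libarchive.file_reader(archive_path) as archive:
        for entry in archive:
            if entry.pathname == wanted_path:
                return b"".join(entry.get_blocks())
    raise FileNotFoundError(wanted_path)

# e.g. read_member("linux-source-6.17.0_6.17.0-7.7_all.deb", "data.tar.zst")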

To the user, it might not be clear that some formats are supported more performantly than others. I think this was discussed in an issue, and I tried to highlight the distinction in the ReadMe, but that section is admittedly not at the top and might be overlooked, and the distinction is only mentioned in the header "supported for random access". What this means could be explained more verbosely.

  • [ ] Deb archives are ar archives. I think I was contemplating custom support with constant-time seeking for that format because it seemed quite similar to TAR and doable, if not easy, to implement; see the sketch below. I can't find the issue; maybe it was only in my personal notes. It might be prudent to first check whether it really is a problem. Because it is uncompressed and mostly contains only a single large file, it might actually not be a performance bottleneck.
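
Because the ar member headers have a fixed 60-byte layout, such an index is cheap to build. A rough stdlib-only sketch (not ratarmount code) that lists member names, data offsets, and sizes without reading any member data:

def list_ar_members(path):
    members = []
    with open(path, "rb") as file:
        if file.read(8) != b"!<arch>\n":
            raise ValueError("Not an ar archive")
        while header := file.read(60):
            # Header layout: name[16] mtime[12] uid[6] gid[6] mode[8] size[10] magic[2].
            name = header[0:16].decode("ascii").strip().rstrip("/")  # GNU ar may append '/'
            size = int(header[48:58].decode("ascii").strip())
            members.append((name, file.tell(), size))  # (name, data offset, size)
            file.seek(size + (size & 1), 1)  # member data is padded to an even offset
    return members

print(list_ar_members("linux-source-6.17.0_6.17.0-7.7_all.deb"))
# e.g. [('debian-binary', 68, 4), ('control.tar.zst', 132, 785), ('data.tar.zst', ...)]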

Layer 2+3: Zstandard + TAR

The deb archive might not even be the problem. Zstandard is unfortunately also unsuitable for random access; see this section. A quick check with zstd -l data.tar.zst:

Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0     179 MiB                       XXH64  mounted/data.tar.zst

shows that this Zstandard file was not prepared for random access. Unfortunately, the skippable frame format for zstd is quite obscure and not enabled by default :(. That's why I can't recommend Zstandard unconditionally. Supporting seeking in such Zstandard files seems impossible to me for now. I went to great lengths to support gzip with rapidgzip. I might be able to extend that approach to LZ4, but I am not sure I am able to extend it to Zstandard, partially because of Zstandard's complexity and the need to basically write a custom decompressor to do random seeking in formats not intended for random seeking.

Layer 4+5:

usr/
├── share
│   └── doc
│       └── linux-source-6.17.0
│           ├── changelog.Debian.gz
│           └── copyright
└── src
    ├── linux-source-6.17.0
    │   └── linux-source-6.17.0.tar.bz2
    └── linux-source-6.17.0.tar.bz2 -> linux-source-6.17.0/linux-source-6.17.0.tar.bz2

At this point, I felt ridiculed by this archive. Not only is it packed twice, no, it also uses two different archive formats and three different kinds of compression to do so... Just why? bzip2- and gzip-compressed TARs are the two best-supported compression formats, but if the upper layers are not seekable in constant time, they will still be excruciatingly slow.

Fortunately, this is the last layer with the contents of that .tar.bz2 file being:

linux-source-6.17.0/
├── arch
│   ├── alpha
│   │   ├── boot
│   │   │   ├── bootloader.lds
│   │   │   ├── bootp.c
│   │   │   ├── bootpz.c
│   │   │   ├── head.S
│   │   │   ├── main.c
│   │   │   ├── Makefile

mxmlnkn avatar Dec 05 '25 22:12 mxmlnkn

My advice would be to unpack the deb and the .tar.zst to a file system and then use ratarmount to mount the innermost .tar.bz2.

A heuristic like this could be implemented in ratarmount, but it might be more trouble than it's worth because it could flood the temporary folder that is used to extract intermediary files.

I am not totally sure how it reaches 3 GB, though. My ballpark guess for memory usage would be more in line with the observed 700 MB because the Zstandard-compressed file might be held completely in memory. But the extracted .tar.bz2 file should not eat all that much memory. The ~100 k files are also not that many; the SQLite index for them is 17 MB, so it should not take that much more memory. Note that SQLite indexes (with file names) for recursive archives are not written to disk and are therefore held only in memory. This makes recursive mounting less memory-efficient, but in this case, I can only explain an overhead of 17 MB with that reasoning.
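
As a rough sanity check on that index-size figure, the size of an in-memory SQLite database can be measured directly. A small sketch with a made-up schema and ~100 k fake paths (not ratarmount's actual index layout):

import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE files (path TEXT PRIMARY KEY, offset INTEGER, size INTEGER, mtime INTEGER)"
)
connection.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    ((f"linux-source-6.17.0/some/longer/path/file_{i}.c", i * 4096, 4096, 0)
     for i in range(100_000)),
)
connection.commit()

page_count = connection.execute("PRAGMA page_count").fetchone()[0]
page_size = connection.execute("PRAGMA page_size").fetchone()[0]
print(f"In-memory index size: {page_count * page_size / 1e6:.1f} MB")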

mxmlnkn avatar Dec 05 '25 22:12 mxmlnkn

The answer to this topic is: thank you, yes! Still, four ratarmount commands are way faster and take less memory than one recursive one. There is an opportunity for improvement, though it is not a killer feature.

The bigger question is:

What is the goal for ratarmount?

For me it is the convenience of working with any archive (as if it were not even there). To that end you've integrated a lot of archive formats into a usable FUSE mount, with SQLite speeding up metadata operations. Thank you! How is the fast content access done? By exploiting the fact that a modern CPU can decode far more bytes than a vintage compression format's lookback window size. That is a cool trick! As a bit of a compression nerd I can really appreciate it! Your reply above is all about how to pull off that trick for ar archives, LZ4, zstd, ...

It is not the only way, though. Would it make sense to just unpack it (to ~/.cache/ratarmount)? Maybe not completely; perhaps keep the tar file instead of individual files. It could stay uncompressed or be recompressed (highly compressible blocks only) if we want to keep it around after the first unmount. Yes, it does take more disk space. But it is nowhere near as costly as RAM is. Such an approach would work regardless of the compressor. It should not be much work to implement and maintain. Is there a seeking-capable fork for a given compressor? Great! But it is also fine if there is none. The utility of working with archives can still go up.

Depends on what your (and ratarmount's) goals are.

BTW: This is not a complaint. I love ratarmount as is! :)

jendap avatar Dec 06 '25 20:12 jendap

What is the goal for ratarmount?

Initially, in 2019, the main goal was to support the trivial use case of nested non-compressed TAR files in constant time, as should be trivially possible when the offsets and sizes of the contained files inside the large archive are known.

Thanks to it being well received by the community, it grew to support first bzip2 and then gzip performantly, and then I added support for all kinds of other formats, even if they are not as performant, just to have some unified solution for archives again, like archivemount. I may have gone a bit too far with that, with recursion, remote files, obscure formats, filesystem transformations like renamings, union mounting, etc... Some combinations of these features are super hard to get as fast as the initial use case of non-compressed (nested) TARs.

I love programs that work out of the box. One of my main goals is to have that for ratarmount as far as possible. Some stuff, such as constant-time seeking in single-frame Zstandard and XZ files, is not possible for me in the mid-term. It might not be impossible, but it could take me years.

But it is nowhere near as costly as RAM is.

You are correct about the cost, but the availability still depends on your system, as always... I know I am a huge outlier with this, but because of some suboptimal formatting, I have ~15 GB of space in /tmp and ~/.cache/ and 5 times more RAM. I still try to reduce RAM usage, but my weird system might make me too lenient about it. The available cache folder size could also be queried dynamically to help ratarmount decide which solution to prefer on any system, no matter how weirdly allocated. Then again, we are back at Zstandard with default settings, for which we do not know the decompressed size beforehand, which again makes it cumbersome to use. One could decompress it once to determine the file size, though. Hopefully, that would still be much faster than writing to disk.
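
A minimal sketch of such a check, assuming the cache lives under ~/.cache/ratarmount (the path and safety factor are made up for illustration, not existing ratarmount behaviour):

import shutil
from pathlib import Path

def can_extract_to_cache(estimated_size_bytes, cache_dir="~/.cache/ratarmount", safety_factor=1.5):
    # Decide whether extracting an intermediary archive to the cache folder is
    # safe, given an estimated (or worst-case) decompressed size.
    cache_path = Path(cache_dir).expanduser()
    cache_path.mkdir(parents=True, exist_ok=True)
    free_bytes = shutil.disk_usage(cache_path).free
    return estimated_size_bytes * safety_factor <= free_bytes

# Example: a 180 MB data.tar.zst that might decompress to roughly 1 GB.
print(can_extract_to_cache(1_000_000_000))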

mxmlnkn avatar Dec 07 '25 11:12 mxmlnkn

I like that story :) It sounds like we have remarkably similar thinking. Where we differ is the RAM and the "could take me years".

A 15 GB /tmp (ramdisk) would be the default in most distros with 32 GB of RAM. "5 times more RAM" = 96 GB? That's well above average :)

When you say "could take me years"... Are you still thinking about forking the zstd codebase and adding the seeking like you did for gzip? That is not needed.

Just unpack the foo.tar.zst to ~/.cache/ratarmount/foo.tar. It is on disk. Are you worried that the original file had a 100x compression ratio and the uncompressed version will fill the disk? Easy: repack it into foo.tar.lz4 (or zstd level 1). Are you worried about recompression being slow? Recompress only the blocks with a high compression ratio. Still slow? Multithreading.

I'm totally fine with having 2 GB in (disk) cache when mounting a 1 GB foo.tar.zst. I would expect < 100 MB of RAM usage, even with the ~50 MB Python overhead, and fast random access.

It would work for everybody but kids. They want to be YouTubers. They record at 4K+ and fill the disk to the last byte. So what? Ratarmount can't mount MP4 anyway :)

jendap avatar Dec 08 '25 11:12 jendap

A 15 GB /tmp (ramdisk) would be the default in most distros with 32 GB of RAM. "5 times more RAM" = 96 GB? That's well above average :)

Yes. I am aware. I also have a notebook with "only" 32 GB, which enables a bit more realistic testing.

When you say "could take me years"... Are you still thinking about forking the zstd codebase and adding the seeking like you did for gzip? That is not needed.

Yes. What is your suggestion?

Just unpack the foo.tar.zst to ~/.cache/ratarmount/foo.tar. It is on disk. Are you worried that the original file had a 100x compression ratio and the uncompressed version will fill the disk? Easy: repack it into foo.tar.lz4 (or zstd level 1). Are you worried about recompression being slow? Recompress only the blocks with a high compression ratio. Still slow? Multithreading.

Yes, I am worried about that. I don't see the need for recompression when I could simply leave it compressed... Well, OK, recompression could be done with known-good parameters, such as level 1, and definitely with the skippable frame format for seeking. That would indeed be a cool idea.
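
A minimal sketch of that idea, assuming the python-zstandard bindings (this is not the official seekable/skippable-frame format and not existing ratarmount code, just the same principle of independent frames plus an external seek table): recompress the cached TAR in fixed-size chunks, one self-contained frame per chunk, and remember the offsets so that a later read only decompresses the frames overlapping the requested range.

import zstandard  # pip install zstandard

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB of uncompressed data per frame

def recompress_seekable(input_path, output_path, level=1):
    # Returns a seek table of (uncompressed offset, uncompressed size,
    # compressed offset, compressed size) tuples, one per frame.
    compressor = zstandard.ZstdCompressor(level=level)
    seek_table = []
    uncompressed_offset = compressed_offset = 0
    with open(input_path, "rb") as source, open(output_path, "wb") as target:
        while chunk := source.read(CHUNK_SIZE):
            frame = compressor.compress(chunk)  # one independent frame per chunk
            target.write(frame)
            seek_table.append((uncompressed_offset, len(chunk), compressed_offset, len(frame)))
            uncompressed_offset += len(chunk)
            compressed_offset += len(frame)
    return seek_table

def read_at(path, seek_table, offset, size):
    # Read uncompressed bytes [offset, offset + size) by decompressing only
    # the frames that overlap the requested range.
    decompressor = zstandard.ZstdDecompressor()
    result = bytearray()
    with open(path, "rb") as file:
        for frame_start, frame_size, compressed_offset, compressed_size in seek_table:
            if frame_start + frame_size <= offset or frame_start >= offset + size:
                continue  # frame does not overlap the requested range
            file.seek(compressed_offset)
            # compress() stored the content size, so decompress() knows the frame length.
            data = decompressor.decompress(file.read(compressed_size))
            result += data[max(0, offset - frame_start):offset + size - frame_start]
    return bytes(result)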

I'm totally fine with having 2 GB in (disk) cache when mounting a 1 GB foo.tar.zst. I would expect < 100 MB of RAM usage, even with the ~50 MB Python overhead, and fast random access.

It would work for everybody but kids. They want to be YouTubers. They record at 4K+ and fill the disk to the last byte. So what? Ratarmount can't mount MP4 anyway :)

I am only half-joking when I reply with "can't mount MP4 yet" 😆 It is mentioned in #109 because, in the end, MP4 is just a container format for video, audio, and subtitle streams plus attachments such as fonts. It might be cool to have implicit demuxing capabilities; whether anyone really needs it, I am not sure. In the same vein of ridiculous-sounding formats, I have this soon-to-be-merged commit for HTML support. At least that is something I personally will use to extract some base64-encoded resources, such as images, from saved single-HTML files.
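
The core of that kind of extraction can be sketched in a few lines of plain Python (a simplified stand-in, not the code from that commit): find data: URLs in the saved HTML and decode their base64 payloads into separate files.

import base64
import re
from pathlib import Path

DATA_URL = re.compile(r'data:(?P<mime>[\w/+.-]+);base64,(?P<payload>[A-Za-z0-9+/=]+)')

def extract_data_urls(html_path, output_dir="extracted"):
    # Pull base64-encoded data: URLs (e.g. embedded images) out of a
    # single-file HTML save and write each one as its own file.
    html = Path(html_path).read_text(errors="replace")
    Path(output_dir).mkdir(exist_ok=True)
    for index, match in enumerate(DATA_URL.finditer(html)):
        extension = match.group("mime").split("/")[-1]  # e.g. "png" from "image/png"
        data = base64.b64decode(match.group("payload"))
        Path(output_dir, f"resource-{index}.{extension}").write_bytes(data)

extract_data_urls("saved-page.html")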

mxmlnkn avatar Dec 08 '25 11:12 mxmlnkn

Yes, please! Throw in a directory keyframes/timestamp.[jpg,webp,avif,...] :-) That would create an infinite stream of bugs though (and that's assuming you use ffmpeg and don't try your own thing).

I believe we're now finally on the same page about what can be done with disk cache! 👍

Data URLs? Cool :-) If you're bored, I can suggest some more stuff to "mount"!

But then again, you should think about the goals and priorities. Things like HTML may be fun and useful; I know, I have a script for it myself :) People already get lost in the ratarmount ReadMe. And as you know, "some combinations of these features are super hard".

You can improve the archive support. You can simplify the ReadMe, add docs, make a website, and package and promote ratarmount as an "archive as a folder" tool. Or you can compete with all sorts of other tools (like the forensic tools) via half-baked features.

And again, it is a fun project. You do not have to think of it as a product. Feel free to do whatever you want. You're doing a great job. Have fun :)

jendap avatar Dec 08 '25 17:12 jendap