
Download crates into uncompressed or lightly compressed container archive

DrChat opened this issue 3 years ago · 5 comments

Windows is very slow when handling many small files, and the entire crates mirror totals nearly 1 million of them. As you can imagine, that makes it extremely painful to transfer the mirror via normal filesystem operations such as copy and paste.

However, one way to circumvent this cost is to wrap the small files in a container file, such as a zip archive. You don't even have to compress it - just having a single file to copy will make the file transfer much more efficient.
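
To make the "container without compression" idea concrete, here's a minimal sketch using the zip crate's Stored method (0.6-era API; the function and paths are illustrative, not panamax's actual code):

```rust
use std::fs::File;
use std::io::{copy, BufWriter};
use zip::{write::FileOptions, CompressionMethod, ZipWriter};

// Pack files into a zip with no compression at all: the archive is just
// a single-file container, so copying it avoids per-file overhead.
fn pack_uncompressed(crate_paths: &[&str], out: &str) -> zip::result::ZipResult<()> {
    let mut zip = ZipWriter::new(BufWriter::new(File::create(out)?));
    let opts = FileOptions::default().compression_method(CompressionMethod::Stored);
    for path in crate_paths {
        zip.start_file(*path, opts)?;
        copy(&mut File::open(path)?, &mut zip)?;
    }
    zip.finish()?;
    Ok(())
}
```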

It'd be a really nice quality-of-life feature if panamax offered this in the future.

DrChat commented Sep 10 '21

Yeah, I like this idea! When I ran panamax on ZFS, I had a lot of the same slowness issues. It was basically the sole reason I use ext4 on Linux, which is unfortunate.

I previously avoided any sort of zipping because I'd normally serve the mirror directory directly via nginx, with no need for panamax on the server side. However, with panamax serve, these files could even be served directly from an uncompressed zip file. The only tough part would be the incremental updates via sync, which would likely mean multiple zip files.

It would be dead simple to just append more files into a zip. However, for anyone transferring the mirror directory from one machine to another (for example with rsync), that would mean copying the entire zip every time. And if serve served directly from these zip files, it would need some sort of index to figure out which zip file to read from (or it would need to scan through all the zips, which I'm sure would slow things down).
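
For what it's worth, that index doesn't have to live inside the archives: serve could scan each zip once at startup and build an in-memory map. A hedged sketch of what that might look like (assuming the zip and anyhow crates; the directory layout and names are my guesses, not existing panamax code):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::path::{Path, PathBuf};
use zip::ZipArchive;

// Scan every delta zip once and remember which archive holds each entry,
// so a request only ever opens the one zip it needs. Later zips win, so
// an entry re-packed in a newer delta shadows older copies.
fn build_index(zip_dir: &Path) -> anyhow::Result<HashMap<String, PathBuf>> {
    let mut zips: Vec<PathBuf> = std::fs::read_dir(zip_dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |ext| ext == "zip"))
        .collect();
    zips.sort(); // oldest first, assuming timestamped names
    let mut index = HashMap::new();
    for path in zips {
        let archive = ZipArchive::new(File::open(&path)?)?;
        for name in archive.file_names() {
            index.insert(name.to_owned(), path.clone());
        }
    }
    Ok(index)
}
```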

But perhaps I'm overcomplicating things. ROMT has pack and unpack commands that basically download all the new crates and load them into a tarball; you then copy the tarball and unpack it on the other side.

Maybe panamax could have a sync --pack-crates command argument that loads new crates into a zip file, then another unpack command that loads that zip file into the crates directory (or does something fancy like add them to a "master" zip file so they're all in one place).

Do you have any thoughts on any of these ideas? I definitely agree that using a container for crates is a good idea.

k3d3 commented Sep 10 '21

Hmm... good points! Perhaps there is a way to handle differential updates? My use case is copying the entire mirror to an airgapped system, where the airgapped system already has a base image of the entire crates repository. For that, I was even thinking of an option to create a delta update from one crates.io-index SHA-1 hash to another (and since crates can really only be added, it should be straightforward).
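
Since crates.io-index is a plain git repository, that delta could be computed with git2 by diffing two commits; roughly every changed index file corresponds to a crate with new versions to pack. A hedged sketch (the function is mine, not an existing panamax API):

```rust
use git2::{Oid, Repository};

// List the index files that changed between two crates.io-index commits.
// Because the index is effectively append-only, these name exactly the
// crates whose new .crate files a delta archive would need to include.
fn changed_index_files(path: &str, old: &str, new: &str) -> Result<Vec<String>, git2::Error> {
    let repo = Repository::open(path)?;
    let old_tree = repo.find_commit(Oid::from_str(old)?)?.tree()?;
    let new_tree = repo.find_commit(Oid::from_str(new)?)?.tree()?;
    let diff = repo.diff_tree_to_tree(Some(&old_tree), Some(&new_tree), None)?;
    Ok(diff
        .deltas()
        .filter_map(|d| d.new_file().path().map(|p| p.to_string_lossy().into_owned()))
        .collect())
}
```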

Multiple zip files sounds like a potential solution, though it does make me wonder if there are any archive formats that play well with rsync for delta updates on a single archive?

DrChat commented Sep 10 '21

Unfortunately, I don't think there's any specific archive format that would work well with rsync, since (I believe) it just checks the file modification time and uploads the file if it's newer. If your machine is airgapped, I assume a better example of a transfer method would be a USB stick.

So I have an idea on how to handle this - let me know what you think:

Since we already have the ability to download crates as a delta, we could add a parameter to panamax sync like I mentioned in the previous comment. What we could then do is create a new directory called /packed.

Each time you run panamax sync --pack-crates, a new zip file (e.g. crates_2021_09_10-08_49_00.zip) is created in this /packed directory. Each call to sync creates a new zip file, each one a delta of the crates added since the last sync.

Then, on the other side, you run panamax unpack, which reads all the zip files in /packed and extracts them into /crates. This means nginx can still be used, and panamax serve doesn't need to do any crazy zip trickery.

And that's it. The deltas would work the same as sync without the parameter because crates.io-index tracks the commit position of the master branch vs origin/master, and updates master on successful sync. As long as it treats saving crates directly the same as saving crates to a zip, that should be fine.

Also, the zip filename is based on date and time, but it could also be based on commit ID(s) too. I'm not sure if that would buy much though.
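
A hedged sketch of roughly what those two halves could look like (assuming the chrono, zip, and anyhow crates; none of this is existing panamax code):

```rust
use std::fs::File;
use std::path::Path;
use zip::ZipArchive;

// Name for the next delta zip, e.g. crates_2021_09_10-08_49_00.zip.
fn next_pack_name() -> String {
    chrono::Utc::now().format("crates_%Y_%m_%d-%H_%M_%S.zip").to_string()
}

// `panamax unpack`: extract every zip in /packed into /crates, oldest
// first, so later deltas overwrite earlier ones where they overlap.
fn unpack(packed: &Path, crates: &Path) -> anyhow::Result<()> {
    let mut zips: Vec<_> = std::fs::read_dir(packed)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |ext| ext == "zip"))
        .collect();
    zips.sort(); // timestamped names sort chronologically
    for path in zips {
        ZipArchive::new(File::open(&path)?)?.extract(crates)?;
    }
    Ok(())
}
```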


Now, with this idea in mind, I think the USB stick based solution would go something like this:

  • On the internet side, you run panamax init and panamax sync --pack-crates. This will give you the rustup files, as well as one big zip file to start off.
  • You copy the mirror directory to USB stick simply with robocopy /mir. That should add files that don't exist, update files that are newer in the source, and delete files that no longer exist in the source.
  • On the isolated side, you copy (via robocopy /mir /xd crates) from USB to local directory. That should mirror the same way, but not delete any files that exist in /crates.
  • Next, run panamax unpack to unpack all zips from /packed into /crates.

The only issue I see with this solution is that old zip files in /packed wouldn't get purged automatically, taking up unnecessary space on the isolated side. Even if panamax unpack deleted zip files after completion, robocopy would just copy them back into place.

There would need to be a way to state on the internet side that an unpack was successful (or that the zip was deleted), which I guess could be an exercise left to the reader.

k3d3 commented Sep 11 '21

I think one of the complications of that previous idea is that it mixes copying an updated directory with copying a new, completely different delta file.

So here's a second idea:

Rather than packing just the crates, we pack everything that has been updated since the last run. That includes the crates, crates.io-index, and all of the new rustup files, including the history toml files.

Now, when we run panamax sync --pack, everything gets thrown into a packed_2021-09-10_08_49_00.zip file. We don't even have to care about robocopy at all - we just take the new packed zip file, transfer it via USB, run panamax unpack, and that's it.


That would be much nicer from a UX standpoint, but it would be more difficult to code, and it raises some questions. For example, can you store a subset of git commits on disk (for crates.io-index updates)? You might have to copy the whole crates.io-index into the zip file every time.
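
On the subset-of-commits question: git itself can do this with bundles - `git bundle create delta.bundle <old>..master` captures just the commits in that range in a single file, and the other side can fetch from the bundle as if it were a remote. A hedged sketch shelling out to the CLI (as far as I know libgit2, and therefore git2-rs, has no bundle API; the function name is illustrative):

```rust
use std::process::Command;

// Bundle only the index commits made since `old_sha`. On the airgapped
// side, `git fetch index.bundle master` applies them to the local clone.
fn bundle_index_delta(index_repo: &str, old_sha: &str, out: &str) -> std::io::Result<bool> {
    let range = format!("{}..master", old_sha);
    let status = Command::new("git")
        .arg("-C")
        .arg(index_repo)
        .args(["bundle", "create", out, range.as_str()])
        .status()?;
    Ok(status.success())
}
```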

But maybe in the end, it'll end up a cleaner solution.

k3d3 commented Sep 11 '21

So, I'm not sure how practical this idea is if you're looking to stay cross-OS compatible, but why not use a loopback filesystem in a file? On Linux, you can mount any file as a disk, create a filesystem inside it, and read/write files in it just like on any other disk.

I see two ways of handling this:

  1. Let the user handle this themselves. They can create the file, format it with a preferred filesystem, mount it somewhere, and then sync the panamax mirror inside it. When it's time to copy to another system, just unmount, copy the single large file, and remount.
  2. See if there's a Rust library that can do this without mounting anything on the host itself - some kind of virtual filesystem. You wouldn't need to worry about hosting multiple zip files or special cases for delta updates; just open the virtual filesystem (or disk-in-a-file) and do the same operations panamax is already doing (see the sketch after this list).
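
On option 2, such libraries do exist; for example, the fatfs crate implements a FAT filesystem over anything that is Read + Write + Seek, so an ordinary file can act as the disk with no host mounting. A hedged sketch (fatfs 0.3-era API, chosen purely as an illustration; the image is assumed to be pre-formatted, and FAT's limits may make it a poor fit for a real mirror):

```rust
use std::io::Write;
use fatfs::{FileSystem, FsOptions};

// Write a crate file into a filesystem-in-a-file image, no mount needed.
// `image_path` must already contain a formatted FAT volume (see
// fatfs::format_volume for creating one).
fn write_into_image(image_path: &str) -> std::io::Result<()> {
    let img = std::fs::OpenOptions::new().read(true).write(true).open(image_path)?;
    let fs = FileSystem::new(img, FsOptions::new())?;
    let root = fs.root_dir();
    let crates = root.create_dir("crates")?; // opens it if it already exists
    let mut file = crates.create_file("serde-1.0.130.crate")?;
    file.write_all(b"...crate bytes...")?;
    Ok(())
}
```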

Either way, you end up with a single large file that can be easily copied and mounted. I know Windows can do this with VHDX files, and I'm willing to bet there's a way for Linux to read/write a VHDX, so that might be a cross-platform solution.

This doesn't necessarily fix the rsync issue directly, but option 1 works with rsync by simply running rsync on the mounted directory, and option 2 could be a preference within panamax - something like panamax sync --file my-mirror-in-a-file.panamax instead of panamax sync my-mirror-in-a-folder - so the user could choose whichever option they like.

Whichever method you DO choose, however, rsync friendliness can be guaranteed by ensuring that the output file format is always the same for the same inputs. In other words, if I write files A, B, and C to mirrorfile1, and later append D to mirrorfile1, then as long as the A, B, and C portions remain the same, rsync will only transfer portion D of the file from the source to the destination (see the rsync algorithm).

I think a single large zip would be fine, assuming the zip format compresses each file to the exact same output each time. If file A always becomes the exact same compressed bits, it doesn't even matter where it sits in the file: rsync will reuse the existing data in the destination and simply insert changes between the existing blocks.
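
The zip crate supports exactly this append-in-place pattern, which is what keeps the earlier bytes stable for rsync: appending new entries leaves existing entry data untouched and only rewrites the central directory at the tail of the file. A hedged sketch (0.6-era API, illustrative names):

```rust
use std::fs::{File, OpenOptions};
use std::io::copy;
use zip::{write::FileOptions, CompressionMethod, ZipWriter};

// Append new crate files to an existing "master" zip. The bytes of
// entries A, B, and C stay where they were; only the new entries plus
// the rewritten central directory at the end of the file change.
fn append_to_master(master: &str, new_files: &[&str]) -> zip::result::ZipResult<()> {
    let file = OpenOptions::new().read(true).write(true).open(master)?;
    let mut zip = ZipWriter::new_append(file)?;
    let opts = FileOptions::default().compression_method(CompressionMethod::Stored);
    for path in new_files {
        zip.start_file(*path, opts)?;
        copy(&mut File::open(path)?, &mut zip)?;
    }
    zip.finish()?;
    Ok(())
}
```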

In fact, gzip is mentioned in that article as having an 'rsyncable' mode to ensure the file is syncable. I don't believe gzip itself can be used as a container, but perhaps panamax could either store each file, individually gzipped, in a single large container file and serve them by dynamically un-gzipping them, or store them all in a container file and then gzip the whole thing.

I think the approach of putting a bunch of gzipped files inside an uncompressed container file is an interesting idea, considering many web clients support gzip for over-the-wire transfer: your panamax server could just serve them raw and let the client decompress each one on its side. I'm not sure how this would perform, though.
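
A hedged sketch of the per-file gzip half of that, using the flate2 crate (illustrative function; note that .crate files are already gzipped tarballs, so individual gzipping would mainly pay off for the index files rather than the crates themselves):

```rust
use std::fs::File;
use std::io::copy;
use flate2::{write::GzEncoder, Compression};

// Gzip one file on its own. The output can sit in an uncompressed
// container and be served raw with `Content-Encoding: gzip`, letting
// the client do the decompression.
fn gzip_one(src: &str, dst: &str) -> std::io::Result<()> {
    let mut encoder = GzEncoder::new(File::create(dst)?, Compression::default());
    copy(&mut File::open(src)?, &mut encoder)?;
    encoder.finish()?;
    Ok(())
}
```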

I hope some of these options have at least sparked some ideas for you.

damccull commented Jun 22 '23