
Do not create zipfile when sending much data

[Open] bofm opened this issue 8 years ago • 15 comments

Sometimes people have less free space on disk than the size of the zipped data they want to send. In such cases it would be reasonable to stream data and calculate checksums on the fly rather than to create a zipfile first.

bofm • Jun 23 '16 09:06

Hm, good idea. There's a subtle failure mode we have to be careful about, though, where the transfer gets interrupted partway through. Basically we just need to stream into a temporary directory and then either rename it into place when it succeeds, or delete the fragment if it fails.
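A minimal sketch of that receive-side pattern (names and paths here are illustrative, not actual wormhole code):

import os, shutil, tempfile

def receive_into(final_path, chunks):
    # stage the download next to the destination so the final rename stays
    # within one filesystem (and is therefore atomic on POSIX)
    parent = os.path.dirname(os.path.abspath(final_path))
    tmpdir = tempfile.mkdtemp(prefix=".wormhole-tmp-", dir=parent)
    try:
        staged = os.path.join(tmpdir, os.path.basename(final_path))
        with open(staged, "wb") as f:
            for chunk in chunks:       # bytes arriving from the network
                f.write(chunk)
        os.rename(staged, final_path)  # success: move into place
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)  # failure: drop the fragment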

I'll look into the zipfile module and see if there's an easy way to stream data into it. It might want to do seek() on the input stream, which would make it pretty tricky.

warner • Jun 28 '16 02:06

Hm, maybe we need to launch tar or zip in a subprocess, and stream its data to the other side. We wouldn't know how large the zipfile will be ahead of time, though, so the best we could do to advise the receiver about what they're about to get would be to run du over the source directory. We'd also need to make sure to not allow a truncated transfer to look like a complete one (the current code knows exactly how many bytes to expect; the new implementation would need to send a special distinct "stop" message at the end).

I suspect subprocesses won't work too well on Windows (well, and it's not like we can expect tar to be there), so this mode might fail if either side is on an OS that lacks these tools.

warner • Jul 12 '16 04:07

No need for a subprocess. Python has both tar and zip support in the standard library. And the size of the whole directory can be calculated in Python by combining os.walk and os.stat.
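For example (a sketch; symlinks and special files are ignored for simplicity):

import os

def directory_size(root):
    """Total size in bytes of the regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.stat(os.path.join(dirpath, name)).st_size
    return total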

bofm • Jul 12 '16 13:07

Yeah, but we can only use the stdlib tar/zip libraries if there's a way to make them streaming-friendly:

  • On send, we need to create a fake file-like object to pass into zipfile.ZipFile() for writing. Each time the zipfile object calls obj.write(data), we need to send that data over the wire. If the zipfile object ever calls obj.seek(), then we're hosed (and I assume it will want to seek back to the beginning, after writing out all the data, to fill in the directory table). A rough sketch of such a wrapper follows this list.
  • On receive, the ZipFile will be calling obj.read(length) on our file-like object, which needs to return the compressed data. But the flow control is going in the other direction: the receiver reacts to new data arriving, rather than asking for data to be input. So we'd either have to run the ZipFile in a thread somehow, or write the whole stream to a tempfile (so we can be sure it's all available before obj.read() gets called), which uses up just as much disk space as before.
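For concreteness, a minimal sketch of the send-side wrapper from the first bullet (send_to_wire is a hypothetical stand-in for the real transport; whether ZipFile accepts an output object without seek() depends on the Python version, though newer versions detect the missing seek() at open time and fall back to writing data descriptors):

import zipfile

class WireWriter:
    """Write-only file-like object that forwards ZipFile's output to the wire."""
    def __init__(self, send_to_wire):
        self._send = send_to_wire   # hypothetical: ships bytes to the peer
        self._pos = 0

    def write(self, data):
        self._send(data)            # stream each chunk as ZipFile produces it
        self._pos += len(data)
        return len(data)

    def tell(self):
        return self._pos            # ZipFile records entry offsets via tell()

    def flush(self):
        pass

# deliberately no seek(); usage would look like:
#   zf = zipfile.ZipFile(WireWriter(send), mode="w", compression=zipfile.ZIP_DEFLATED)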

Running them in a subprocess would let us use stdin/stdout as the file-like object, and deal with the flow-control in the pipe that sits between them.

One approach might be to skip the zipfile entirely and send each file, one at a time, along with a header that indicates everything the zipfile would have carried. So the protocol alternates between e.g. length-prefixed JSON blobs (that include SIZE and PATH and MODE), then SIZE bytes of data (maybe compressed). The receiver would write files to a neighboring temporary directory (tmpdir/PATH) as they arrived. The last record would be a JSON blob with a special "I'm done" flag, which would signal the receiver to move the tempdir into place at the final output path.
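A rough sketch of that framing, assuming a simple send(data) transport callable (hypothetical name):

import json, os, struct

def send_blob(send, obj):
    blob = json.dumps(obj).encode("utf-8")
    send(struct.pack(">I", len(blob)))  # 4-byte length prefix
    send(blob)

def send_tree(send, root):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            # header record: everything the zipfile entry would have carried
            send_blob(send, {"path": os.path.relpath(path, root),
                             "size": st.st_size,
                             "mode": st.st_mode})
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(65536)
                    if not chunk:
                        break
                    send(chunk)          # exactly "size" bytes follow the header
    send_blob(send, {"done": True})      # the distinct "stop" record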

We'd lose the benefit of compression across files, unfortunately. But it wouldn't need the temporary zipfile.

warner • Jul 20 '16 21:07

Please check this example. https://gist.github.com/bofm/8a34f5e8857d21a84c4e51799fc527f2
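(The gist itself isn't reproduced here, but the standard-library pattern it presumably relies on is tarfile's streaming "pipe" modes, which compress and decompress on the fly without ever seeking. A sketch, with sys.stdout.buffer and sys.stdin.buffer standing in for the transport:)

import sys
import tarfile

# send side: "w|gz" is the non-seeking streaming mode with gzip compression
with tarfile.open(fileobj=sys.stdout.buffer, mode="w|gz") as tar:
    tar.add("photos", arcname="photos")

# receive side: "r|gz" likewise reads the stream without seeking
with tarfile.open(fileobj=sys.stdin.buffer, mode="r|gz") as tar:
    tar.extractall("incoming")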

bofm • Jul 26 '16 13:07

Ah, thanks for the pointer, that's super helpful. I didn't know you could get compression out of a TarFile like that.

I think we may still need to put the pack/unpack into a separate thread, since I think tar.extractall() will do a blocking read of the file-like socket wrapper, and that won't work when the Twisted reactor is in the driver's seat (it's a "don't call us, we'll call you" sort of arrangement). Same thing for the tar.add(): that will do a blocking write, and when the write buffer is full, it will just halt everything.

I'll poke around the twisted repo to see if there are any examples of interfacing with TarFile or ZipFile like this, or a more generic pattern of doing the blocking filter/transformation function in a separate thread, with suitable flow control.
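One generic pattern (a sketch, not actual wormhole code) is twisted.internet.threads.deferToThread, which runs the blocking work on a thread-pool thread and hands the reactor a Deferred:

import tarfile
from twisted.internet.threads import deferToThread

def extract_in_thread(stream_fileobj, dest):
    # blocking reads happen on a worker thread, so the reactor keeps running;
    # the missing piece (not shown) is a thread-safe buffer that the reactor
    # fills with incoming data and the blocking read() calls drain
    def _extract():
        with tarfile.open(fileobj=stream_fileobj, mode="r|gz") as tar:
            tar.extractall(dest)
    return deferToThread(_extract)  # the Deferred fires when extraction finishes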

warner • Jul 27 '16 00:07

What's the current state of this? A --no-zip option would be great when sending anything bigger than a few GBs.

jungerm2 • Feb 24 '22 19:02

@jungerm2 This requires backwards-incompatible changes to the file transfer protocol. It will hopefully happen eventually, but not any time soon. FWIW, the Rust implementation works around this with an ugly hack involving tarballs. (The data is streamed at the cost of having to read it twice while sending and not being able to automatically unpack it on the receiving end, although the latter could probably be fixed.)

piegamesde • Feb 24 '22 20:02

@piegamesde Why would it require an incompatible change? Just stream the zip. The standard zip utility shipped with my distro explicitly supports sending its output to standard out. Sure, you burn twice as much CPU, but CPU is "renewable" and SSD is not. Writing a bunch of 32GB zip files is going to cause pointless flash wear.

raingloom • Jul 27 '23 13:07

> Just stream the zip.

Unfortunately, the protocol needs to know the size of the compressed archive in advance, so this nice idea gets non-trivial really quickly. There are two ways to achieve this:

  1. Stream-zip the file once to calculate the size, then stream-zip for sending
    • Apart from requiring twice the read IO and CPU compression time, this has the disadvantage that the archive generation must be deterministic. Additionally, one needs to be really careful around live file system changes.
  2. Disable the compression and calculate/predict the zip file size in advance based on metadata, then stream and send

I've done both in wormhole-rs, albeit for the tarball hack. The first was really buggy and resulted in a lot of transfer errors. The second works pretty well, but only because tar is a dead simple format.

So if you can find a way to predict the size of a zip file based on file names+sizes and then stream a zip file that will have precisely that size, we could make this work.
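(For what it's worth, with compression disabled that prediction is mechanical for zip too, since every stored byte is accounted for by fixed-size headers. A rough sketch, assuming no extra fields, no comments, no data descriptors, and no Zip64 records, so it breaks past 4 GiB or 65535 entries:)

def predict_stored_zip_size(entries):
    """entries: iterable of (archive_name, file_size) pairs."""
    total = 0
    for name, size in entries:
        n = len(name.encode("utf-8"))
        total += 30 + n + size  # local file header + file data
        total += 46 + n         # central directory entry
    return total + 22           # end-of-central-directory record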

(Actually, there is a third option: have a conservative estimate of the zip file size, and if the streamed zip turns out to be smaller than advertised then pad the rest with zeroes in such a way that it doesn't affect the zip file's contents. It's a bit wasteful but might work. Haven't tried that yet though)


Contributions in this area are welcome, although personally my focus is on building protocol improvements which don't require any of these hacks.

piegamesde • Jul 27 '23 15:07

I'm working on the streaming-twice solution, with compression disabled, because in my specific use case the files are individually compressed already (JPEGs), so there is zero point in trying to compress them again. I tried using the built-in zipfile module first but couldn't figure out how to convince it not to seek (even though streaming is supposedly supported), so instead I switched to the stream-zip library. I also considered using the external zip command but realized that might not be ideal on weird niche operating systems that don't support such standard tools, like Windows.

I started from the assumption that while an upper approximation of the size ought to be acceptable in principle, there are probably buggy, badly written clients that will error out if the actual size doesn't match.

And yeah, a better protocol is what would be best, but I just wanted to send this one photo folder to a friend, so I thought a quick hack on my side would be an adequate workaround.

raingloom • Jul 28 '23 13:07

There is a better protocol most-of-the-way designed, and a prototype of (some of) it here: https://github.com/magic-wormhole/magic-wormhole/pull/445

If you're interested in pushing this forward further, I'm happy to pair-program with you to get started!

The main motivation there is indeed transferring multiple files (optionally in both directions), to better recognize that the "hard work" for the humans is setting up the secure connection in the first place (and to give client applications more UX options).

meejah • Jul 28 '23 19:07

> meejah writes:
>
> There is a better protocol most-of-the-way designed, and a prototype of (some of) it here: #445
>
> If you're interested in pushing this forward further, I'm happy to pair-program with you to get started!

Sadly I'm pretty busy with other projects; I was hoping a quick fix would be enough for this bug. I'll try looking into it, but I can't promise anything long term. I'll still try to finish my quick fix once I get back to my usual work machine.

raingloom • Aug 03 '23 22:08

Something that might be useful here is zipstream-ng. It's a Python library that makes streaming zip files with an up-front size calculation (one that doesn't involve double-compression) pretty easy. It also handles using Zip64 extensions when adding large/many files. (Disclaimer: I wrote it.)

Quick example:

from zipstream import ZipStream

zs = ZipStream.from_path("/path/to/dir/")  # create stream from directory of files
compressed_size = len(zs)  # calculate the final size of the stream (without reading any file data)

# read file data as the zip is generated
for chunk in zs:
    ...  # <write chunk to fp, send over network, etc>

Note that because how well files will compress can't be predicted without reading them first, calculating the final size only works for uncompressed zip files. If compression is more important than knowing the final size, the ZipStream.from_path function also takes compress_type and compress_level parameters, at the cost of len() no longer working for the resulting stream.

If there's interest in this, I can probably whip up at least a proof of concept.

pR0Ps • Aug 09 '23 15:08

Proof of concept here: https://github.com/magic-wormhole/magic-wormhole/compare/master...pR0Ps:magic-wormhole:feature/zipstream-ng

I tested sending ~90GB and the transfer started immediately and finished successfully with minimal memory usage. Speed was about the same as sending a non-zipped file.

Note that this is only a proof of concept, not something I meticulously crafted and recommend merging as-is. I'd be open to working with someone who knows the codebase to turn it into an actual PR if there's interest. Or feel free to just take this as a starting point and run with it.

Also, if there are features that zipstream-ng is missing that would make a proper implementation of this easier/cleaner/better, let me know, I'm open to discussing them.

EDIT: Pulled the helper classes out into a library (iterable-io) and refactored the PoC to use it instead.

EDIT: Cleaned up the code somewhat and submitted a PR: #503

pR0Ps • Aug 10 '23 02:08