magic-wormhole
Do not create a zipfile when sending large amounts of data
Sometimes people have less free space on disk than the size of the zipped data they want to send. In such cases it would be reasonable to stream data and calculate checksums on the fly rather than to create a zipfile first.
Hm, good idea. There's a subtle failure mode we have to be careful about, though, where the transfer gets interrupted partway through. Basically we just need to stream into a temporary directory and then either rename it into place when it succeeds, or delete the fragment if it fails.
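A minimal sketch of that stream-then-rename dance (all names here are hypothetical illustrations, not the actual wormhole code):

```python
import os
import shutil
import tempfile

def receive_into(final_path, stream_files):
    """Stream incoming files into a temp dir, then rename it into place."""
    parent = os.path.dirname(os.path.abspath(final_path)) or "."
    # Create the temp dir on the same filesystem so os.rename() is a cheap move.
    tmpdir = tempfile.mkdtemp(prefix=".wormhole-partial-", dir=parent)
    try:
        stream_files(tmpdir)           # write incoming data under tmpdir
        os.rename(tmpdir, final_path)  # success: move the fragment into place
    except BaseException:
        shutil.rmtree(tmpdir, ignore_errors=True)  # failure: delete the fragment
        raise
```

An interrupted transfer then leaves nothing behind at the destination path: either the rename happened and the directory is complete, or the fragment is gone.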
I'll look into the zipfile module and see if there's an easy way to stream data into it. It might want to do `seek()` on the input stream, which would make it pretty tricky.
Hm, maybe we need to launch `tar` or `zip` in a subprocess, and stream its data to the other side. We wouldn't know how large the zipfile will be ahead of time, though, so the best we could do to advise the receiver about what they're about to get would be to run `du` over the source directory. We'd also need to make sure to not allow a truncated transfer to look like a complete one (the current code knows exactly how many bytes to expect; the new implementation would need to send a special distinct "stop" message at the end).

I suspect subprocesses won't work too well on Windows (well, and it's not like we can expect `tar` to be there), so this mode might fail if either side is on an OS that lacks these tools.
No need to do subprocess. Python has both tar and zip support in the standard library. And the size of the whole directory can be calculated in Python by combining `os.walk` and `os.stat`.
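That pre-computation might look something like this (a sketch, not the actual implementation):

```python
import os

def directory_size(root):
    """Total size in bytes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.stat(path).st_size
    return total
```

Note this is the size of the raw files, not of the eventual archive, so it could only serve as an advisory estimate for the receiver.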
Yeah, but we can only use the stdlib tar/zip libraries if there's a way to make them streaming-friendly:

- On send, we need to create a fake file-like object to pass into `zipfile.ZipFile()` for writing. Each time the zipfile object calls `obj.write(data)`, we need to send that data over the wire. If the zipfile object ever calls `obj.seek()`, then we're hosed (and I assume it will want to seek back to the beginning, after writing out all the data, to fill in the directory table).
- On receive, the `ZipFile` will be calling `obj.read(length)` on our file-like object, which needs to return the compressed data. But the flow control is going in the other direction: the receiver reacts to new data arriving, rather than asking for data to be input. So we'd either have to run the `ZipFile` in a thread somehow, or write the whole stream to a tempfile (so we can be sure it's all available before `obj.read()` gets called), which uses up just as much disk space as before.
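For what it's worth, the send half of this is feasible with the stdlib on Python 3.6+: `ZipFile` falls back to writing data descriptors when the output object doesn't support `tell()`/`seek()`, so a write-only wrapper never gets seeked. A rough sketch (the `send_to_wire` callback is hypothetical):

```python
import io

class WireWriter(io.RawIOBase):
    """Write-only file-like object that forwards all zip output to a callback.

    It deliberately supports neither tell() nor seek(), which pushes ZipFile
    into its streaming (data-descriptor) mode instead of patching headers.
    """
    def __init__(self, send_to_wire):
        self._send = send_to_wire

    def writable(self):
        return True

    def write(self, data):
        self._send(bytes(data))  # e.g. hand the chunk to the transport
        return len(data)
```

The bytes that arrive on the other end still form a valid archive, because the central directory written at the end carries the real sizes and CRCs.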
Running them in a subprocess would let us use stdin/stdout as the file-like object, and deal with the flow-control in the pipe that sits between them.
One approach might be to skip the zipfile entirely and send each file, one at a time, along with a header that indicates everything the zipfile would have carried. So the protocol alternates between e.g. length-prefixed JSON blobs (that include SIZE and PATH and MODE), then SIZE bytes of data (maybe compressed). The receiver would write files to a neighboring temporary directory (tmpdir/PATH) as they arrived. The last record would be a JSON blob with a special "I'm done" flag, which would signal the receiver to move the tempdir into place at the final output path.
We'd lose the benefit of compression across files, unfortunately. But it wouldn't need the temporary zipfile.
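That alternating header/body framing could be sketched like so (the field names and the `write` callback are made up for illustration):

```python
import json
import struct

def pack_header(meta):
    """Frame a metadata dict as a 4-byte big-endian length prefix plus JSON."""
    blob = json.dumps(meta).encode("utf-8")
    return struct.pack(">I", len(blob)) + blob

def send_tree(entries, write):
    """entries: iterable of (path, mode, data); write: callback to the wire."""
    for path, mode, data in entries:
        write(pack_header({"path": path, "mode": mode, "size": len(data)}))
        write(data)  # exactly SIZE bytes follow the header
    # distinct "stop" record, so a truncated stream can never look complete
    write(pack_header({"done": True}))
```

The receiver would do the mirror image: read a length prefix, parse the JSON, then consume exactly `size` bytes into `tmpdir/PATH`, until the `done` record arrives.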
Please check this example. https://gist.github.com/bofm/8a34f5e8857d21a84c4e51799fc527f2
Ah, thanks for the pointer, that's super helpful. I didn't know you could get compression out of a `TarFile` like that.
I think we may still need to put the pack/unpack into a separate thread, since I think `tar.extractall()` will do a blocking read of the file-like socket wrapper, and that won't work when the Twisted reactor is in the driver's seat (it's a "don't call us, we'll call you" sort of arrangement). Same thing for `tar.add()`: that will do a blocking write, and when the write buffer is full, it will just halt everything.
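As a stdlib-only illustration of the thread idea (ignoring Twisted specifics), the blocking `tarfile.extractall()` can live in a worker thread reading from a pipe, while the other side feeds bytes into the write end; in real reactor code the feeding itself would have to go through something like `deferToThread` so the event loop never blocks:

```python
import os
import tarfile
import threading

def start_unpacker(dest):
    """Run a blocking tar extraction in a thread; returns (feed, finish)."""
    rfd, wfd = os.pipe()
    reader = os.fdopen(rfd, "rb")

    def work():
        # "r|*" is tarfile's streaming mode: sequential reads, no seeks
        with tarfile.open(fileobj=reader, mode="r|*") as tar:
            tar.extractall(dest)

    thread = threading.Thread(target=work)
    thread.start()

    def feed(chunk):
        os.write(wfd, chunk)  # backpressure: blocks when the pipe buffer fills

    def finish():
        os.close(wfd)         # EOF lets the tar reader terminate cleanly
        thread.join()

    return feed, finish
```

The pipe provides the flow control for free; the open question in a Twisted context is how to propagate that backpressure up to the transport without stalling the reactor.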
I'll poke around the twisted repo to see if there are any examples of interfacing with `TarFile` or `ZipFile` like this, or a more generic pattern of doing the blocking filter/transformation function in a separate thread, with suitable flow control.
What's the current state of this? A `--no-zip` option would be great when sending anything bigger than a few GBs.
@jungerm2 This requires backwards-incompatible changes to the file transfer protocol. It will hopefully happen eventually, but not any time soon. FWIW, the Rust implementation works around this with an ugly hack involving tarballs. (The data is streamed at the cost of having to read it twice while sending and not being able to automatically unpack it on the receiving end, although the latter could probably be fixed.)
@piegamesde Why would it require an incompatible change? Just stream the zip. The standard zip utility shipped with my distro explicitly supports sending the output of the `zip` command to standard out. Sure, you burn twice as much CPU, but CPU is "renewable" and SSD is not. Writing a bunch of 32GB zip files is going to cause pointless flash wear.
> Just stream the zip.
Unfortunately, the protocol needs to know the size of the compressed archive in advance, so this nice idea gets non-trivial really quickly. There are two ways to achieve this:
- Stream-zip the file once to calculate the size, then stream-zip again for sending.
  - Apart from requiring twice the read IO and CPU compression time, this has the disadvantage that the archive generation must be deterministic. Additionally, one needs to be really careful around live file system changes.
- Disable the compression and calculate/predict the zip file size in advance based on metadata, then stream and send.
I've done both in wormhole-rs, albeit for the tarball hack. The first was really buggy and resulted in a lot of transfer errors. The second works pretty well, but only because tar is a dead simple format.
So if you can find a way to predict the size of a zip file based on file names+sizes and then stream a zip file that will have precisely that size, we could make this work.
(Actually, there is a third option: have a conservative estimate of the zip file size, and if the streamed zip turns out to be smaller than advertised then pad the rest with zeroes in such a way that it doesn't affect the zip file's contents. It's a bit wasteful but might work. Haven't tried that yet though)
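For the metadata-only prediction, the arithmetic for a plain STORED zip is simple enough to sketch. This deliberately ignores zip64, extra fields, data descriptors, and archive comments, so it only holds for small archives where none of those appear (a hypothetical helper, not something wormhole does today):

```python
def predict_stored_zip_size(entries):
    """Predict the byte size of an uncompressed (STORED) zip from metadata.

    entries: iterable of (filename, file_size) pairs. Assumes no zip64
    records, no extra fields, no data descriptors, and no comments.
    """
    total = 0
    for name, size in entries:
        n = len(name.encode("utf-8"))
        total += 30 + n + size  # local file header + file data
        total += 46 + n         # central directory entry
    return total + 22           # end-of-central-directory record
```

Once any of the ignored features kick in (files over 4GB, many thousands of entries), the arithmetic gets hairier, which is roughly why libraries that do this properly are non-trivial.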
Contributions in this area are welcome, although personally my focus is on building protocol improvements which don't require any of these hacks.
I'm working on the streaming-twice solution, with compression disabled, because in my specific use case the files are individually compressed already (JPEGs) so there is zero point in trying to compress them again. I tried using the builtin zipfile module first but couldn't figure out how to convince it not to do seeks (even though streaming is supposed to be supported) so instead I switched to the stream-zip library. I also considered using the external zip command but realized that might not be ideal on weird niche operating systems that don't support such standard tools, like Windows.
I started work with the assumption that while an upper approximation of the size should be reasonable, there are probably buggy, badly-written clients that will error if the actual size is not the same.
And yeah, a better protocol is what would be best, but I just wanted to send this one photo folder to a friend, so I thought a quick hack on my side would be an adequate workaround.
There is a better protocol most-of-the-way designed, and a prototype of (some of) it here: https://github.com/magic-wormhole/magic-wormhole/pull/445
If you're interested in pushing this forward further I'm happy to pair-program with you to get started!
The main motivation there is indeed for transferring multiple files (including in both directions, optionally) to better recognize that the "hard work" for the humans is setting up the secure connection in the first place (and give more UX options for client applications).
Sadly I'm pretty busy with other projects; I was hoping a quick fix would be enough for this bug. I'll try looking into it, but I can't promise anything long term. I'll try to finish my quick fix once I get back to my usual work machine.
Something that might be useful here is `zipstream-ng`. It's a Python library that makes streaming zip files with an up-front size calculation (that doesn't involve double-compression) pretty easy. It also handles using Zip64 extensions when adding large/many files. (Disclaimer: I wrote it.)
Quick example:

```python
from zipstream import ZipStream

# create stream from directory of files
zs = ZipStream.from_path("/path/to/dir/")

# calculate the final size of the stream (without reading any file data)
compressed_size = len(zs)

# read file data as the zip is generated
for chunk in zs:
    pass  # <write chunk to fp, send over network, etc>
```
Note that because how well files compress cannot be predicted without reading them first, calculating the final size only works for uncompressed zip files. If compression is more important than knowing the final size, the `ZipStream.from_path` function also takes `compress_type` and `compress_level` parameters, at the cost of `len` no longer working for the resulting stream.
If there's interest in this, I can probably whip up at least a proof of concept.
Proof of concept here: https://github.com/magic-wormhole/magic-wormhole/compare/master...pR0Ps:magic-wormhole:feature/zipstream-ng
I tested sending ~90GB and the transfer started immediately and finished successfully with minimal memory usage. Speed was about the same as sending a non-zipped file.
Note that this is only a proof of concept, not something I meticulously crafted and recommend merging as-is. I'd be open to working with someone who knows the codebase to turn it into an actual PR if there's interest. Or feel free to just take this as a starting point and run with it.
Also, if there are features that zipstream-ng is missing that would make a proper implementation of this easier/cleaner/better, let me know, I'm open to discussing them.
EDIT: Pulled the helper classes out into a library (`iterable-io`) and refactored the PoC to use it instead.
EDIT: Cleaned up the code somewhat and submitted a PR: #503