magic-wormhole-protocols

Dilated File Transfer

Open meejah opened this issue 3 years ago • 17 comments

Specification for a new file-transfer protocol based on Dilation.

meejah avatar Jun 30 '22 22:06 meejah

@felinira thanks for the feedback. If a realtime discussion is useful, happy to meet in Jitsi or IRC or similar...(I'm at UTC -6 currently).

meejah avatar Jul 04 '22 21:07 meejah

p.s. those pink/blue state-machine diagrams come from pseudo-code and are rendered by the Automat (https://github.com/glyph/automat) library ... so while it's certainly not working, I am making some Proof-of-Concept level experiments while developing this

meejah avatar Jul 04 '22 22:07 meejah

@psanford or @jacalz maybe of interest?

meejah avatar Jul 08 '22 22:07 meejah

thanks @Jacalz

I just had a quick read through and it looks sensible to me. I'd personally be in favour of including compression (more specifically ZSTD compression) as part of the protocol at this time. However, I understand that it also might make sense to handle compression later on.

I did look at that, and agree that Zstandard sounds like a good approach.

I left it for a future enhancement (likely as a "feature") for several reasons:

  • this is already pretty big (as @felinira noted)
  • some implementations may wish to not do it (or not implement it) -- hence "feature"
  • there's a bunch of subtle stuff (keep the context for entire directory-offer? spec defaults params, or sender decides, or?)
  • could be security implications (I'm 95% sure "not" for this use-case, but compress-then-encrypt has had some spectacular failures e.g. CRIME)

I would like to have compression though. If others agree that's "not too much more complexity" I could spec that out as well in this draft (as the one and only "feature" for now)...?

Vote with emojis:

  • :tada: leave until later
  • :rocket: compression now

meejah avatar Jul 09 '22 16:07 meejah

If compression is done as a "feature" it still allows implementations to "upgrade slowly".

Idea being, you can implement just "mode=send" and "mode=receive" for essentially feature-parity with classic transfer (this should mean minimal changes to UX etc). This still means multiple files, if the sender offers more than one, so it's not precisely the same...

Then, you can implement "mode=connect" (possibly with very different UX) for multiple, bi-directional transfers. Some tools may never implement this mode (e.g. wormhole receive will only ever do mode=receive).

(If we specify compression) Separate from both the above, you can implement compression if desired. It could also be "implemented" but not used in some situations (e.g. smaller computers) as it will be a "feature".

Perhaps the above should be a section in the spec?

meejah avatar Jul 09 '22 16:07 meejah

In #1, I proposed a simple "format" option for compression. This is fairly simple, under the assumption that the compressed data can be decompressed without requiring any information about the compression settings (this is usually the case). It uses the usual feature intersection which makes it both opt-in and future proof for format additions.
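To make the "feature intersection" concrete, here is a minimal sketch. It assumes each side advertises a list of compression format names; the function name `choose_format` and the format strings are hypothetical and not taken from the spec draft.

```python
def choose_format(ours, theirs):
    """Pick a compression format both peers support.

    `ours` is our list of format names in preference order and `theirs`
    is the list the peer advertised. Returns None when there is no
    overlap, meaning "send uncompressed". All names are illustrative.
    """
    supported = set(theirs)
    for fmt in ours:
        if fmt in supported:
            return fmt
    return None


# A peer that advertises nothing (e.g. an older client) simply gets no compression.
print(choose_format(["zstd", "gzip"], ["gzip", "zstd"]))
print(choose_format(["zstd"], []))
```

Since each side could have a different preference order, the sketch assumes one side (presumably the sender) makes the choice and announces it, rather than both sides computing it independently.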

The only real open question is how to deal with folders. Compressing folders as one stream is very likely to be more efficient (no new dictionary per file, deduplication across files), so we probably want that. The downside is that the individual files cannot easily be separated prior to decompressing. This is not a huge deal IMO, but it makes the "send file header then bytes" approach unfeasible.

I propose to skip the individual file headers in directory transfers and just concatenate the (compressed) bytes. The order of the file is deterministic and dictated by the send offer. The checksum message might be modified so that it is one message with a list of checksums, one for each file.

piegamesde avatar Jul 09 '22 16:07 piegamesde

The downside is that the individual files cannot easily be separated prior to decompressing. This is not a huge deal IMO, but it makes the "send file header then bytes" approach unfeasible.

No, you can still separate the "compressed bytes for file 1" from the "compressed bytes for file 2" as zstandard allows a "flush" operation. I have tested this. Many compression formats also offer their own container format (as does zstd) but you don't have to use these.
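The per-file flush idea can be tried with only the Python standard library. This is a minimal sketch that uses zlib's `Z_SYNC_FLUSH` as a stand-in for Zstandard's flush; zlib is not what is being proposed here, and the function names are made up for illustration.

```python
import zlib


def compress_files(files):
    """Compress several files with ONE shared context, one chunk per file."""
    ctx = zlib.compressobj()
    chunks = []
    for data in files:
        # Z_SYNC_FLUSH pushes out everything buffered so far, so the chunk
        # holds exactly this file's bytes, while the compressor keeps its
        # history window for the files that follow.
        chunks.append(ctx.compress(data) + ctx.flush(zlib.Z_SYNC_FLUSH))
    if chunks:
        # Finish the stream; the trailer rides along with the last chunk.
        chunks[-1] += ctx.flush()
    return chunks


def decompress_files(chunks):
    """Decompress the chunks in order with one shared decompression context."""
    ctx = zlib.decompressobj()
    return [ctx.decompress(chunk) for chunk in chunks]


files = [b"hello wormhole " * 40, b"hello wormhole " * 40 + b"second file"]
chunks = compress_files(files)
assert decompress_files(chunks) == files
```

The receiver still has to decompress the chunks in order, because each one depends on the context built up by the earlier ones.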

...and yes, we'll want to use the same compression context for as much as feasible as that gives better compression (usually).

(This is one for the "there are subtle things" point and partly why I wanted to skip this until later .. I don't think we should include more than one offer per compression context, because some implementations may process each offer / subchannel in a thread and insisting on one compression-context wouldn't allow that).

meejah avatar Jul 09 '22 17:07 meejah

I don't think we should include more than one offer per compression context

I agree, and actually I never thought about that before. To me, different transfers are independent in every sense of the word, and sharing a compression context would create interdependence between them.

In theory, one could make them share compression without making them dependent by providing some kind of pre-trained dictionary prior to the transfer. I don't know how common this is across compression formats, but at least zstd supports it. However, I'm against this for the same reason I rejected that approach for sending individual files: It adds quite a bit of complexity which I'd like to avoid.

piegamesde avatar Jul 09 '22 17:07 piegamesde

Just a random side-note here. Should we consider using protobuf instead of json? It seems to have big benefits in terms of size and performance.*

  • https://blog.mbedded.ninja/programming/serialization-formats/a-comparison-of-serialization-formats/

Jacalz avatar Jul 11 '22 20:07 Jacalz

The problem with ProtoBuf is that it does not have first-class support for self-describing messages: https://developers.google.com/protocol-buffers/docs/techniques#self-description

msgpack is conceptually a drop-in replacement from JSON, whereas trying to use ProtoBuf would be an entirely different beast. I haven't looked at ProtoBuf (& friends) a lot though, so I'd appreciate some analysis on how it could be integrated into Magic Wormhole.

piegamesde avatar Jul 11 '22 20:07 piegamesde

Just a random side-note here. Should we consider using protobuf instead of json? It seems to have big benefits in terms of size and performance.*

As specified, we're only using JSON for the "already built" parts that currently must use it (e.g. version exchange). That has to allow open-ended contents (and anyway would be a new mailbox protocol too). For protocol messages related to just this application protocol the preliminary conclusion is "msgpack". CBOR2 is also similar (and has better performance, IIRC).

I would not consider ProtocolBuffers; it uses pre-compiled specs and generated code, but without the advantages of Cap'n'Proto (essentially its successor) or flatbuffers (more-actually the successor because it's Google too). It supports fewer languages than either.

For "no-parse" options, I'd say "flatbuffers" is probably the front-runner. It supports many more languages, is "no-parse" and doesn't bundle other concerns like cap-n-proto does. It still has the disadvantage of requiring generated code though.

We could use specified messages like this, but consider "features" that have optional fields: then we'd have to have different versions of the flatbuffer specs for each feature. On the one hand, those conceptually exist anyway -- but can be handled by "if" statements easily instead of completely separate flatbuffers specs. (e.g. consider the "thumbnail" example in the "protocol expansion" section -- with msgpack/cbor2 one can do something like "if thumbnails on, include thumbnail in the Offer message" whereas with flatbuffers you'd need OfferThumbnail, OfferNoThumbnail etc (multiplied by all features with optional fields)).
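A rough sketch of the "if" approach, assuming a self-describing encoding where the Offer is just a mapping. The field names (`filename`, `bytes`, `thumbnail`) and the `thumbnails` feature string are hypothetical, and the resulting dict would be handed to whichever msgpack or CBOR encoder is chosen.

```python
def build_offer(filename, size, features, thumbnail=None):
    """Build an Offer mapping, adding optional fields only when negotiated.

    `features` is the set of feature names both peers agreed on. Every
    name used here is illustrative and not taken from the spec draft.
    """
    offer = {"filename": filename, "bytes": size}
    # One "if" per optional field, instead of one schema per combination.
    if "thumbnails" in features and thumbnail is not None:
        offer["thumbnail"] = thumbnail
    return offer


plain = build_offer("cat.png", 1024, features=set())
with_thumb = build_offer("cat.png", 1024, {"thumbnails"}, thumbnail=b"\x89PNG")
assert "thumbnail" not in plain
assert with_thumb["thumbnail"] == b"\x89PNG"
```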

meejah avatar Jul 11 '22 22:07 meejah

I have a very rough proof-of-concept implementation in the Python client and here post updated Automat-produced state-machine diagrams. These are obviously tied to a particular implementation but I think are still generic enough to be useful.

In case it's not obvious, a DilatedFileSender machine is made for each offer and a corresponding DilatedFileReceiver is created on the receiving side (in response to a subchannel open).

This only supports File transfers right now, not directories or any control messages (i.e. "text messages").

[State-machine diagrams: DilatedFileTransfer, DilatedFileSender, DilatedFileReceiver]

meejah avatar Jul 14 '22 18:07 meejah

Checking back in here, although the PoC code doesn't support multiple files yet I think we should decide on https://github.com/magic-wormhole/magic-wormhole-mailbox-server/issues/31 before finalizing this transfer spec (and, perhaps that ticket would really be better in this repository in hindsight).


meejah avatar Sep 20 '22 22:09 meejah

@meejah Dilated? https://en.wikipedia.org/wiki/Dilation

abitrolly avatar Oct 06 '24 08:10 abitrolly

@abitrolly this refers to the "Dilation" feature in Magic Wormhole -- which I guess is supposed to get at the notion that you "make the wormhole bigger and more useful"...?

See https://github.com/magic-wormhole/magic-wormhole/pull/445 for WIP / PoC

meejah avatar Oct 09 '24 18:10 meejah

@meejah that's nice, but also it doesn't explain anything AT ALL. :D

abitrolly avatar Oct 09 '24 18:10 abitrolly

.. but also it doesn't explain anything AT ALL. :D

@abitrolly can you state that as a question? I thought you were asking why "Dilated" is in the title?

meejah avatar Oct 12 '24 16:10 meejah