
Chunking with Streams


I was chatting with jyasskin and agl (explicitly chose not to cc them on this thread, although maybe I should have) over dinner today about new C++ standard library proposals around transforms, which were very similar to streams. One thing in particular stuck out to me: partial writes/reads matter for chunked protocols. Looking at the TLS record format, you can see that there's a MAC on the payload. The standard way to do this with a streaming interface is to slap on the MAC on each write invocation. This would be implicit to the interface, and not explicit anywhere. You'd just have to know that a stream wrapping an SSL connection did this. Which is fine. On the read side, without an internal buffer in the stream, you can't read fewer bytes than are contained in the payload of a TLS record, since you can't partially read bytes until you've verified the integrity of that data via the MAC, which is appended to the record.
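To make that read-side constraint concrete, here's a minimal sketch (in TypeScript, against the web TransformStream API): a transform that must buffer a whole record, payload plus MAC, before it can release any plaintext. recordLength() and verifyMac() are hypothetical stand-ins for real TLS parsing and crypto, and the framing is simplified to match the description above.

```ts
const HEADER_LEN = 5;  // TLS record header: type, version, 2-byte length
const MAC_LEN = 32;    // e.g. HMAC-SHA256

function recordLength(header: Uint8Array): number {
  return (header[3] << 8) | header[4]; // body length lives in header bytes 3-4
}
function verifyMac(_body: Uint8Array): boolean {
  return true; // placeholder: real code recomputes the MAC and compares
}

function recordReader(): TransformStream<Uint8Array, Uint8Array> {
  let pending = new Uint8Array(0);
  return new TransformStream({
    transform(chunk, controller) {
      // Accumulate incoming bytes with any leftover from earlier reads.
      const buf = new Uint8Array(pending.length + chunk.length);
      buf.set(pending);
      buf.set(chunk, pending.length);
      pending = buf;
      // Only release plaintext once a complete record has arrived.
      while (pending.length >= HEADER_LEN) {
        const bodyLen = recordLength(pending);
        if (pending.length < HEADER_LEN + bodyLen) break; // wait for more
        const body = pending.subarray(HEADER_LEN, HEADER_LEN + bodyLen);
        if (!verifyMac(body)) {
          controller.error(new Error('bad record MAC'));
          return;
        }
        controller.enqueue(body.slice(0, bodyLen - MAC_LEN)); // strip the MAC
        pending = pending.slice(HEADER_LEN + bodyLen);
      }
    },
  });
}
```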

Of course, this breaks some of the compositional abstraction of streams. You'd like to just pipe and compose streams willy-nilly without thinking about this, but if you actually care about performance (and, to a lesser degree, correctness, since you may break functionality if the internal buffer is too small to absorb an entire TLS record's payload + MAC), then you will need to think about this at some point.

Just some food for thought. I am writing it up here for posterity's sake. Feel free to close.

willchan avatar Aug 19 '14 02:08 willchan

Hmm, quite interesting. I don't quite see how it breaks the compositional abstraction though, as all of this could be hidden away behind the stream's methods? E.g. write() would do that for you, and read() would make sure to only return chunks of at least the right size. For readInto(), it could reject if the requested number of bytes is too low, or it could use an internal buffer.
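For instance, the write side could look something like this rough sketch, where the MAC-and-frame step lives entirely inside the sink's write() and is invisible to callers. encryptAndMac() and sendRecord() are hypothetical helpers, not part of any real API:

```ts
function tlsWritable(
  encryptAndMac: (payload: Uint8Array) => Uint8Array,
  sendRecord: (record: Uint8Array) => Promise<void>,
): WritableStream<Uint8Array> {
  return new WritableStream<Uint8Array>({
    async write(chunk) {
      // Each write() implicitly becomes one record: payload + MAC.
      await sendRecord(encryptAndMac(chunk));
    },
  });
}
```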

Always love these kinds of threads about the lower-level details of making performant streams that wrap real-world I/O, so thanks for opening this, and looking forward to learning more :D.

domenic avatar Aug 19 '14 21:08 domenic

Sorry, I forgot to explain how it breaks the compositional abstraction. Indeed, write() would do that for you, but there's a performance consequence in how you invoke write() that is not explicit in the API. Specifically, the most common implementation would, for each data chunk passed to write(), encrypt the payload, MAC it, and put that payload + MAC into a TLS record. So if you use tiny write() calls, you get a lot of MAC overhead (both computational cost and protocol byte inefficiency, since the payload:overhead ratio shrinks). If you use huge write() calls, you get gigantic TLS records, which is problematic for latency since the receiver has to receive all the packets of a TLS record before it can verify the MAC and process the payload. I've analyzed this in the wild and have seen significant latency costs.
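Some back-of-the-envelope numbers for that overhead, assuming a 5-byte record header and a 32-byte MAC (e.g. HMAC-SHA256; exact sizes vary by cipher suite):

```ts
const OVERHEAD = 5 + 32; // assumed per-record header + MAC bytes
for (const payload of [64, 1024, 16384]) {
  const pct = (100 * OVERHEAD) / (payload + OVERHEAD);
  console.log(`${payload}-byte writes: ${pct.toFixed(1)}% of bytes are overhead`);
}
// 64-byte writes: 36.6% of bytes are overhead
// 1024-byte writes: 3.5% of bytes are overhead
// 16384-byte writes: 0.2% of bytes are overhead
// ...but a 16384-byte record spans roughly a dozen full-size packets, all of
// which must arrive before the receiver can verify the MAC. That's the
// latency side of the tradeoff.
```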

So basically, the size of the chunk passed into write() can be important, which somewhat breaks the stream abstraction since it's not obvious when using an abstract stream (which may compose other streams) whether or not it is sensitive to this kind of chunking.

On the read side, it's mostly fine. The biggest issue I can think of is if you are using an internal buffering strategy and the buffer is not large enough to hold an entire TLS record. That would lead to deadlock. Then again, if you're implementing a protocol in a web app, you shouldn't be using the higher-level internal buffering strategy in the first place; you should take full control of buffering.
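In today's API, "taking full control of buffering" might look like the following sketch, using a BYOB reader (the successor to the readInto() mentioned above) on a type: 'bytes' stream, so the app supplies a buffer it knows is large enough for a whole record. MAX_RECORD is an assumed bound:

```ts
const MAX_RECORD = 16384 + 5 + 32; // assumed: max payload + header + MAC

async function readWithOwnBuffer(stream: ReadableStream<Uint8Array>) {
  const reader = stream.getReader({ mode: 'byob' });
  let buffer = new Uint8Array(MAX_RECORD);
  for (;;) {
    const { value, done } = await reader.read(buffer);
    if (done || !value) break;
    // ... parse as many complete records as `value` contains, carrying any
    // partial record forward ...
    buffer = new Uint8Array(value.buffer); // reuse the transferred buffer
  }
}
```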

willchan avatar Aug 19 '14 22:08 willchan

Ah I see, that makes sense. Let me posit a naive solution and see what you think.

  1. The stream has a range of chunk sizes it prefers, say 1024-2048 bytes (just guessing here)
  2. If an incoming chunk is <1024 bytes, it holds it until it gets something that puts it over 1024 bytes in order to create a record.
  3. If an incoming chunk is >2048 bytes, it breaks it up into smaller chunks before creating records. (Or, if the result of step 2 ends up >2048 bytes, the same happens.) If you have e.g. a 2049-byte chunk, this results in holding a 1-byte chunk until another 1023 bytes come in, similar to step 2. (A rough sketch of this coalesce-and-split logic follows below.)
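Here's that sketch of steps 1-3, assuming the 1024-2048 byte bounds above (the splitting copies via slice() rather than being zero-copy, which is exactly the concern raised next):

```ts
const MIN = 1024, MAX = 2048; // assumed preferred chunk-size range

function rechunker(): TransformStream<Uint8Array, Uint8Array> {
  let held = new Uint8Array(0); // step 2: bytes waiting to reach MIN
  return new TransformStream({
    transform(chunk, controller) {
      const buf = new Uint8Array(held.length + chunk.length);
      buf.set(held);
      buf.set(chunk, held.length);
      held = buf;
      // Step 3: emit at most MAX bytes at a time; a remainder under MIN
      // (e.g. the 1 byte left from a 2049-byte chunk) stays held.
      while (held.length >= MIN) {
        const n = Math.min(held.length, MAX);
        controller.enqueue(held.slice(0, n));
        held = held.subarray(n);
      }
    },
    flush(controller) {
      // On close, push out the remainder even if it's under MIN.
      if (held.length > 0) controller.enqueue(held.slice());
    },
  });
}
```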

I guess the tradeoff here is latency, as we keep chunks around waiting for them to accumulate to the right size. On the other hand I could imagine this being tunable with heuristics. (E.g., in addition to setting lower byte limits, set upper time limits.) It would also be tricky to implement the splitting of chunks because you'd have to ensure it's done in a zero-copy way.

Also, this assumes that when the user says write(chunk), they're not expressing a desire to write that chunk as an atomic unit, but just a desire to get that data into the underlying sink eventually. That's a bit different from what you might first expect.

Very interesting stuff.

domenic avatar Aug 19 '14 22:08 domenic

Yep, you've nailed the tradeoffs here. The stream implementation can try to be smart and use heuristics to trade off between efficiency and latency. Note that the specific byte sizes you chose make sense for TLS when doing web browsing, but if we're talking about an HTTPS download, where you don't actually care about latency, only about throughput/efficiency, those heuristics are off.

From the standpoint of writing a high performance networking app, I want absolute control, so I'd probably want write(chunk) to write that chunk as an atomic unit. And I'd have to do due diligence myself to understand exactly all the concrete streams getting composed, so I understand the performance consequences. This is indeed what Chromium does when we layer byte streams in C++. We write the extra complicated code that understands all the layers of streams and writes appropriately sized chunks.
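As a sketch of what that kind of absolute control might look like at the app layer, under the assumption that write(chunk) means "one atomic record" (RECORD_PAYLOAD is an invented tuning constant, not anything the API prescribes):

```ts
const RECORD_PAYLOAD = 1400; // assumed: sized so one record fits one packet

async function writeAsRecords(
  writer: WritableStreamDefaultWriter<Uint8Array>,
  data: Uint8Array,
) {
  for (let i = 0; i < data.length; i += RECORD_PAYLOAD) {
    // Relies on write(chunk) meaning "this chunk is one record".
    await writer.write(data.subarray(i, i + RECORD_PAYLOAD));
  }
}
```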

A naive app, on the other hand, is indeed generally not expressing a desire to write that chunk as an atomic unit, and a heuristic solution like the one you propose above may be best for it.

willchan avatar Aug 19 '14 22:08 willchan

Update from 2017: hints about good chunk size propagate down a pipe via desiredSize, but "narrow" sections of the pipe or non-byte sections will lose information. There is some language in the definition of pipeTo() which permits implementations to be intelligent about chunk size.
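A sketch of how a producer might consume that hint, assuming a byte-counting queuing strategy so that desiredSize is measured in bytes; produce() is a hypothetical data source returning up to n bytes, or null when exhausted:

```ts
async function pump(
  dest: WritableStream<Uint8Array>,
  produce: (n: number) => Uint8Array | null,
) {
  const writer = dest.getWriter();
  for (;;) {
    await writer.ready;                   // resolves once the sink wants data
    const want = writer.desiredSize ?? 1; // the chunk-size hint
    const chunk = produce(Math.max(want, 1));
    if (chunk === null) break;
    await writer.write(chunk);
  }
  await writer.close();
}
```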

There is an open question about whether we will add more explicit support for chunking when type: 'bytes' WritableStreams are specced. I'm labelling this issue "writable streams" for that reason. A lot depends on how we choose to implement optimised byte writers, which is currently a wide open question: #680.

ricea avatar Feb 17 '17 10:02 ricea