
Add Tus-Min/Max-Chunk-Size headers

Open Acconut opened this issue 7 years ago • 13 comments

See #89 for the underlying discussion.

@kvz @janko-m

Acconut avatar Aug 30 '16 13:08 Acconut

As part of this PR, I think adding @janko-m as a contributor and bumping the feature version could be cool

kvz avatar Aug 30 '16 13:08 kvz

@Acconut Great work! 👍

I just wanted to briefly discuss an option of making this simpler (maybe you already discussed this with @kvz). Why not make these two headers hard limits for the client, instead of recommendations? So, something like "If the Server specifies Tus-Max-Chunk-Size, the Client MUST NOT send a chunk that is larger than Tus-Max-Chunk-Size" (and the equivalent for Tus-Min-Chunk-Size).

The Client currently cannot know whether these are just recommendations or hard limits, so I feel it should treat them as hard limits anyway, just in case. The client could theoretically be implemented so that, if the bandwidth is good or bad, it tries to upload a chunk outside of a limit, and if the chunk fails it realizes that the limit is hard and doesn't try again. However, in tus-ruby-server I can let the client know that it has hit the limit only after the chunk has already been uploaded, so depending on the chunk size this could noticeably slow down the start of the upload.
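
For illustration, a minimal sketch (in Go) of that conservative strategy; the function name is made up and the advertised limits are treated as hard:

```go
package main

import "fmt"

// clampChunkSize treats Tus-Min/Max-Chunk-Size as hard limits and clamps
// the client's preferred chunk size into the advertised range.
// A limit of 0 means the server did not send the corresponding header.
func clampChunkSize(preferred, min, max int64) int64 {
	if min > 0 && preferred < min {
		return min
	}
	if max > 0 && preferred > max {
		return max
	}
	return preferred
}

func main() {
	// Server advertises 5 MB..100 MB; the client would prefer 1 MB chunks.
	fmt.Println(clampChunkSize(1<<20, 5<<20, 100<<20)) // 5242880
}
```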

What do you think about this?

@kvz Thanks! ❤️

janko avatar Aug 30 '16 18:08 janko

@kvz: As part of this PR, I think adding @janko-m as a contributor and bumping the feature version could be cool

This will not be done in this PR, but in the one for the 1.1 release.

@janko-m: Great work!

Thank you for initiating this change :)

@janko-m: Why not make these two headers hard limits for the client, instead of recommendations? So, something like "If the Server specifies Tus-Max-Chunk-Size, the Client MUST NOT send a chunk that is larger than Tus-Max-Chunk-Size" (and the equivalent for Tus-Min-Chunk-Size).

I dislike this idea because it would push too much power to the Server. As we seemed to agree in #89, the Client should always have the last word in determining how big a single chunk will be and, in my opinion, this also includes the ability and right to choose a size outside of the numbers outlined by the Tus-Min/Max-Chunk-Size headers. Such a scenario probably will not appear often in reality, but it may turn up, for example when the bandwidth is just too bad or the device's memory is too small.

The client could theoretically be implemented so that, if the bandwidth is good or bad, it tries to upload a chunk outside of a limit, and if the chunk fails it realizes that the limit is hard and doesn't try again.

Of course, Clients could include this functionality, but I would rather not push authors towards implementing Clients which violate the specification, if that's avoidable.

The Client currently cannot know whether these are just recommendations or hard limits, so I feel it should treat them as hard limits anyway, just in case.

I agree that we should include a statement recommending that these numbers be used as hard limits, if possible. Please refer to my two examples above for cases in which that may not be possible.

However, in tus-ruby-server I can let the client know that it has hit the limit only after the chunk has already been uploaded, so depending on the chunk size this could noticeably slow down the start of the upload.

If the Client chooses a chunk size outside of the recommended range, it should be prepared to see it rejected. Furthermore, this is a known issue in the HTTP architecture and is also the reason why the Expect: 100-continue request header and the 100 Continue status code were introduced. Proper use of these allows the Server to tell the Client whether it is going to accept a request based on the request's headers alone, without receiving the body. This should allow you to check the body's size before the chunk is transferred.
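
For illustration, a rough sketch of how a Client could use this mechanism with Go's net/http (the upload URL and chunk size are made up):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// When the transport sees "Expect: 100-continue", it waits (up to the
	// timeout) for the server's interim response before sending the body,
	// so an oversized chunk can be rejected without being transferred.
	client := &http.Client{
		Transport: &http.Transport{ExpectContinueTimeout: time.Second},
	}

	chunk := bytes.Repeat([]byte("x"), 8<<20) // example 8 MB chunk
	req, err := http.NewRequest("PATCH", "https://tus.example.com/files/abc", bytes.NewReader(chunk))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Tus-Resumable", "1.0.0")
	req.Header.Set("Upload-Offset", "0")
	req.Header.Set("Content-Type", "application/offset+octet-stream")
	// Ask the server to vet the headers (incl. Content-Length) first.
	req.Header.Set("Expect", "100-continue")

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```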

I am looking forward to your feedback. Also, @kvz and @AJvanLoon, thank you very much for your thoughts; I will address them! ❤️

Acconut avatar Aug 31 '16 21:08 Acconut

I dislike this idea because it would push too much power to the Server. As we seemed to agree in #89, the Client should always have the last word in determining how big a single chunk will be and, in my opinion, this also includes the ability and right to choose a size outside of the numbers outlined by the Tus-Min/Max-Chunk-Size headers. Such a scenario probably will not appear often in reality, but it may turn up, for example when the bandwidth is just too bad or the device's memory is too small.

You're right, I agree.

I agree that we should include a statement recommending that these numbers be used as hard limits, if possible.

That would be perfect 👍

janko avatar Sep 01 '16 04:09 janko

the Client should always have the last word in determining how big a single chunk will be and, in my opinion, this also includes the ability and right to choose a size outside of the numbers outlined by the Tus-Min/Max-Chunk-Size headers

I'm not 100% sure about this one. I could imagine cases where the Server knows upper or lower bound limits to the chunk size it can accept. I think the Client should have all the freedom to estimate the best chunk size within this range. But what good would it do to leave the Client the freedom to overshoot the Server's boundaries? The protocol then allows for more errors to happen than if the client were made to always respect those ranges. Let the record show that the server could have very wide ranges by default, even say from 0 to -1, meaning unlimited. But when the Server author decides to limit it for whatever practical reason, it would be nice to know that all clients will adhere to these 'speed limits', while still having the freedom to go lower or higher than average.

kvz avatar Sep 01 '16 08:09 kvz

But what good would it do to leave the Client the freedom to overshoot the Server's boundaries?

Servers will always need good protection against these situations, regardless of whether they come from a misbehaving Client or a malicious DDoS attack. The Server is allowed to reject chunks which fall outside of the recommended size and Clients should be aware of this fact. They may achieve this partially using the Expect: 100-continue request header and the 100 Continue status code.

Furthermore, I think in most cases the device running the Client will be more limited in terms of available hardware than the Server. This is especially true for smartphones, embedded devices, or microcomputers running in rural areas in Africa, which we saw experimenting with tus some time ago. Therefore, the Client will usually be the entity which slows the upload down, I assume.

But when the Server author decides to limit it for whatever practical reason, it would be nice to know that all clients will adhere to these 'speed limits',

I agree; that's why we should include a statement recommending that these numbers be used as hard limits, if possible. Please see my comment above for more context.

while still having the freedom to go lower or higher than average.

Could you please explain what you mean by average? Do you refer to the recommended upper and lower chunk sizes?

Acconut avatar Sep 01 '16 08:09 Acconut

DDoS-ing aside, for which I feel protection has a limited place inside tusd and belongs rather in HAProxy or the network layers above it:

The Server is allowed to reject chunks which fall outside of the recommended size and Clients should be aware of this fact.

How will the client know what the boundaries are if the protocol suggests the client may overshoot them?

recommending that these numbers be used as hard limits, if possible

I feel this is ambiguous. The client should treat the server's upper and lower bounds as hard limits that it is not allowed to overshoot. The server is free to set these values to actual hard limits, or to a vague estimation of them. It's just nice to be able to set, from the server side, the range that clients will try.

I know that in Africa phones are low on bandwidth and all that, but I think the case is more that, when servicing a million devices, you can squeeze more performance out of a single tusd server if you can control the range in which they set their chunk sizes. If you just provide a vague recommendation, chances are half of your devices are just going to ignore it, since the protocol isn't tough on this. That means you have the choice of either erroring out hard to influence behavior, at the risk that clients don't automatically adjust, basically denying them service, or silently tolerating it.

kvz avatar Sep 01 '16 09:09 kvz

As somebody who has spent a few hours scratching his head trying to figure out why setting a client-side chunk size completely broke uploads, only to eventually figure out the failures were because I was sending chunks smaller than S3 accepts, I'm definitely in favour of a way for the server to at least suggest an acceptable chunk size range to the client. The presence of a min chunk size header would've been incredibly helpful in determining the issue quickly!

For reference, the S3 docs specify the following chunk size:

5 MB to 5 GB, last part can be < 5 MB

The latter part of that quote makes the discussion around whether the headers should be hard or soft limits a little more complicated - for instance, if they were to be hard limits, how should a client act if presented with a file which can't be split into chunks which all fit within the server-specified min/max? Although that seems like a reasonably unlikely situation, it's worth keeping in mind - even if only to document it. A couple of options spring to mind:

  • The simplest option would probably be for the protocol to specify that the last chunk can always be smaller than the minimum chunk size, though that might risk alienating any storage back end which doesn't have the "last part can be smaller than min chunk size" stipulation which S3 has (if any such storage back end exists)
  • Alternatively, another new header (along the lines of Tus-Accepts-Small-Last-Chunk) could be introduced to cover all bases: if the server is using S3 (for instance), it can send this header to say that undersized final chunks are accepted. Since this is probably the more likely case, it might be better to invert this to something like Tus-No-Undersized-Chunks for any server which doesn't allow the last chunk to be under the limit

Thinking further on the header option, perhaps it could be used as a way for the server to say whether the limits are soft or hard (a sketch of client-side handling follows this list):

  • Tus-Undersized-Chunks could be:
    • omitted or any to say that Tus-Min-Chunk-Size is a soft limit,
    • none to say that Tus-Min-Chunk-Size is a hard limit (with no exceptions), or
    • last to say that Tus-Min-Chunk-Size is a hard limit (with the exception of the last chunk, per S3)
  • Tus-Oversized-Chunks could be:
    • omitted or any to say that Tus-Max-Chunk-Size is a soft limit, or
    • none to say that Tus-Max-Chunk-Size is a hard limit
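
To illustrate, a sketch of how a client might interpret the proposed header; note that Tus-Undersized-Chunks is purely hypothetical at this point:

```go
package main

import (
	"fmt"
	"net/http"
)

// minSizeIsHard interprets the proposed (hypothetical) Tus-Undersized-Chunks
// header and reports whether Tus-Min-Chunk-Size must be treated as a hard
// limit for the chunk at hand.
func minSizeIsHard(h http.Header, lastChunk bool) bool {
	switch h.Get("Tus-Undersized-Chunks") {
	case "", "any":
		return false // soft limit
	case "last":
		return !lastChunk // hard limit, except for the final chunk (S3-style)
	case "none":
		return true // hard limit, no exceptions
	default:
		return true // unknown value: be conservative
	}
}

func main() {
	h := http.Header{}
	h.Set("Tus-Undersized-Chunks", "last")
	fmt.Println(minSizeIsHard(h, true))  // false: the final chunk may be undersized
	fmt.Println(minSizeIsHard(h, false)) // true
}
```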

pauln avatar Nov 19 '17 22:11 pauln

To make sure I understand: tus-js-client needs to send big enough chunks so that e.g. tusd can send those same, big enough chunks, to s3?

If so I would suggest these should be decoupled at the expense of buffering on the tusd server. You should be able to define chunk sizes from tus-js-client towards tusd regardless of storage backend imho.

But perhaps I'm misunderstanding this issue, in which case please clarify further :)

Sent from mobile, pardon the brevity.


kvz avatar Nov 20 '17 08:11 kvz

To make sure I understand: tus-js-client needs to send big enough chunks so that e.g. tusd can send those same, big enough chunks, to s3?

Yes - at least in the case of tusd (other implementations may be able to handle this scenario already).

If so I would suggest these should be decoupled at the expense of buffering on the tusd server.

Buffering might interfere with tusd's ability to horizontally scale, so I'm not sure whether that's likely to be implemented.

You should be able to define chunk sizes from tus-js-client towards tusd regardless of storage backend imho.

That would indeed be ideal, but as long as there's a way for the Tus server (i.e. tusd) to inform the client about (hard) limits, I don't think it's critical.

pauln avatar Nov 20 '17 09:11 pauln

If so I would suggest these should be decoupled at the expense of buffering on the tusd server.

As @pauln said, this is not an easy solution. Buffering on disk will break horizontal scalability (unless you use sticky sessions) and storing the buffers on S3 would probably be too slow and may reduce the performance for all other uploads, or would at least make the algorithm much more complex.

@pauln Thank you very much for your thoughts on this. I am sorry to hear that this consumed so much of your time. This fact is actually documented in tusd (https://github.com/tus/tusd/blob/master/s3store/s3store.go#L52), but one cannot be expected to find it there.

Regarding your proposal, I would like to mention that the S3 store in tusd has flexible minimum chunk sizes, which are calculated for each upload size individually. For example, if you want to upload 5 TB, the minimum size will be 500 MB (the reason is that S3 only allows 10,000 chunks for each multipart upload). How would this play with your idea? Would the 5 MB limit (which holds true for some uploads) be considered a soft limit?
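
For context, a simplified sketch of that calculation; tusd's actual logic in s3store.go is more involved, this only illustrates the arithmetic:

```go
package main

import "fmt"

const (
	s3MinPartSize = 5 << 20 // S3's absolute minimum part size: 5 MB
	s3MaxParts    = 10000   // S3 allows at most 10,000 parts per multipart upload
)

// minPartSizeFor returns the effective minimum part size for a given total
// upload size: the whole upload must fit into at most s3MaxParts parts.
func minPartSizeFor(uploadSize int64) int64 {
	perPart := (uploadSize + s3MaxParts - 1) / s3MaxParts // ceiling division
	if perPart < s3MinPartSize {
		return s3MinPartSize
	}
	return perPart
}

func main() {
	fmt.Println(minPartSizeFor(5_000_000_000_000)) // 5 TB -> 500000000 (~500 MB)
}
```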

Acconut avatar Nov 20 '17 23:11 Acconut

S3 store in tusd has flexible minimum chunk sizes, which are calculated for each upload size individually [...] How would this play with your idea?

Assuming that the min/max chunk sizes would be sent in response to the Creation plugin's POST request, the file size is already sent in the Upload-Length header - which could then be used to determine/calculate what to send back.

Upload-Defer-Length would interfere with using Upload-Length in this way - but if a server has flexible minimum chunk sizes, perhaps it simply shouldn't support creation-defer-length. Alternatively, it could start off sending the absolute hard limit (i.e. 5 MB for S3) and then send revised min/max chunk sizes in response to the PATCH which sets the Upload-Length header (once the length is known) - though that could theoretically cause problems if a very large file's length is deferred. If that's likely to be an issue (such as for tusd using the S3 store), perhaps the server could track relevant metrics and add a revised min chunk size to any arbitrary PATCH request?

Alternatively, if it's desirable to push this kind of logic onto the client as much as possible, perhaps it would be better for the POST response to include some relevant combination of:

  • Tus-Min-Chunk-Size
  • Tus-Max-Chunk-Size
  • Tus-Max-Chunks-Per-File
  • Tus-Undersized-Chunks (as per my earlier comment)
  • Tus-Oversized-Chunks (as per my earlier comment)

It would then be up to the client to track what it's sending (and needs to send) and adjust chunk sizes as necessary.
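
A sketch of what such client-side sizing could look like, assuming the hypothetical headers above were returned on the POST response (the helper is made up for illustration):

```go
package main

import "fmt"

// pickChunkSize chooses a chunk size from the advertised constraints.
// minSize, maxSize and maxChunks would come from the proposed (hypothetical)
// Tus-Min-Chunk-Size, Tus-Max-Chunk-Size and Tus-Max-Chunks-Per-File headers.
func pickChunkSize(uploadLength, minSize, maxSize, maxChunks int64) (int64, error) {
	// Smallest chunk size that still fits the upload into maxChunks parts.
	size := (uploadLength + maxChunks - 1) / maxChunks
	if size < minSize {
		size = minSize
	}
	if size > maxSize {
		// maxChunks and maxSize cannot both be satisfied.
		return 0, fmt.Errorf("upload of %d bytes cannot fit into %d chunks of at most %d bytes",
			uploadLength, maxChunks, maxSize)
	}
	return size, nil
}

func main() {
	// 5 TB upload, 5 MB..5 GB chunks, at most 10,000 chunks (S3-like limits).
	size, err := pickChunkSize(5_000_000_000_000, 5<<20, 5<<30, 10000)
	if err != nil {
		panic(err)
	}
	fmt.Println(size) // 500000000
}
```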

pauln avatar Nov 23 '17 22:11 pauln

Based on my experience with large file transfers in native desktop and mobile apps, Tus-Max-Chunk-Size is very much needed. There's too much middleware between clients and backends that enforces size limits but doesn't play nicely with tus.

Backends should know if they're behind load balancers, network security appliances, tunnel endpoints, etc., and should announce Tus-Max-Chunk-Size to connecting clients.

Clients are free to optimise chunk sizes between Tus-Min-Chunk-Size and Tus-Max-Chunk-Size, based on other criteria.

Some real-world examples were already described in other places:

  • Reverse proxies come with a maximum request size (Source: https://github.com/tus/tusd/blob/master/docs/faq.md#can-i-run-tusd-behind-a-reverse-proxy)
  • Cloudflare has a 100 MB hard limit on HTTP POST request size (Source: https://github.com/tus/tusd/issues/353#issuecomment-590526373)

michaelstingl avatar Mar 18 '21 22:03 michaelstingl