tusd icon indicating copy to clipboard operation
tusd copied to clipboard

Introduce the concept of an Upload Collection to rate-limit hook firing better

Open tim-kos opened this issue 5 years ago • 5 comments

Is your feature request related to a problem? Please describe. There is one hook fired per second per upload by default. In a multi-upload scenario (let's say a Transloadit Assembly with 50 files), up to 50 hooks are fired every second. This is a lot. draining resources.

Depending on what you want from the hooks, it would be enough to know how many bytes (in sum, across all upload files) were received, versus the sum of bytes expected. And then have this communicated at most once per second.

Describe the solution you'd like It would be really nice if we could introduce the concept of an upload collection, which has its own hooks. The upload collection would consist of:

  • ID
  • EXPECTED_NUM_OF_FILES
  • EXPECTED_BYTES
  • RECEIVED_BYTES
  • ?

... and come with its own hooks. It would fire the collection hooks as soon as a chunk of an upload of that collection was received, communicating these variables. But the hook would then also be fired at most every 1s.

Describe alternatives you've considered One could introduce something like that in the hook system itself and use a local data store like redis within the post-receive hooks to calculate how many bytes in total were received so far and if we need to notify the underlying system about a received chunk.

However, this would still mean the cli/http hooks would be invoked once per second per uploaded file. This is not so big of a problem for cli hooks, but for http hooks it could be hundreds if not thousands of requests in a short time span, just for one upload context.

It would be nice to have first class support for this and limit the number of hook invocations in the first place.

Can you provide help with implementing this feature? Possibly, yes.

Additional context None

tim-kos avatar Apr 18 '19 12:04 tim-kos

I have to say that this is a very Transloadit-specific request and I have never heard something comparable from another person. Especially in conjunction with #269, implementing this could get hairy but let's worry about that later. If I understand it correctly, the goal is to reduce the cost of delivering upload progress events to the application and there are basically two ways to achieve this:

Option A: Group multiple uploads into a collection by putting uploads with a shared meta data value in the same collection. After that, upload progress is only reported for the entire group as you mentioned, bringing down the number of total hook executions.

Option B would be to reduce the cost of a single hook execution. You mention that HTTP hooks would hammer an endpoint, however, I believe that HTTP/2 might be a solution here. HTTP/2 uses a single, multiplexed TCP connection meaning that multiple request/responses can be transmitted at the same time. Requests also do not need a new connection (as is already the case using HTTP/1.1 keep alive). Furthermore, the protocol is binary and not text-based so the parsing overhead is even less. If it's possible that tusd sends a HTTP/2 request to the application directly (i.e. without it getting converted to HTTP/1.1), that might be a possible alternative as the cost of a HTTP/2 request is significantly lower than for HTTP/1.1.

One could introduce something like that in the hook system itself and use a local data store like redis within the post-receive hooks to calculate how many bytes in total were received so far and if we need to notify the underlying system about a received chunk.

I don't think that that's a good solution in conjunction with #269 where multiple hooks depend on the response from the underlying system to determine which exit code they should return. Achieving that is possible but likely too complicated.

Acconut avatar Apr 29 '19 22:04 Acconut

We have a similar situation where a bunch of users are uploading multiple files at the same time. the post-receive hook is kicked off once per second per concurrent upload. For hundreds of large files being uploaded concurrently, that's hundreds hooks per second for quite a long time, just for the post-receive hook alone.

For me I'd prefer a way of rate-limiting the hooks being fired on a per-hook basis by making the hard-coded once per second maximum per file a configurable value controlled by a series of environment variables, or a single environment variable that takes in a comma delimited string. This would allow control per environment, per hook, while still getting the benefit of tracking the upload progress.

SomeDeveloper13 avatar Nov 17 '22 21:11 SomeDeveloper13

@SomeDeveloper13 The tusd v2 branch already contains a flag for controlling the interval of the post-receive hook: https://github.com/tus/tusd/blob/aff352c413066bc5bfa53f6998024b5ef29240f3/cmd/tusd/cli/flags.go#L93

That might be interesting for you.

Acconut avatar Nov 21 '22 13:11 Acconut

I have a similar situation where each user is uploading hundreds of files (sometimes small, sometimes big, depending on what they are doing). An upload collection would greatly reduce the number of requests we get from hooks.

mshunjan avatar Feb 13 '23 18:02 mshunjan

Thank you for your feedback, @mshunjan. We have currently no plans to implement this right now, but we will consider it in the future.

Acconut avatar Mar 26 '23 21:03 Acconut