diego-release

Add delay logic between Rep's retries to download droplets

Open • vlast3k opened this issue 1 year ago • 1 comment

Summary

When Rep downloads droplets from a blobstore, the hyperscaler may in some cases apply throttling. For example, Azure has a limit of ~100-150 Gbps, and as soon as this threshold is reached some HTTP requests are terminated with "503 ServerBusy" so that the maximum bandwidth is not exceeded. For some reason Azure does not simply reduce the download speed of all connections, but terminates some of them instead. It also responds with 503 rather than 429, which is the standard status code for throttling.

We tried to work around this by decreasing diego.executor.max_concurrent_downloads to 2, but there was no improvement (we are updating ~45 cells in parallel). For now we will decrease the max_in_flight property, but this is only a temporary solution and will increase the update time.

This is why we think it would be good to change the code that handles the retries in case of failure. It seems to be here: https://github.com/cloudfoundry/cacheddownloader/blob/master/downloader.go#L213-L215 The change would add some delay in case of 429 (possibly also processing the "Retry-After" header) and 503 ServerBusy (specifically for Azure).

We should discuss to what extent this should be configurable:

  • Plain on/off switch to enable/disable the functionality
  • or Configurable delay, even some randomness
  • or just add some preset delay of e.g. 5 seconds in case of those errors appearing

Diego repo

https://github.com/cloudfoundry/cacheddownloader

Describe alternatives you've considered (optional)

  • decrease diego.executor.max_concurrent_downloads from 5 to 2 - for some reason this did not help. Our assumption is that Azure sums up the downloaded data over a certain period of time, so the limit is reached regardless of the number of threads
  • decrease max_in_flight - this will be our current workaround, though this will increase the update time
  • use bigger VMs for the diego cells, so that we update less of them in parallel - this is something that we are currently working on, but is also a temporary solution


vlast3k Jul 14 '22 07:07

Hi @jrussett, as proposed, I created the issue. We will prepare the PR in the coming weeks, but in parallel I wanted to align on some technical details. As outlined above:

We should discuss to what extent this should be configurable:

  • Plain on/off switch to enable/disable the functionality and no option to modify the behaviour
  • or Configurable delay, even some randomness
  • or just add some preset delay of e.g. 5 seconds in case of those errors appearing

Regards, Vladimir

vlast3k Jul 14 '22 07:07