sdk icon indicating copy to clipboard operation
sdk copied to clipboard

Use a more reasonable timeout for HttpClient operations for Container Registry interactions

Open baronfel opened this issue 1 year ago • 10 comments
trafficstars

Tentative minimal fix for https://github.com/dotnet/sdk-container-builds/issues/543

Description

Users with saturated, low, or intermittent network speeds sometimes would hit timeouts when pushing large container image layers due to our use of the default HttpClient timeout of 100 seconds. This fix is a simple, targeted fix for 8.0.2xx that extends the timeouts while we work on a longer term fix for 8.0.3xx and beyond. We've estimated layer sizes and determined a reasonable top limit that would allow large layers to be uploaded without leading to infinite timeouts.

Customer Impact

Customers uploading large layers (for example, a self-contained, untrimmed application) would hit unavoidable upload errors if the upload took over 100s.

Regression?

No - we've had this in a few different scenarios before, and we've done per-registry probing for compat quirks, chunking, etc to mitigate. This is one more such mitigation.

Risk

Low - this only impacts container publish and is a very targeted fix.

baronfel avatar Feb 22 '24 01:02 baronfel

I've been able to use this to push to ACR, but local registry and ECR are returning socket hang-ups. This might be due to my http interception.

baronfel avatar Feb 22 '24 04:02 baronfel

@stephentoub @MihaZupan Can you review? Are there any other timeouts we should set to infinite?

davidfowl avatar Feb 22 '24 15:02 davidfowl

Are there any other timeouts we should set to infinite?

I think that's the only one that'd be relevant here. SocketsHttpHandler has a ConnectTimeout that's relevant (in general, not to long uploads), but it defaults to infinite.

stephentoub avatar Feb 22 '24 15:02 stephentoub

Thanks for the check! I'm trying to do some due diligence with pushing to a variety of registries before I mark this as ready and do the servicing template.

baronfel avatar Feb 22 '24 16:02 baronfel

Infinite is a long time, I'm afraid users may start running into weird "hangs" when hitting networking issues. If you want to keep the change simple, I would at least also set the SocketsHttpHandler.ConnectTimeout (e.g. to 30 seconds) to avoid some of those scenarios.

When it comes to low bandwidth transfers, Timeout can be problematic when either

  • a) uploading large contents, or
  • b) when buffering large downloads. If you're using ResponseHeadersRead when downloading, Timeout won't apply to the content transfer so it's not an issue there.

From the issues, it sounds like you're hitting problems with slow uploads. If you're comfortable with making a bigger change here, it may be less risky to add logic to detect completely stalled uploads instead of just not having a timeout at all. That is, you probably still want to time out the request if the server never responds at all, or stops accepting data mid-transfer.

MihaZupan avatar Feb 22 '24 17:02 MihaZupan

I can definitely add a separate Sockets timeout! I think we're getting enough feedback that I'd like to get this as a servicing fix, so minimal changes, but I'd be interested in a more comprehensive, responsive treatment for 8.0.300.

baronfel avatar Feb 22 '24 23:02 baronfel

I'm worried about setting it to infinite. But a much higher timeout is warranted. I think the way to calibrate it is to consider a very slow connection in our context (say 1Mbit/s) and think how long it would take to upload a .NET container image with a bunch of dependencies included - then add some buffer.

The full solution would then allow this to be overridden.

mitchdenny avatar Feb 24 '24 01:02 mitchdenny

The largest images we would make would be

  • a framework-dependent, debian-based, non-chiseled application, and
  • a self-contained, untrimmed, debian-based, non-chiseled application

The self-contained example has the largest individual layer, which for the Orleans Aspire sample app comes out to more than 110MB, because it contains:

  • .NET Runtime
  • ASP.NET Runtime
  • App and all dependencies

Let's estimate and say that non-trimmed, self-contained image layers could be roughly 150MB-200MB on the extreme end.

200 MB @ 1Mbps is almost 28 minutes (27:57). Should we call it a half-hour maximum upload timeout then?

1 Mbps is a good ballpark because by default we upload image layers in parallel - even in a self-contained image scenario we could be uploading 3-4 layers at once, leading to a slower time per individual layer.

baronfel avatar Feb 24 '24 18:02 baronfel

It may be worth trying out if chunking isn't a good way to solve the problem.

It can be enabled on the released SDK through envvars: https://github.com/Azure/azure-dev/discussions/3212#discussioncomment-8553439.

It is also interesting to look at how "battle-tested" clients handle this.

tmds avatar Feb 25 '24 06:02 tmds

I've been looking at regclient and skopeo throughout this process, but to be honest there's so much delegation and pointer-chasing that it's hard for me to develop a concrete description of how a request is executed. And at some point you get to the raw golang net.Transport which can have it's own defaults and I get lost entirely.

baronfel avatar Mar 02 '24 20:03 baronfel

/backport to release/8.0.1xx

baronfel avatar May 31 '24 01:05 baronfel

Started backporting to release/8.0.1xx: https://github.com/dotnet/sdk/actions/runs/9311586229

github-actions[bot] avatar May 31 '24 01:05 github-actions[bot]

@baronfel backporting to release/8.0.1xx failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: use an infinite timeout for container registry operations
Applying: add a socket timeout
Using index info to reconstruct a base tree...
M	src/Containers/Microsoft.NET.Build.Containers/Registry/DefaultRegistryAPI.cs
Falling back to patching base and 3-way merge...
Auto-merging src/Containers/Microsoft.NET.Build.Containers/Registry/DefaultRegistryAPI.cs
CONFLICT (content): Merge conflict in src/Containers/Microsoft.NET.Build.Containers/Registry/DefaultRegistryAPI.cs
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0002 add a socket timeout
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

github-actions[bot] avatar May 31 '24 01:05 github-actions[bot]

@baronfel an error occurred while backporting to release/8.0.1xx, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

github-actions[bot] avatar May 31 '24 01:05 github-actions[bot]