Use a more reasonable timeout for HttpClient operations for Container Registry interactions
Tentative minimal fix for https://github.com/dotnet/sdk-container-builds/issues/543
Description
Users on saturated, slow, or intermittent network connections could hit timeouts when pushing large container image layers, because we relied on HttpClient's default timeout of 100 seconds. This is a simple, targeted fix for 8.0.2xx that extends that timeout while we work on a longer-term fix for 8.0.3xx and beyond. We've estimated layer sizes and chosen a reasonable upper limit that allows large layers to be uploaded without resorting to an infinite timeout.
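For context, the shape of the change is roughly the following (a minimal sketch, not the actual diff in DefaultRegistryAPI.cs; the factory name and the 30-minute value are placeholders):

```csharp
using System;
using System.Net.Http;

internal static class RegistryHttpClientFactory
{
    // Sketch only: build the HttpClient used for registry calls with a timeout
    // large enough for slow uploads of big layers, instead of relying on the
    // 100-second HttpClient default.
    public static HttpClient Create() =>
        new HttpClient(new SocketsHttpHandler())
        {
            // Placeholder value; the discussion below lands on roughly 30 minutes as a ceiling.
            Timeout = TimeSpan.FromMinutes(30)
        };
}
```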
Customer Impact
Customers uploading large layers (for example, a self-contained, untrimmed application) would hit unavoidable upload errors if the upload took over 100s.
Regression?
No - we've hit registry upload problems like this in a few different scenarios before, and we've done per-registry probing for compat quirks, chunking, etc. to mitigate them. This is one more such mitigation.
Risk
Low - this only impacts container publish and is a very targeted fix.
I've been able to use this to push to ACR, but the local registry and ECR are returning socket hang-ups. This might be due to my HTTP interception.
@stephentoub @MihaZupan Can you review? Are there any other timeouts we should set to infinite?
> Are there any other timeouts we should set to infinite?
I think that's the only one that'd be relevant here. `SocketsHttpHandler` has a `ConnectTimeout` that's relevant (in general, not to long uploads), but it defaults to infinite.
Thanks for the check! I'm trying to do some due diligence with pushing to a variety of registries before I mark this as ready and do the servicing template.
Infinite is a long time; I'm afraid users may start running into weird "hangs" when hitting networking issues.
If you want to keep the change simple, I would at least also set the `SocketsHttpHandler.ConnectTimeout` (e.g. to 30 seconds) to avoid some of those scenarios.
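Concretely, something like this sketch (values are illustrative, not the exact change):

```csharp
using System;
using System.Net.Http;

// Sketch of the suggestion above: keep connection establishment bounded even if
// the overall request timeout is relaxed.
var handler = new SocketsHttpHandler
{
    // Fail fast if we can't even reach the registry.
    ConnectTimeout = TimeSpan.FromSeconds(30)
};

var client = new HttpClient(handler)
{
    // Relaxed overall timeout so a slow-but-progressing layer upload isn't killed.
    Timeout = System.Threading.Timeout.InfiniteTimeSpan
};
```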
When it comes to low-bandwidth transfers, `Timeout` can be problematic when either
- a) uploading large contents, or
- b) buffering large downloads.

If you're using `ResponseHeadersRead` when downloading, `Timeout` won't apply to the content transfer, so it's not an issue there.
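For reference, a minimal sketch of the download side under that assumption (the URL is made up):

```csharp
using System.Net.Http;

var client = new HttpClient(); // default 100-second Timeout

// With ResponseHeadersRead, Timeout covers getting the response headers; per the
// note above it won't cover streaming the body, so pass a CancellationToken if you
// want a bound on that part. The URL below is illustrative.
using var response = await client.GetAsync(
    "https://registry.example.com/v2/app/blobs/sha256:deadbeef",
    HttpCompletionOption.ResponseHeadersRead);

await using var body = await response.Content.ReadAsStreamAsync();
// ... stream `body` to disk here.
```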
From the issues, it sounds like you're hitting problems with slow uploads. If you're comfortable with making a bigger change here, it may be less risky to add logic to detect completely stalled uploads instead of just not having a timeout at all. That is, you probably still want to time out the request if the server never responds at all, or stops accepting data mid-transfer.
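For 8.0.3xx, a rough sketch of what "detect a completely stalled upload" could look like; everything here is hypothetical (the type names, the 5-minute idle window, and the assumption that layers are sent via StreamContent), not existing SDK code:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: wrap the layer stream so the request is canceled when the
// server stops pulling data (a stalled upload), rather than after a fixed
// wall-clock timeout.
internal sealed class IdleTimeoutStream : Stream
{
    private readonly Stream _inner;
    private readonly CancellationTokenSource _stallCts;
    private readonly TimeSpan _idleWindow;

    public IdleTimeoutStream(Stream inner, TimeSpan idleWindow, CancellationTokenSource stallCts)
    {
        _inner = inner;
        _idleWindow = idleWindow;
        _stallCts = stallCts;
        _stallCts.CancelAfter(idleWindow); // arm the stall timer before the first read
    }

    public override async ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        int read = await _inner.ReadAsync(buffer, cancellationToken).ConfigureAwait(false);
        _stallCts.CancelAfter(_idleWindow); // progress was made, push the deadline out
        return read;
    }

    public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken) =>
        ReadAsync(buffer.AsMemory(offset, count), cancellationToken).AsTask();

    // Pass-through plumbing required by Stream.
    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => _inner.CanSeek;
    public override bool CanWrite => false;
    public override long Length => _inner.Length;
    public override long Position { get => _inner.Position; set => _inner.Position = value; }
    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) => _inner.Read(buffer, offset, count);
    public override long Seek(long offset, SeekOrigin origin) => _inner.Seek(offset, origin);
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

internal static class StalledUploadExample
{
    // Hypothetical usage: abort the PUT if no data is read from the layer for 5 minutes.
    public static async Task<HttpResponseMessage> UploadLayerAsync(HttpClient client, Uri uploadUri, Stream layer)
    {
        var stallCts = new CancellationTokenSource();
        var content = new StreamContent(new IdleTimeoutStream(layer, TimeSpan.FromMinutes(5), stallCts));
        return await client.PutAsync(uploadUri, content, stallCts.Token);
    }
}
```

The idea being that the idle timer only fires when the handler stops pulling data from the layer stream, so a slow but steadily progressing upload is never interrupted.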
I can definitely add a separate Sockets timeout! I think we're getting enough feedback that I'd like to get this as a servicing fix, so minimal changes, but I'd be interested in a more comprehensive, responsive treatment for 8.0.300.
I'm worried about setting it to infinite. But a much higher timeout is warranted. I think the way to calibrate it is to consider a very slow connection in our context (say 1Mbit/s) and think how long it would take to upload a .NET container image with a bunch of dependencies included - then add some buffer.
The full solution would then allow this to be overridden.
The largest images we would make would be
- a framework-dependent, Debian-based, non-chiseled application, and
- a self-contained, untrimmed, Debian-based, non-chiseled application
The self-contained example has the largest individual layer, which for the Orleans Aspire sample app comes out to more than 110 MB, because it contains:
- .NET Runtime
- ASP.NET Runtime
- App and all dependencies
Let's estimate and say that non-trimmed, self-contained image layers could be roughly 150-200 MB on the extreme end.
200 MB @ 1Mbps is almost 28 minutes (27:57). Should we call it a half-hour maximum upload timeout then?
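Sanity-checking that arithmetic (MiB vs. Mbit is where this usually goes wrong):

```csharp
using System;

// 200 MiB at 1 Mbit/s: convert bytes to bits, then divide by the line rate.
double seconds = 200d * 1024 * 1024 * 8 / 1_000_000; // ≈ 1677.7 s
Console.WriteLine(TimeSpan.FromSeconds(seconds));     // ≈ 00:27:57.7, i.e. the 27:57 above
```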
1 Mbps is a good ballpark because by default we upload image layers in parallel - even in a self-contained image scenario we could be uploading 3-4 layers at once, so each individual layer sees less effective bandwidth.
It may be worth trying out whether chunking is a good way to solve the problem.
It can be enabled on the released SDK through environment variables: https://github.com/Azure/azure-dev/discussions/3212#discussioncomment-8553439.
It is also interesting to look at how "battle-tested" clients handle this.
I've been looking at regclient and skopeo throughout this process, but to be honest there's so much delegation and pointer-chasing that it's hard for me to develop a concrete description of how a request is executed. And at some point you get down to the raw golang net.Transport, which can have its own defaults, and I get lost entirely.
/backport to release/8.0.1xx
Started backporting to release/8.0.1xx: https://github.com/dotnet/sdk/actions/runs/9311586229
@baronfel backporting to release/8.0.1xx failed, the patch most likely resulted in conflicts:
```
$ git am --3way --ignore-whitespace --keep-non-patch changes.patch
Applying: use an infinite timeout for container registry operations
Applying: add a socket timeout
Using index info to reconstruct a base tree...
M src/Containers/Microsoft.NET.Build.Containers/Registry/DefaultRegistryAPI.cs
Falling back to patching base and 3-way merge...
Auto-merging src/Containers/Microsoft.NET.Build.Containers/Registry/DefaultRegistryAPI.cs
CONFLICT (content): Merge conflict in src/Containers/Microsoft.NET.Build.Containers/Registry/DefaultRegistryAPI.cs
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0002 add a socket timeout
Error: The process '/usr/bin/git' failed with exit code 128
```
Please backport manually!
@baronfel an error occurred while backporting to release/8.0.1xx, please check the run log for details!
Error: git am failed, most likely due to a merge conflict.