cloud-provider-azure
cloud-provider-azure copied to clipboard
Lack of timeout in client calls causing 15m53s hang in update of load balancers
What happened:
We have been observing a lock up in the CCM that lasts for exactly 15m53s repeatedly through the OpenShift testing infrastructure. This has been happening approximately since we started using the 1.29 release of the CCM.
Digging into the detail, it happens specifically when updating the load balancer. Through instrumenting (with additional logging) I have managed to reproduce the issue and initially narrowed the hang down to CreateOrUpdateLoadBalancer in the azureclients pacakge.
Instrumenting more narrowed it specifically to createOrUpdateLoadBalancer. The function itself does very little except a PUT call out to the Azure APIs.
It seems that the call is heading out and the future is never getting cancelled, meaning we hang until something else happens, some sort of background reset?
This code is called from CreateOrUpdateLB in the provider package, and if you look at that, you can see that it's passing a context.Background() through.
There should be a timeout on all contexts passed to asynchnronous calls like this, but it appears it is missing here.
What you expected to happen:
All calls to external APIs should timeout in a reasonable timeframe and should not hang indefinitely.
How to reproduce it (as minimally and precisely as possible):
Seems sporadic, but we are seeing it approximately half the time in our E2E testing.
Anything else we need to know?:
Environment:
- Kubernetes version (use
kubectl version): - Cloud provider or hardware configuration:
- OS (e.g:
cat /etc/os-release): - Kernel (e.g.
uname -a): - Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
@JoelSpeed Good catch and thanks for reporting the issue. This is indeed a bug we should fix.
@feiskyer Any idea on how to proceed with this issue? I think personally we should be adding context WithTimeout throughout calls to external APIs, but I'm not sure what the appropriate value is for something like this particular issue, wondering if you have a good suggestion?
@MartinForReal has merged #5652 as first step. For next step, we should add timeouts in context.
@MartinForReal could you work a fix on this?
@JoelSpeed 1min timeout is added to client. Lets' see if it works as expected.