coreos-assembler
coreos-assembler copied to clipboard
platform/azure/instance.go: make CreateOrUpdate timeout aware
In a recent test run in FCOS CI, we noticed function CreateOrUpdate hung and took 30 minutes to terminate. Let's gate this piece of code with a timeout.
/retest
Saw something similar today in https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/kola-azure/142/.
[2022-07-11T21:30:09.575Z] --- PASS: rpmostree.upgrade-rollback (791.77s)
[2022-07-11T21:30:09.575Z] --- PASS: rpmostree.upgrade-rollback/upgrade (205.16s)
[2022-07-11T21:30:09.575Z] --- PASS: rpmostree.upgrade-rollback/rollback (172.31s)
[2022-07-11T21:43:46.179Z] 2022-07-11T21:43:41Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="OSProvisioningTimedOut" Message="OS Provisioning for VM 'kola-b5d64948-0851c4825e' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\r\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \r\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \r\n * If you are deploying more than 20 Virtual Machines concurrently, consider moving your custom image to shared image gallery. Please refer to https://aka.ms/movetosig for the same."
[2022-07-11T21:50:43.948Z] Sending interrupt signal to process
Let's get together soon to get your Azure set up so we can test this PR and get it in.
Ok for posterity the original error we saw was in kola-azure#57
(link now expired) from April 30th:
harness.go:1346: Cluster failed: creating storage account:
storage.AccountsClient#Create: Failure sending request: StatusCode=504 --
Original Error: Long running operation terminated with status 'Failed':
Code="GatewayTimeout" Message="The gateway did not receive a response from Microsoft.Storage' within the specified time period."
Just saw another long running operation today (kola-azure#50) that took a while to time out and caused the whole job to timeout at the jenkins level:
kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate:
Failure sending request: StatusCode=200 --
Original Error: Long running operation terminated with status
'Failed': Code="OSProvisioningTimedOut" Message="OS Provisioning for VM 'kola-48435b97-33c94b1521' did not finish in the allotted time.
The VM may still finish provisioning successfully. Please check provisioning state later.
Also, make sure the image has been properly prepared (generalized).\r\n * Instructions
for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/
\r\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/
\r\n * If you are deploying more than 20 Virtual Machines concurrently, consider moving your custom
image to shared image gallery. Please refer to https://aka.ms/movetosig for the same."
I'll try to resolve remaining issues with this PR and test it so we can get it merged.
and in kola-azure#52 we saw some similar issues that caused the job to eventually time out at the jenkins level:
[2022-09-06T21:54:03.396Z] ⚠️ Skipping kola test pattern "fcos.internet":
[2022-09-06T21:54:03.396Z] 👉 https://github.com/coreos/coreos-assembler/pull/1478
[2022-09-06T21:54:03.396Z] ⚠️ Skipping kola test pattern "podman.workflow":
[2022-09-06T21:54:03.396Z] 👉 https://github.com/coreos/coreos-assembler/pull/1478
[2022-09-06T21:54:03.396Z] === RUN coreos.selinux.boolean
[2022-09-06T21:54:03.396Z] === RUN coreos.selinux.enforce
[2022-09-06T21:54:03.396Z] === RUN ext.config.gshadow
[2022-09-06T21:54:03.396Z] === RUN fcos.metadata.azure
[2022-09-06T21:54:03.396Z] === RUN fcos.filesystem
[2022-09-06T21:54:03.396Z] === RUN fcos.users.shells
[2022-09-06T21:54:03.396Z] === RUN basic
[2022-09-06T21:54:03.396Z] === RUN rootfs.uuid
[2022-09-06T21:54:03.396Z] === RUN coreos.tls.fetch-urls
[2022-09-06T21:54:03.396Z] === RUN ostree.remote
[2022-09-06T21:54:03.396Z] === RUN rhcos.selinux.boolean.persist
[2022-09-06T21:54:03.396Z] === RUN ext.config.firewall.iptables-legacy
[2022-09-06T21:54:03.396Z] === RUN ostree.hotfix
[2022-09-06T21:54:03.396Z] === RUN rpmostree.install-uninstall
[2022-09-06T21:54:03.396Z] === RUN ext.config.swap.zram-generator
[2022-09-06T21:54:03.396Z] === RUN fcos.ignition.misc.empty
[2022-09-06T21:54:03.396Z] === RUN coreos.ignition.groups
[2022-09-06T21:54:03.396Z] === RUN coreos.ignition.security.tls
[2022-09-06T21:54:03.396Z] === RUN ext.config.kdump.crash
[2022-09-06T21:54:03.396Z] === RUN coreos.ignition.resource.local
[2022-09-06T21:54:03.396Z] === RUN fcos.network.listeners
[2022-09-06T21:54:03.396Z] === RUN coreos.ignition.v2.users
[2022-09-06T21:54:03.396Z] === RUN podman.base
[2022-09-06T21:54:03.396Z] === RUN podman.network-single
[2022-09-06T21:54:03.396Z] === RUN coreos.ignition.once
[2022-09-06T21:54:03.396Z] === RUN fcos.ignition.v3.noop
[2022-09-06T21:54:03.396Z] === RUN ext.config.ignition.resource.remote
[2022-09-06T21:54:03.396Z] === RUN coreos.auth.verify
[2022-09-06T21:54:03.396Z] === RUN ext.config.ignition.remote
[2022-09-06T21:54:03.396Z] === RUN coreos.ignition.instantiated.enable-unit
[2022-09-06T21:54:03.396Z] === RUN rpmostree.status
[2022-09-06T21:54:03.396Z] === RUN ostree.unlock
[2022-09-06T21:54:03.396Z] === RUN non-exclusive-test-bucket-0
[2022-09-06T21:54:03.396Z] === RUN coreos.ignition.journald-log
[2022-09-06T21:54:03.396Z] === RUN coreos.ignition.ssh.key
[2022-09-06T21:54:03.396Z] === RUN ext.config.ignition.kargs
[2022-09-06T21:54:03.396Z] === RUN rpmostree.upgrade-rollback
[2022-09-06T22:18:10.905Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/8f40c57d-8041-4db8-ae81-dbf5e70fa8e6?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.905Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/77d91752-7c4d-4e9b-8fb9-4ea07b5ee94c?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.905Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/1160eae0-552d-4ac7-b593-033407cf4ee1?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.905Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/a8a59671-41c0-4901-b938-bf8c629766f9?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/1da6017f-53f7-4b88-8dcf-e31cd77df8da?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/6d3101f4-5bb5-4468-ba93-123fbe991c9f?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/174b7d35-799d-4ae4-9908-bbcd79021892?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/0b9f0901-68ec-4899-b90a-5c9e3a320c8b?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/5197a407-e905-4479-ace6-e4b59609a17d?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/fc2dcf32-0c1b-4f61-9f00-e00bc47d0708?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
So you can see it took the API calls about 24 minutes to time out.
I verified basic functionality here. I didn't try to artificially reproduce a timeout failure. Hoping we can ship this anyway.
Tested with a super short timeout of 10s
and it seems to be working (I think):
=== RUN non-exclusive-test-bucket-0
2022-09-07T03:00:44Z kola: retryloop: failed to bring up machines: creating instance failed: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest: AfterDelay canceled before full delay
2022-09-07T03:00:57Z kola: retryloop: failed to bring up machines: creating instance failed: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest: AfterDelay canceled before full delay