
`cloud-init status --long` reports `degraded done` and exits 2 on retries when accessing the Azure IMDS

Open renanrodrigo opened this issue 9 months ago • 2 comments

Bug report

In some very particular cases™ on Azure, I found cloud-init in a `degraded done` state, and that broke the Pro Client CI. These cases happen when launching:

  • Ubuntu Pro Noble (the image used by pycloudlib, not the -gen1 suffixed one from the marketplace);
  • on a Standard_B2s VM type;
  • when the instance happens not to have been pre-provisioned.

Even then, it only happens most of the time. Sometimes it is just OK.

This is pretty much a corner case, yes, but it affects us in the sense that we run many tests on our CI, one instance per test, and the results are now flaky because we expect cloud-init to succeed and to be in a good state.

This happens because cloud-init cannot access `metadata/reprovisiondata?api-version=2019-06-01` on the first try (the request gets a 404), so it retries at least once. That failed first attempt is logged as a WARNING, so even though the retry succeeds, the exit code of `cloud-init status` becomes nonzero.

The thing is that it does manage to access the IMDS a little later, when the retry happens, and everything works fine cloud-init-wise (and even instance-wise).

Maybe this is not an issue. But if it is, there are two interesting things to explore imho:

  1. Just food for thought: It seems the Azure IMDS is taking a little longer, but not much, to respond on this particular instance type. Maybe that happens with other instance types? Or is there something wrong with the image? What may be causing that? Is that fixable or workaroundable? What are the downsides if this endpoint is actually unreachable?

  2. From a cloud-init perspective, should retries that end up succeeding degrade the status? Should those retries be softer warnings, or even DEBUG entries?
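To make point 2 concrete, here is a minimal sketch of what "softer" retries could look like. This is not cloud-init's actual code: `poll_with_soft_retries`, its parameters, and the logger name are all hypothetical, made up for illustration. The idea is that intermediate failures go to DEBUG, and a WARNING is only emitted if every attempt fails.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
LOG = logging.getLogger("retry_sketch")  # hypothetical logger name


def poll_with_soft_retries(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_attempts is exhausted.

    Intermediate failures are logged at DEBUG. A WARNING is emitted only
    when the final attempt fails, right before the exception is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
        except Exception as exc:  # sketch only: real code should narrow this
            if attempt == max_attempts:
                LOG.warning("Polling failed after %d attempts: %r", attempt, exc)
                raise
            LOG.debug("Polling failed attempt %d with exception: %r", attempt, exc)
            sleep(delay_s)
        else:
            if attempt > 1:
                LOG.debug("Polling succeeded after %d attempts", attempt)
            return result
    raise ValueError("max_attempts must be at least 1")
```

With something like this, the 404-then-200 sequence from the logs below would leave only DEBUG entries, and the status would stay clean unless the endpoint never answers.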

Steps to reproduce the problem

Launch many Ubuntu Pro Noble instances on Azure, using the Standard_B2s machine type. I did it using pycloudlib. The non-pre-provisioned ones (check with `uptime` or `cloud-init analyze show`) will present this issue most of the time. I'm far from an expert, and sorry if there are docs about this that I didn't find, but if there is a way of requesting Azure to launch a proper fresh instance every time, this may be easier to reproduce.

Environment details

  • Cloud-init version: 24.1.3-0ubuntu3
  • Operating System Distribution: Ubuntu 24.04 LTS
  • Cloud provider, platform or installer type: Azure Standard_B2s VM
  • Cloud image (publisher:product:plan): Canonical:ubuntu-24_04-lts:ubuntu-pro:latest

cloud-init logs

2024-05-23 18:48:04,055 - azure.py[WARNING]: Polling IMDS failed attempt 1 with exception: UrlError('404 Client Error: Not Found for url: http://169.254.169.254/metadata/reprovisiondata?api-version=2019-06-01')
2024-05-23 18:48:05,064 - url_helper.py[DEBUG]: Read from http://169.254.169.254/metadata/reprovisiondata?api-version=2019-06-01 (200, 1920b) after 2 attempts

leads to

$ cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 23 May 2024 18:48:20 +0000
detail: DataSourceAzure [seed=/dev/sr0]
errors: []
recoverable_errors:
WARNING:
	- Polling IMDS failed attempt 1 with exception: UrlError('404 Client Error: Not Found for url: http://169.254.169.254/metadata/reprovisiondata?api-version=2019-06-01')

$ echo $?
2
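On the CI side, a possible stopgap is to accept exit code 2 only when every recoverable error is a known-benign one. The sketch below parses the `--long` text output shown above. The helper names and the allow-list prefix are my own, and it assumes exit code 2 means "recoverable errors only", as in this run.

```python
import subprocess
from typing import List, Tuple


def parse_recoverable_warnings(long_output: str) -> List[str]:
    """Collect the '- ...' entries listed under 'recoverable_errors:'."""
    warnings = []
    in_section = False
    for line in long_output.splitlines():
        stripped = line.strip()
        if stripped == "recoverable_errors:":
            in_section = True
            continue
        if in_section and stripped.startswith("- "):
            warnings.append(stripped[2:])
        elif in_section and stripped and not stripped.endswith(":"):
            # A 'key: value' line means the section is over; level headers
            # such as 'WARNING:' end with ':' and keep us inside it.
            in_section = False
    return warnings


def is_acceptable(
    returncode: int, long_output: str, allowed_prefixes: Tuple[str, ...]
) -> bool:
    """True for exit 0, or exit 2 where all warnings match the allow-list."""
    if returncode == 0:
        return True
    if returncode != 2:
        return False
    warnings = parse_recoverable_warnings(long_output)
    return bool(warnings) and all(w.startswith(allowed_prefixes) for w in warnings)


def check_cloud_init(allowed_prefixes: Tuple[str, ...]) -> bool:
    """Run `cloud-init status --long` on the instance and classify the result."""
    proc = subprocess.run(
        ["cloud-init", "status", "--long"], capture_output=True, text=True
    )
    return is_acceptable(proc.returncode, proc.stdout, allowed_prefixes)
```

For the run above, `check_cloud_init(("Polling IMDS failed attempt",))` would treat the instance as healthy, while any other warning or a different nonzero exit code would still fail the test.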

renanrodrigo, May 23 '24 19:05