cloud-init
feat(disk_setup): add timeout
Proposed Commit Message
In a cloud environment, disks sometimes attach while cloud-init is already running and get missed. This adds a configurable timeout to wait for those disks to appear.
Fixes GH-3386
LP: #1832645
Additional Context
This is https://github.com/canonical/cloud-init/pull/710 resurrected. Fixes #3386.
Test Steps
Tested with the following cloud-config file on Azure:
disk_setup:
  /dev/disk/by-lun/10:
    layout: false
    timeout: 60

fs_setup:
  - filesystem: ext4
    partition: auto
    device: /dev/disk/by-lun/10
    label: jenkins

mounts:
  - ["/dev/disk/by-label/jenkins", "/var/lib/jenkins"]
I used an `azurerm_virtual_machine_data_disk_attachment` Terraform resource, which attaches the disk halfway through boot due to https://github.com/hashicorp/terraform-provider-azurerm/issues/6117, but this should also work by setting a reasonably large timeout and manually attaching the disk (at that LUN).
Checklist
- [x] My code follows the process laid out in the documentation
- [ ] I have updated or added any unit tests accordingly
- [x] I have updated or added any documentation accordingly
Merge type
- [x] Squash merge using "Proposed Commit Message"
- [ ] Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)
Ugh. This problem demonstrates that cloud-init's current code in this area is a mess.
The proposed code might work here, but I can't say that I like it:
- timeouts are unreliable and ugly
- this makes cloud-init block the rest of its runtime (and all of boot) until the disk is ready
Plus, if you're using a systemd-based instance, you should be able to accomplish this already using systemd's built-in mount point generation from fstab. Something like this should do it:

mounts:
  - ["/dev/disk/by-label/jenkins", "/var/lib/jenkins", "ext4", "defaults,nofail,x-systemd.makefs"]
Since systemd creates mount units from fstab, and mount units get automatically ordered, I think the above should "just work".
At least, that would work if cloud-init's `cc_mounts` module didn't assume that mounts need to be available while `cloud-init.service` is running in order for them to get written into /etc/fstab. I really don't think that cloud-init's `cc_mounts.py::sanitize_devname()` should be booting out unavailable devices. If systemd can do the job correctly (and cloud-init can't), cloud-init should step out of the way and allow the user to write an fstab entry that gets the job done.
Thanks for proposing this @flokli. This might be an interim solution that we can take, but I think a more elegant solution is desirable long term.
Using `systemd-fstab-generator` unfortunately won't work at all, no - it's the approach I tried before.

The fstab line creates a .mount unit which depends on a .device unit that doesn't exist while systemd plans the transaction, because there's no corresponding /dev/disk/by-lun/10 while the disk is not yet attached - so all units with `RequiresMountsFor=` (either explicit, or implicit via `StateDirectory=`) will fail due to a missing dependency.
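To make the "implicit via `StateDirectory=`" case concrete, here is a minimal sketch (not part of this PR; the unit name and drop-in path are hypothetical), written as a cloud-config `write_files` stanza, of a service that would be pulled into that failing transaction:

```yaml
# Hypothetical example: a jenkins.service drop-in that keeps its state under
# /var/lib/jenkins. Per the explanation above, StateDirectory= gives the unit
# an implicit RequiresMountsFor= on /var/lib/jenkins, so on first boot it ends
# up in the same transaction as the not-yet-existing .device unit.
write_files:
  - path: /etc/systemd/system/jenkins.service.d/state.conf
    content: |
      [Service]
      StateDirectory=jenkins
```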
This is not an issue on the second boot, or if I re-attempt to reach multi-user.target (or just restart the .service unit using the data disk), but the only way to get this working straight from the first boot was to do the mounting via cloud-init, not systemd.
Let me know if you'd be fine with accepting this; I'm happy to then take another pass and fix the linters.
> The fstab line creates a .mount unit which depends on a .device unit that doesn't exist while systemd plans the transaction, because there's no corresponding /dev/disk/by-lun/10 while the disk is not yet attached - so all units with `RequiresMountsFor=` (either explicit, or implicit via `StateDirectory=`) will fail due to a missing dependency.
You're right that depending only on the generator output wouldn't work for first boot. I didn't consider the initial transaction failure due to the missing device unit.

However, the .device unit should later be generated by systemd-udev [1], which will cause a subsequent transaction re-calculation to succeed. I think that this is why restarting the .service unit, a second boot, and re-attempting multi-user.target all worked. I think that `systemctl daemon-reload` should also do that at any time after the .device unit is created.
An alternative that could take advantage of better systemd ordering would be to use `ENV{SYSTEMD_WANTS}+="your-fs.mount"` [2] in a udev rule that matches the specified block device. This seems like something that `systemd-fstab-generator` could feasibly do: a special mount option with a name like `x-systemd.first-boot` or `x-systemd.new-device` that generates a udev rule for the device, which activates the .mount when it becomes available - the rule could be disabled / deleted if the .device unit already exists.
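For illustration only (this is not an existing systemd feature): a hand-written rule along these lines is roughly what such a generator could emit. It is a sketch delivered via a cloud-config `write_files` stanza; the by-lun path and mount unit name are taken from the test config above, and it assumes a matching nofail fstab entry exists so that `var-lib-jenkins.mount` is generated.

```yaml
write_files:
  - path: /etc/udev/rules.d/99-jenkins-data-disk.rules
    permissions: "0644"
    content: |
      # When the block device behind /dev/disk/by-lun/10 shows up, ask systemd
      # to pull in the mount unit for /var/lib/jenkins (name per systemd-escape).
      ACTION=="add", SUBSYSTEM=="block", SYMLINK=="disk/by-lun/10", ENV{SYSTEMD_WANTS}+="var-lib-jenkins.mount"
```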
> This is not an issue on the second boot, or if I re-attempt to reach multi-user.target (or just restart the .service unit using the data disk), but the only way to get this working straight from the first boot was to do the mounting via cloud-init, not systemd.
[1] systemd.device:
> systemd will dynamically create device units for all kernel devices that are marked with the "systemd" udev tag (by default all block and network devices, and a few others).

[2] systemd.device, `SYSTEMD_WANTS=`:
> These properties may be used to activate arbitrary units when a specific device becomes available.
Your proposal should be easy to review, so for now if you want to move forward with this approach, please make sure to add the new key to the jsonschema.
I added the field to the jsonschema, addressed the linter warning, and renamed `need` to `missing` to be more consistent with the other users of this method. PTAL.
@flokli I don't see your username `flokli` connected to any users that have signed the CLA. Could you please do that so that we can accept this contribution to cloud-init?
Sorry, please check again, just did it.
@flokli Sorry about this, one more thing. I just noticed that commit 9b2e3dc907dc06d0a2abdaae has @nibalizer as the author, but @flokli's DCO. This project requires CLA, not DCO. Whose code is this, @flokli's or @nibalizer's? If @nibalizer, I need you to sign the CLA please.

Apologies for the hold-up on this. I wish this wasn't part of the process, but unfortunately it is.
Correct, the changes were resurrected from @nibalizer's PR; that's mentioned in the PR description:

> This is https://github.com/canonical/cloud-init/pull/710 resurrected.
> This project requires CLA, not DCO. Whose code is this, @flokli's or @nibalizer's? If @nibalizer, I need you to sign the DCO please.
I assume you need CLA, not DCO.

I opened this PR while at IBM. At the time, I don't think IBM had signed a CCLA with Canonical. I no longer work there. Is it possible we could get this in under "trivial commit"? Or perhaps all this code is authored by @flokli, who can sign the CLA.
cc @jjasghar @bradtopol
I realized you actually can change the timeout of the .device unit by using the `x-systemd.device-timeout` mount option in your fstab line, or the `DefaultDeviceTimeoutSec=` setting in systemd's system.conf, so I might not need this workaround after all, at least for my specific use case.
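In cloud-config terms that would be something like the mounts line below (a sketch based on the test config above; the 300s value is just an arbitrary example):

```yaml
mounts:
  - ["/dev/disk/by-label/jenkins", "/var/lib/jenkins", "ext4",
     "defaults,nofail,x-systemd.makefs,x-systemd.device-timeout=300s"]
```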
> I realized you actually can change the timeout of the .device unit by using the `x-systemd.device-timeout` mount option in your fstab line, or the `DefaultDeviceTimeoutSec=` setting in systemd's system.conf, so I might not need this workaround after all, at least for my specific use case.
@flokli Let me know if that works. It looks like the existence of the .device unit is required for this option (much like `x-systemd.makefs`, which I pointed out above), so I'm guessing that this won't help, since the transaction will still fail to include the mount unit because the device unit still won't exist on first boot.
> I assume you need CLA, not DCO.
Yes, that's what I meant.
> I opened this PR while at IBM. At the time, I don't think IBM had signed a CCLA with Canonical. I no longer work there. Is it possible we could get this in under "trivial commit"? Or perhaps all this code is authored by @flokli, who can sign the CLA.
I'll check internally, but I'm not sure that there is much that we can do unless we can get a CLA signature from IBM.
> cc @jjasghar @bradtopol
Any insight would be appreciated.
> > I realized you actually can change the timeout of the .device unit by using the `x-systemd.device-timeout` mount option in your fstab line, or the `DefaultDeviceTimeoutSec=` setting in systemd's system.conf, so I might not need this workaround after all, at least for my specific use case.
>
> @flokli Let me know if that works. It looks like the existence of the .device unit is required for this option (much like `x-systemd.makefs`, which I pointed out above), so I'm guessing that this won't help, since the transaction will still fail to include the mount unit because the device unit still won't exist on first boot.
You need to follow one more reference ;-) `is_device_path` just matches the passed string against some known-good path strings: https://github.com/systemd/systemd/blob/1fdab6c3069ab945259d70e22a29e80da8370288/src/basic/path-util.c#L1285
The approach using `x-systemd.device-timeout` worked in principle - but I then ran into the issue that sometimes the `azurerm_linux_virtual_machine` resource *itself* blocked, not returning until the machine finished booting up - which it never does if it waits for the block device to be attached, which only happens after the resource returns :see_no_evil:

In the end I gave up on this journey and resorted to the `azurerm_virtual_machine` resource, which does support attaching data disks to VMs at creation time.

I assume keeping systemd in the dark about the mount points, and doing all the waiting / filesystem creation / mounting in cloud-init, would also work, so I still think this PR is useful to have.
> You need to follow one more reference ;-) `is_device_path` just matches the passed string against some known-good path strings.

You're right again :P

> The approach using `x-systemd.device-timeout` worked in principle - but I then ran into the issue that sometimes the `azurerm_linux_virtual_machine` resource *itself* blocked, not returning until the machine finished booting up - which it never does if it waits for the block device to be attached, which only happens after the resource returns 🙈

oof

> In the end I gave up on this journey and resorted to the `azurerm_virtual_machine` resource, which does support attaching data disks to VMs at creation time.

:+1:, glad you got something figured out for your use case.

> I assume keeping systemd in the dark about the mount points, and doing all the waiting / filesystem creation / mounting in cloud-init, would also work, so I still think this PR is useful to have.

Agreed.
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.
If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.
(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)
We're still waiting on the CLA for this PR, correct? Any updates?
> This project requires CLA, not DCO. Whose code is this, @flokli's or @nibalizer's? If @nibalizer, I need you to sign the DCO please.
>
> I assume you need CLA, not DCO.
>
> I opened this PR while at IBM. At the time, I don't think IBM had signed a CCLA with Canonical. I no longer work there. Is it possible we could get this in under "trivial commit"? Or perhaps all this code is authored by @flokli, who can sign the CLA.
>
> cc @jjasghar @bradtopol
This PR is stuck on this. I don't work for IBM and have no plans to. If this can't be considered a trivial contribution that doesn't need a CLA, someone at Canonical needs to convince IBM to sign this.