open-vm-tools icon indicating copy to clipboard operation
open-vm-tools copied to clipboard

wait for cloud-init execution to finish breaks previous behavior

Open vitality411 opened this issue 1 year ago • 18 comments

Describe the bug

Hello, thanks to @lethargosapatheia and issue https://github.com/canonical/cloud-init/issues/4188 in cloud-init repository I was able to find the root cause of the following issue.

Since Ubuntu 22.04 20230602 cloud image version the behavior of cloud-init has unexpectedly changed. Until this version the virtual machine would start, run cloud-init init-local stage, reboot and run the remaining cloud-init stages correctly. In this version cloud-init starts additional stages besides 'init-local' during first boot (see attached cloud-init analyze show output). During these stages it is terminated prematurely by deployPkg. I found out that this is due to https://github.com/vmware/open-vm-tools/commit/cd995a58b07a91d7804d9fdec5545a5fe11e9db9 which changed deployPkg plugin behavior to wait for cloud-init execution to finish. If the cloud-init execution is not finished during default timeout 30s it is killed.

This behavior disturbs automatic provisioning, which rely on correct application of cloud-init settings. In my case, kubermatic/machine-controller starts doing the provisioning through userdata and it's interrupted by the reboot, making it impossible to automatically provision new Kubernetes nodes.

I can confirm the provisioning works properly with version 20230518 of Ubuntu 22.04, where cloud-init is executed correctly, without being terminated prematurely.

Environment details Cloud-init versions are identical on both cloudimage versions:

ii  cloud-guest-utils                    0.32-22-g45fe84a5-0ubuntu1              all          cloud guest utilities
ii  cloud-init                           23.1.2-0ubuntu0~22.04.1                 all          initialization and customization tool for cloud instances
ii  cloud-initramfs-copymods             0.47ubuntu1                             all          copy initramfs modules into root filesystem for later use
ii  cloud-initramfs-dyn-netconf          0.47ubuntu1                             all          write a network interface file in /run for BOOTIF

open-vm-tools

Working: VMware Tools Version: 12.1.0.37487 (build-20219665)
Broken: VMware Tools Version: 12.1.5.39265 (build-20735119)

Operating System Distribution: Ubuntu 22.04.2 LTS Cloud provider, platform or installer type: VMware Cloud Director/OVA

Logs I'm uploading the relevant logs for both images. broken.tar.gz working.tar.gz

Best regards!

Reproduction steps

Working

  1. Deploy VM using Ubuntu 22.04 20230518 cloud image on VMware environment providing userdata which takes longer than 30sec to execute. For example, installing multiple packages and configuring the VM for Kubernetes. I attached an example userdata file.

  2. Verify cloud-init was able to execute only the init-local stage on first boot:

cloud-init analyze show
-- Boot Record 01 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->no cache found @00.00300s +00.00000s
|`->found local data from DataSourceOVF @00.01200s +00.00700s
Finished stage: (init-local) 00.25800 seconds

Total Time: 0.25800 seconds

-- Boot Record 02 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->cache invalid in datasource: DataSourceOVF [seed=com.vmware.guestInfo] @00.00400s +00.00900s
|`->found local data from DataSourceOVF @00.01400s +00.00700s
Finished stage: (init-local) 00.06100 seconds

Starting stage: init-network
...
  1. Verify /var/log/vmware-imc/toolsDeployPkg.log does not contain the following lines:
[2023-08-25T08:18:48.556Z] [    info] Do not trigger reboot if cloud-init is executing.
[2023-08-25T08:18:48.857Z] [    info] Cloud-init status is 'running'.
...
[2023-08-25T08:19:16.029Z] [    info] Cloud-init status is 'running'.
[2023-08-25T08:19:21.029Z] [    info] Timed out waiting for cloud-init execution done.

Broken

  1. Deploy VM using Ubuntu 22.04 20230602 cloud image on VMware environment providing userdata which takes longer than 30sec to execute. For example, installing multiple packages and configuring the VM for Kubernetes. I attached an example userdata file.

  2. Verify cloud-init tried to execute multiple stages besides init-local stage on first boot:

-- Boot Record 01 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->no cache found @00.00500s +00.00000s
|`->found local data from DataSourceOVF @00.12200s +00.00800s
Finished stage: (init-local) 00.42600 seconds

Starting stage: init-network
...
Finished stage: (init-network) 01.55400 seconds

Starting stage: modules-config
...

Total Time: 14.33200 seconds

-- Boot Record 02 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->cache invalid in datasource: DataSourceOVF [seed=com.vmware.guestInfo] @00.00400s +00.01000s
|`->found local data from DataSourceOVF @00.01500s +00.00700s
Finished stage: (init-local) 00.06700 seconds

Starting stage: init-network
...
  1. Check /var/log/vmware-imc/toolsDeployPkg.log for the following lines:
[2023-08-25T08:18:48.556Z] [    info] Do not trigger reboot if cloud-init is executing.
[2023-08-25T08:18:48.857Z] [    info] Cloud-init status is 'running'.
...
[2023-08-25T08:19:16.029Z] [    info] Cloud-init status is 'running'.
[2023-08-25T08:19:21.029Z] [    info] Timed out waiting for cloud-init execution done.
[2023-08-25T08:19:21.130Z] [    info] Trigger reboot.
[2023-08-25T08:19:22.049Z] [    info] Rebooting.
[2023-08-25T08:19:23.150Z] [    info] Reboot has been triggered.

  1. Check journalctl for prematurely termination of cloud-init services
Aug 25 08:19:21 broken systemd[1]: cloud-final.service: Main process exited, code=exited, status=1/FAILURE
Aug 25 08:19:21 broken systemd[1]: cloud-final.service: Failed with result 'exit-code'.
Aug 25 08:19:21 broken systemd[1]: Stopped Execute cloud user/final scripts.
-- Boot 1673d9c3f266400881bfd1b73933b494 --
Aug 25 08:19:54 broken systemd[1]: Starting Execute cloud user/final scripts...

Expected behavior

No breaking change

I understand that this change was required to resolve issues where users want to set a vm's networking and apply cloud-init userdata together before the vm is booted. But still I find it bad practice to change the previous default and break previous working configuration for others. Previously I was able to use the cloud image without modification. Now I have to build my own image with wait-cloudinit-timeout=0 just to restore the previous behavior.

Additional context

No response

vitality411 avatar Aug 28 '23 05:08 vitality411

@vitality411

Thank you for reporting this issue. An internal bug has been filed.

jonathanvmw avatar Aug 28 '23 15:08 jonathanvmw

@vitality411 , Thanks for the detailed description of your issue and mentioning the issue in https://github.com/canonical/cloud-init/issues/4188 reported by @lethargosapatheia

First of all, there is a behaviour change with https://github.com/vmware/open-vm-tools/commit/cd995a58b07a91d7804d9fdec5545a5fe11e9db9 since VMware Tools version 12.1.5 when users want to customize a Linux VM(update hostname, network and etc) and at the same time feed userdata to cloud-init to apply in the same boot.

The reason we made this change is users reported the reboot triggered by guest customization (as @lethargosapatheia mentioned: by calling /etc/teliniit 6) interupts cloud-init execution. In your working case, I see reboot happened right after init-local stage, but in some other cases, reboot could happen in init-network or even modules-config/modules-final stage which is applying userdata, so we want to let cloud-init finish its work before reboot vm.

Now I think it's tricky to set default value of wait-cloudinit-timeout, 30 seconds are not enough for some userdata. @vitality411 What if you set wait-cloudinit-timeout=300 to let cloud-init finish execution before the reboot, will it break your workflow? I see in your working case, the total cloud-init execution time is 290 seconds. I know this still can NOT resolve your problem, still need manually build image.

Total Time: 290.29000 seconds

2 boot records analyzed

@lethargosapatheia , For your case, you mentioned userdata can be applied in 30 seconds, but packer (and terraform for that matter) will connect to the virtual machines over ssh before a last reboot, and the provisioning will be interrupted. I understand that ssh is not ready before reboot before this behaviour change, but now ssh is ready before reboot. The workaround provided by @vitality411 shall work for you. In the meanwhile, can you hold provisioning until the reboot happenes? There is always a reboot. Want to understand your workflow more, share your packer configuration here?

We have an internal bug tracking this issue, thanks @jonathanvmw .

Best regards, Pengpeng

PengpengSun avatar Aug 29 '23 10:08 PengpengSun

@PengpengSun Thanks. When I set wait-cloudinit-timeout=300 cloud-init is able to finish execution before the reboot and the workflow is not terminated prematurely by deployPkg. Still requires a custom image.

vitality411 avatar Aug 29 '23 13:08 vitality411

@PengpengSun The short answer is "no", I cannot stop the provisioning, there is no way packer or terraform can automatically know when the right moment to connect to the virtual machine over ssh is. They just try until the ssh server is up and running.

So, for instance, the whole process starting from the cloning of the cloudimage with packer is automatic. The only thing I could probably do is maybe delay the ssh connectivity, which, of course, is a bad solution, because that interval will be relative.

There is always a reboot.

I beg to differ and I think the crux of the problem is exactly the fact that users don't have a choice here. Just like cloud-init doesn't require rebooting the machine if you don't tell it to, there is no reason why vmware-tools should do that or better said, there is no reason why the user shouldn't be able to decide whether the virtual machine reboots or not. And the proof for that is exactly my workaround, which is removing the telinit symlink, which in turn makes the whole process much faster without any apparent issues.

vitaly411's workaround (setting wait-cloudinit-timeout to 0) will probably solve the issue (I haven't tested it) in that the vmware-tools will reboot earlier. But as far as I understand, this relies on what is, for all intents and purposes, a hack, where vmware-tools would more or less randomly reboot the virtual machine just before cloud-init reaches the init-network stage. It's not as if vmware-tools explicitly wait for init-local stage to finish and then reboot, right? It just happens that it reboots before the init-network is initiated.

So the real solution is just this: give users the possibility of not rebooting the virtual machine. You don't need to change the default.

lethargosapatheia avatar Aug 31 '23 09:08 lethargosapatheia

@vitality411 and @lethargosapatheia Thanks for your reply. I understand your workflows a bit more, customizing network and applying cluod-init userdata together in a single boot is a scenario which we want to make it work smoothly. I see @vitality411 is putting userdata to ovfenv.xml format and set in guestinfo per DataSourceOVF [seed=com.vmware.guestInfo], I'm not sure which data transfort @lethargosapatheia is using. But I have 2 recommendations:

  1. Have you checked https://cloudinit.readthedocs.io/en/latest/reference/datasources/vmware.html#guestinfo-keys, you can put network configuration into guestinfo.metadata and userdata into guestinfo.userdata, in this way, all networking and userdata will be handled by cloud-init, no need customization task involved, no rebooting.
  2. You can set both network configuration and userdata in customization spec, check the solution 3 in this KB 90331, now you are using solution 1. This API doc is telling how to put both network configuration and userdata into customization spec "CloudInitPrep", no rebooting.

And for the not rebooting suggestion, we have few blockers to resolve, ex: how to make sure customized network settings take effect(Not only ubuntu, but all Linux distroes we support), what if a hardware change happenes(adding/removing network adapters) before customize vm network settings.

Back to this behavior change, this is another solution we want to provide, see solution 1 in KB 90331. As I mentioned, rebooting could happen at any stages of cloud-init execution before this change. Now we have an internal bug to trace this issue, will post here if I have any updates.

PengpengSun avatar Aug 31 '23 12:08 PengpengSun

@PengpengSun I am running on VMware Cloud Director 10.4. According to https://github.com/vmware/cluster-api-provider-cloud-director/issues/506 and the linked Slack thread it is currently to possible to use native cloud-init customization.

vitality411 avatar Sep 04 '23 05:09 vitality411

@vitality411 Yes, this is a long standing issue, but we are working on addressing it recently. One workaround descripbed in this KB: https://kb.vmware.com/s/article/71264, in your case, you can add one more runcmd in userdata to add the setting manual_cache_clean: True to the /etc/cloud/cloud.cfg during the first boot, then reboot won't set network to fallback settings.

PengpengSun avatar Sep 04 '23 14:09 PengpengSun

@PengpengSun I would appreciate if you could share the progress on this issue.

vitality411 avatar Dec 06 '23 09:12 vitality411

@vitality411 We are working on a change to add a new field into customization specification, when people creating a spec, they can set this new field, the value(number in seconds) of this field will overwrite the value of wait-cloudinit-timeout in vm. For your case, you can set this field to 0 or 300. How does this change sound to you?

PengpengSun avatar Dec 07 '23 02:12 PengpengSun

Could this be related to my problem ?

https://github.com/canonical/cloud-init/issues/4977

The behaviour sound quite similar to my problem..

erSitzt avatar Feb 29 '24 13:02 erSitzt

After looking at your issue, it doesn't seem to be related. To verify you can try if it works with Ubuntu 22.04 20230518 and no longer works with Ubuntu 22.04 20230602.

Have you taken a look at cloud-init analyze show for both cases?

vitality411 avatar Mar 01 '24 06:03 vitality411

@vitality411 yeah thanks, it looks like a change in cloud-init`s ds-identify causes my problem

erSitzt avatar Mar 06 '24 09:03 erSitzt

@PengpengSun Any news on the change you were working on?

vitality411 avatar May 03 '24 06:05 vitality411

@vitality411 The adding a new field into customization specification solution is not available yet, but I think the issue mentioned https://github.com/vmware/cluster-api-provider-cloud-director/issues/506 has been addressed recently, please check https://github.com/canonical/cloud-init/pull/4997.

For now, I still suggest you try one of below solutions as I mentioned:

  1. https://cloudinit.readthedocs.io/en/latest/reference/datasources/vmware.html#guestinfo-keys, you can put network configuration into guestinfo.metadata and userdata into guestinfo.userdata, in this way, all networking and userdata will be handled by cloud-init, no need customization task involved, no rebooting.
  2. You can set both network configuration and userdata in customization spec, check the solution 3 in this KB 90331, now you are using solution 1. This API doc is telling how to put both network configuration and userdata into customization spec "CloudInitPrep", no rebooting.

PengpengSun avatar May 06 '24 09:05 PengpengSun