tectonic-installer icon indicating copy to clipboard operation
tectonic-installer copied to clipboard

waagent.service on azure provisioned nodes are failing

Open tlyng opened this issue 7 years ago • 8 comments

Have had problems with disk attachment/detachment on Azure. Noticed that Azure say that the node is in "provisioningState" "updating". Doing some further digging I discovered that the waagent.service extension installed was in this "provisioningState": "updating" as well. When in this state it is difficult to attach / detach PVCs to the node.

Failing logs for waagent.service:

Oct 24 10:54:45 smblab-ne-worker-2 python[903]: 2017/10/24 10:54:45.667140 INFO [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.4.45.3] Remove extension handler directory: /var/lib/waagent/Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.4.45.3
Oct 24 10:54:45 smblab-ne-worker-2 python[903]: 2017/10/24 10:54:45.687577 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=, duration=12784
Oct 24 11:10:06 smblab-ne-worker-2 python[903]: 2017/10/24 11:10:06.020831 INFO Event: name=WALinuxAgent, op=HeartBeat, message=, duration=0
Oct 24 11:22:05 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:05.689854 INFO [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.4.45.3] Target handler state: enabled
Oct 24 11:22:05 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:05.690702 INFO [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.4.45.3] [Enable] current handler state is: notinstalled
Oct 24 11:22:05 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:05.691455 WARNING Failed to download from: https://rdfepirv2db3prdstr02.blob.core.windows.net/8b23413e0ce64aa89889a295b6122112/Microsoft.EnterpriseCloud.Monitoring__OmsAgentForLinux__1.4.45.3
Oct 24 11:22:05 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:05.691717 WARNING Download did not succeed, falling back to host plugin
Oct 24 11:22:08 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:08.904021 INFO Event: name=Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux, op=Download, message=Download succeeded, duration=0
Oct 24 11:22:08 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:08.905118 INFO [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.4.45.3] Initialize extension directory
Oct 24 11:22:08 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:08.906024 INFO [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.4.45.3] Update settings file: 0.settings
Oct 24 11:22:08 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:08.906394 INFO [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.4.45.3] Install extension [omsagent.py -install]
Oct 24 11:22:09 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:09.910347 ERROR Event: name=Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux, op=Install, message=[000003] Non-zero exit code: 51, omsagent.py -install, duration=0
Oct 24 11:22:09 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:09.927789 INFO [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.4.45.3] Remove extension handler directory: /var/lib/waagent/Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.4.45.3
Oct 24 11:22:09 smblab-ne-worker-2 python[903]: 2017/10/24 11:22:09.946773 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=, duration=7348
Oct 24 11:40:08 smblab-ne-worker-2 python[903]: 2017/10/24 11:40:08.235418 INFO Event: name=WALinuxAgent, op=HeartBeat, message=, duration=0
Oct 24 12:10:09 smblab-ne-worker-2 python[903]: 2017/10/24 12:10:09.440165 INFO Event: name=WALinuxAgent, op=HeartBeat, message=, duration=0

By visiting the Azure Portal and uninstalling the extension the VM provisioningState moved into "Succeeded" state and disks started attaching again.

tlyng avatar Oct 24 '17 12:10 tlyng

I am seeing a similar provisioning problem as well. The problem has to do with a fresh VM with attached data disks. The provision fails and in the boot logs I see.

2017/10/27 18:37:22.343545 INFO WALinuxAgent-2.2.18 running as process 1175
2017/10/27 18:37:22.348321 INFO Event: name=WALinuxAgent, op=AutoUpdate, message=, duration=0
2017/10/27 18:37:22.351943 INFO Wire server endpoint:168.63.129.16
2017/10/27 18:37:22.355027 ERROR Command: 'iptables -w -t security --flush'
2017/10/27 18:37:22.358052 ERROR Return code: 3
2017/10/27 18:37:22.360457 ERROR Result: iptables v1.4.21: can't initialize iptables table `security': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.

2017/10/27 18:37:22.361418 INFO Unable to flush firewall -- no further attempts will be made: non-zero return code
2017/10/27 18:37:22.372187 INFO Event: name=WALinuxAgent, op=InitializeHostPlugin, message=, duration=0

If I go back to 1465.6.0 it works. I tried using 1520.6.0 and 1520.8.0. Machines that upgrade from 1465 to 1520 seem to continue to work without issue. I went back to 1465.6.0 because that is the last version that we used for fresh VM deployments. It is likely one of the version bumps since then that bumped kernel version is to blame.

cehoffman avatar Oct 27 '17 18:10 cehoffman

I want to add if this isn't clear. You can not currently provision VMs that use data disks with the current stable Container Linux version on Azure due to this missing kernel module.

cehoffman avatar Oct 27 '17 21:10 cehoffman

I've had no issues using premium lrs disks after I disabled the waagent.

tlyng avatar Oct 29 '17 19:10 tlyng

@crawford @euank can we get some attention on this one.

coresolve avatar Nov 09 '17 20:11 coresolve

@tlyng That extension doesn't support Container Linux (per their readme).

Azure extensions often don't work on Container Linux.

Uninstalling the incompatible extension is the correct way to resolve this I think. Perhaps our azure docs could call extensions out explicitly.


@cehoffman Does the iptables error message wagent logs result in any real impact? We should include that iptables module certainly, but my test machine appears to be fine still without it.

euank avatar Nov 09 '17 23:11 euank

@euank you’re right this missing module seems to be fine since it exists when deploying without data disks.

The problem I have is that the same ignition config works on 1465 and 1520 when not using data disks. When I provision with data disks I can only provision with 1465 and then upgrade to 1520. The only addition to the ignition config is to create a raid from the data disks and that completes just fine in the boot logs on 1520. It seems that something fails to post back to Azure to signal successful creation on 1520. Additionally the same config minus the disk raid and data disks works with 1520.

cehoffman avatar Nov 10 '17 08:11 cehoffman

@tlyng / @cehoffman -- are you still having issues with this?

justaugustus avatar Dec 12 '17 04:12 justaugustus

@justaugustus yes, it seems to be flakey in so much that new OS releases seem to fail for some period of time and then they will work. For example, 1520.9.0 worked with the exact same ignition configuration and VM size on Dec 2 as had been failing on my Nov 10th comment.

Today I tried using the same ignition configuration and same VM size using 1576.4.0 and it failed to leave provisioning status in the Azure API. The VM came up just fine as far as I can tell, but terraform never received confirmation since Azure thought it was still provisioning and hence terraform thinks the VM does not exist.

In the logs for waagent it does print that provision succeeded in the boot before tectonic-torx reboots the OS to revert docker in the 1576.4.0 from 17.09 to 1.12. This is the journalctl log of that instance that failed and a successful provision of the same configuration with 1520.9.0 as the base image.

I tried with the most recent OS release a couple times before reverting to 1520.9.0 and having k8s-node-bootstrap do the OS upgrade.

cehoffman avatar Dec 12 '17 06:12 cehoffman