elemental Cluster creation gets stuck while provisioning

What steps did you take and what happened: I was following the quickstart guide. Once I install the OS and reboot, I see the elemental-system-agent (ESA) get preempted by the rancher-system-agent (RSA), but then the RSA sits there and doesn't install k3s.

At this point, I can see that everything is set up correctly in the upstream CRDs but the MachineInventory is stuck waiting for the plan to be applied. The discrepancy between the applied and plan checksum in the status field shows this to be the case.

When I log into the host and view the journal for ESA, I see that the plan was indeed applied and that the MachineInventory status is not correct. Since I've seen this before, I guessed that it was because the status didn't get synced before the RSA started and stopped the ESA from running.

A workaround to get my cluster up and running is to run these commands:

systemctl stop rancher-system-agent
mv rancher-system-agent.service rancher-system-agent.service.back
systemctl start elemental-system-agent
sleep 5
systemctl stop elemental-system-agent
mv rancher-system-agent.service.back rancher-system-agent.service
systemctl start rancher-system-agent

The cluster should start being provisioned after this.
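The same sequence can be wrapped in a small function so it is easier to repeat on several hosts. This is only a sketch: the function names, the `DRY_RUN` switch, and the `unit_dir` argument are hypothetical, and the directory that holds rancher-system-agent.service has to be supplied by the caller because the commands above do not name it.

```shell
#!/bin/sh
# Sketch of the manual workaround as one function (names are hypothetical).
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

resync_machine_inventory() {
    unit_dir=$1                 # directory containing rancher-system-agent.service
    wait_seconds=${2:-5}        # pause length, 5 seconds as in the original commands
    rsa_unit="$unit_dir/rancher-system-agent.service"

    run systemctl stop rancher-system-agent
    # Move the RSA unit aside, as in the original commands.
    run mv "$rsa_unit" "$rsa_unit.back"
    run systemctl start elemental-system-agent
    # Pause, presumably so the ESA can sync the MachineInventory status.
    run sleep "$wait_seconds"
    run systemctl stop elemental-system-agent
    run mv "$rsa_unit.back" "$rsa_unit"
    run systemctl start rancher-system-agent
}
```

Running `DRY_RUN=1 resync_machine_inventory <unit_dir>` first prints the seven commands so they can be checked before anything is stopped or moved.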

What did you expect to happen: I expect the cluster to be created without manual intervention.

Anything else you would like to add:

I'll be demoing to the community in a master class next Wednesday and would prefer to not need a manual workaround for this.

Environment:

Elemental release version (use cat /etc/os-release): 0.6.0
Rancher version: 2.6.7
Kubernetes version (use kubectl version): upstream 1.24.10+k3s1 -- downstream v1.23.7+k3s1
Cloud provider or hardware configuration: DigitalOcean upstream -- Intel NUCs downstream

Sep 28 '22 17:09 agracey

The rancher-system-agent (RSA) tears down the elemental-system-agent (ESA) before moving on to the cluster provisioning plan. In this case, as stated above, it does so a bit too early, likely before the ESA is able to report that the rancher-system-agent deployment plan was successful. The proper solution is to rework the handover from ESA to RSA.

In the meantime, one trick that would help is to delay stopping the elemental-system-agent when the stop command is received. Adding a line like ExecStop=sleep 10 to the elemental-system-agent.service in the ISO should do the trick.

I tested adding a long delay (5 minutes) before tearing down the elemental-system-agent. That also required changing the default stop timeout in systemd for the ESA service:

ExecStop=sleep 300
TimeoutStopSec=6min

While this worked, the issue with a long timeout is that no other plans are executed until the plan that stops the elemental-system-agent has completed, so the cluster installation had to wait quite a bit (5 minutes).
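For anyone who wants to try the stop delay without rebuilding the unit file, it could also be expressed as a systemd drop-in. The sketch below is an assumption, not what the PR does: the function name, the drop-in file name `10-stop-delay.conf`, the `/usr/bin/sleep` path, and the 60-second margin added to the stop timeout are all illustrative choices.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that delays stopping the ESA.
# The drop-in file name and the /usr/bin/sleep path are assumptions.

write_esa_stop_delay() {
    dropin_dir=$1       # e.g. /etc/systemd/system/elemental-system-agent.service.d
    delay_seconds=$2    # how long ExecStop should sleep
    # Keep the stop timeout above the delay; the 60-second margin is arbitrary.
    timeout_seconds=$((delay_seconds + 60))

    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/10-stop-delay.conf" <<EOF
[Service]
ExecStop=/usr/bin/sleep $delay_seconds
TimeoutStopSec=$timeout_seconds
EOF
}
```

Calling `write_esa_stop_delay <dropin_dir> 300` produces the 300-second sleep with a 360-second (6-minute) stop timeout used in the test above, while a delay of 10 stays within systemd's default stop timeout anyway. A `systemctl daemon-reload` is needed before systemd picks up a new drop-in; whether a drop-in survives in an Elemental image, as opposed to changing the service in the ISO, is not covered in this thread.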

Sep 30 '22 11:09 fgiudici

Is this a change that could be published today or Monday? It seems low risk

Sep 30 '22 13:09 agracey

Opened a PR; yep, it looks really low risk. Another workaround (shared by @davidcassany) could be to avoid disabling the elemental-system-agent at all. This would require a change in the operator. I would leave the proper fix for later, as it would surely require more time (and would be more invasive).

Sep 30 '22 17:09 fgiudici

A good follow-up for this issue could be a two-step action point:

1. Experiment further with sharing elemental-system-agent and rancher-system-agent configuration options. Is there a chance that one could replace the other, or that one could handle the plans of the other? This failed miserably in the early days, but at that point we were not properly familiar with the full cycle.
2. With all the gathered knowledge, get in contact with the system-agent and provisioning team so we can eventually evaluate the chances of adapting the pieces to our use case.

Oct 11 '22 07:10 davidcassany
