elemental
Cluster creation gets stuck while provisioning
What steps did you take and what happened: I followed the quickstart guide. Once I install the OS and reboot, I see the elemental-system-agent (ESA) get preempted by the rancher-system-agent (RSA), but then the RSA sits there and doesn't install k3s.
At this point, I can see that everything is set up correctly in the upstream CRDs but the MachineInventory is stuck waiting for the plan to be applied. The discrepancy between the applied and plan checksum in the status field shows this to be the case.
When I log into the host and view the journal for ESA, I see that the plan was indeed applied and that the MachineInventory status is not correct. Since I've seen this before, I guessed that it was because the status didn't get synced before the RSA started and stopped the ESA from running.
A workaround to get my cluster up and running is to run these commands:
- systemctl stop rancher-system-agent
- mv rancher-system-agent.service rancher-system-agent.service.back
- systemctl start elemental-system-agent
- sleep 5
- systemctl stop elemental-system-agent
- mv rancher-system-agent.service.back rancher-system-agent.service
- systemctl start rancher-system-agent
The cluster should start being provisioned after this.
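For anyone who needs to repeat the workaround, the steps above can be wrapped in a small shell function. This is only a sketch: the unit directory (/etc/systemd/system), the function name resync_machine_inventory, and the SYSTEMCTL override variable are assumptions for illustration and are not taken from the quickstart. The original commands were run from whichever directory holds the unit file.

```shell
# Sketch of the manual workaround as a reusable function.
# Assumptions (not from the issue): the unit directory default and the
# SYSTEMCTL override, which exists only to allow a dry run.
resync_machine_inventory() {
    unit_dir="${1:-/etc/systemd/system}"   # assumed location of the RSA unit file
    settle="${2:-5}"                       # seconds to let the ESA sync the status
    rsa_unit="$unit_dir/rancher-system-agent.service"

    # Stop the RSA and move its unit file aside, as in the manual steps.
    "${SYSTEMCTL:-systemctl}" stop rancher-system-agent || return 1
    mv "$rsa_unit" "$rsa_unit.back" || return 1

    # Give the ESA a short window to report the applied plan upstream.
    "${SYSTEMCTL:-systemctl}" start elemental-system-agent
    sleep "$settle"
    "${SYSTEMCTL:-systemctl}" stop elemental-system-agent

    # Restore the RSA unit and hand control back to it.
    mv "$rsa_unit.back" "$rsa_unit" || return 1
    "${SYSTEMCTL:-systemctl}" start rancher-system-agent
}
```

Calling resync_machine_inventory with no arguments as root on the stuck host runs the same sequence as the list above, with the 5 second pause.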
What did you expect to happen: I expect the cluster to be created without manual intervention.
Anything else you would like to add:
I'll be demoing to the community in a master class next Wednesday and would prefer to not need a manual workaround for this.
Environment:
- Elemental release version (use cat /etc/os-release): 0.6.0
- Rancher version: 2.6.7
- Kubernetes version (use kubectl version): upstream 1.24.10+k3s1 -- downstream v1.23.7+k3s1
- Cloud provider or hardware configuration: digital ocean upstream -- Intel NUCs downstream
The rancher-system-agent (RSA) tears down the elemental-system-agent (ESA) before moving on to the cluster provisioning plan. In this case, as stated above, it does so a bit too early, likely before the ESA is able to report that the rancher-system-agent deployment plan was successful. The proper solution is to rework the handover from ESA to RSA.
In the meantime, one trick that would help is to delay stopping the elemental-system-agent when the stop command is received.
Adding a line like ExecStop=sleep 10 to the elemental-system-agent.service in the ISO should do the trick.
I tested adding a long delay (5 minutes) before tearing down the elemental-system-agent. That also required changing the default systemd stop timeout for the ESA service:
ExecStop=sleep 300
TimeoutStopSec=6min
While this worked, the issue with a long timeout is that no other plans are executed until the plan that stops the elemental-system-agent has completed, so the cluster installation had to wait quite a bit (5 minutes).
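To experiment with different delays without rebuilding the ISO, the same settings could be written as a systemd drop-in. This is a sketch under assumptions: the function name write_stop_delay_dropin, the drop-in file name 10-stop-delay.conf, and the 60 second margin on the stop timeout are hypothetical, and the 90 second threshold assumes the stock systemd default stop timeout. The change discussed here edits the unit shipped in the ISO instead.

```shell
# Sketch: write a drop-in that delays the ESA stop by a given number of seconds.
# Assumptions (not from the issue): the drop-in file name, the 60s margin and
# the 90s threshold, which matches the stock systemd stop timeout.
write_stop_delay_dropin() {
    systemd_dir="$1"    # e.g. /etc/systemd/system
    delay="$2"          # delay in whole seconds, e.g. 10 or 300
    dropin_dir="$systemd_dir/elemental-system-agent.service.d"
    dropin="$dropin_dir/10-stop-delay.conf"

    mkdir -p "$dropin_dir" || return 1
    {
        echo "[Service]"
        echo "ExecStop=sleep $delay"
        # A delay at or above the stock stop timeout needs a larger
        # TimeoutStopSec, otherwise systemd ends the stop job early.
        if [ "$delay" -ge 90 ]; then
            echo "TimeoutStopSec=$((delay + 60))"
        fi
    } > "$dropin"
}
```

For example, write_stop_delay_dropin /etc/systemd/system 300 produces the ExecStop=sleep 300 line with a 360 second (6 minute) stop timeout. A systemctl daemon-reload is needed before systemd picks up the new drop-in.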
Is this a change that could be published today or Monday? It seems low risk
Opened a PR; yep, it looks really low risk. Another workaround (shared by @davidcassany) could be to avoid disabling the elemental-system-agent at all. This would require a change in the operator. I would leave the proper fix for later, as it would surely require more time (and would be more invasive).
A good follow-up for this issue could be a two-step action point:
- Experiment further with sharing elemental-system-agent and rancher-system-agent configuration options. Is there a chance that one could replace the other, or that one could handle the plans of the other? This failed miserably in the early days, but at that point we were not properly familiar with the full cycle.
- With all the gathered knowledge, get in contact with the system-agent and provisioning teams so we can eventually evaluate the chances of adapting the pieces for our use case.