Upgrade is not triggered (sometimes)
This issue was spotted during the analysis of CI flaky tests (https://github.com/rancher/elemental/issues/383).
First of all, the issue is sporadic: sometimes the tests pass and sometimes they don't.
What steps did you take and what happened:
I have one Elemental cluster node installed through the Rancher UI.
Purpose of the test is to validate the update of my cluster to this image: quay.io/costoolkit/elemental-ci:latest
To update my node, I create an OS Images Upgrades resource like this:
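For reference, the resource the UI creates is a ManagedOSImage; a minimal sketch of the equivalent manifest, assuming the elemental.cattle.io/v1beta1 API and field names as I remember them from the operator CRD (treat them as approximate):

kubectl apply -f - <<EOF
apiVersion: elemental.cattle.io/v1beta1
kind: ManagedOSImage
metadata:
  name: my-os-upgrade        # hypothetical name
  namespace: fleet-default
spec:
  # image the node should be upgraded to
  osImage: quay.io/costoolkit/elemental-ci:latest
  clusterTargets:
    - clusterName: myelementalcluster
EOF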

Once done, I check that the cluster status changes from Active to Updating:
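The same state can also be watched from the CLI on the provisioning cluster object; roughly (resource and namespace as used elsewhere in this thread):

# watch the cluster until it leaves the Active state
kubectl get clusters.provisioning.cattle.io myelementalcluster -n fleet-default -w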

And this is where the issue happens: sometimes the upgrade is not triggered, the status stays Active, and we hit the timeout in the test.
Issue in the CI https://github.com/rancher/elemental/actions/runs/3225014880
Video https://user-images.githubusercontent.com/6025636/195086990-4a0224b8-048c-45f4-a0cc-6d6ded377d0c.mp4
What did you expect to happen: The upgrade should be triggered each time.
Anything else you would like to add: I tried increasing the timeouts in the Cypress code and I also migrated to the latest Cypress version.
If I remember correctly, I was able to reproduce it manually one time; in the end the upgrade was triggered, but something like 20 minutes after the OS Images Upgrades resource got created...
We do not see this issue happen in the backup e2e test and we also use ManagedOSImage there.
Environment:
- Elemental release version (use cat /etc/os-release):
rancher-26437:~ # cat /etc/os-release 
NAME="SLE Micro"
VERSION="5.3"
VERSION_ID="5.3"
PRETTY_NAME="SUSE Linux Enterprise Micro for Rancher 5.3"
ID="sle-micro-rancher"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sle-micro-rancher:5.3"
IMAGE_REPO="registry.opensuse.org/isv/rancher/elemental/teal53/15.4/rancher/elemental-node-image/5.3"
IMAGE_TAG="20.14"
IMAGE="registry.opensuse.org/isv/rancher/elemental/teal53/15.4/rancher/elemental-node-image/5.3:20.14"
TIMESTAMP=20221004064711
GRUB_ENTRY_NAME="Elemental"
- Rancher version: 2.6.8
- Kubernetes version (use kubectl version): v1.23.6+k3s1
- Cloud provider or hardware configuration: Libvirt
 
The update command should be visible in the elemental-operator log. If it's there, it's an operator problem. If not, it's a UI problem.
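Something along these lines should be enough to check that, reusing the selector and namespace from the command further down in this thread; the grep pattern is just a guess at what to look for:

kubectl logs --selector=app=elemental-operator -n cattle-elemental-system | grep -iE 'upgrade|managedosimage'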
Ok, let me add something in the code to upload the elemental-operator log at the end of every test.
It's sufficient to upload logs of failed tests :wink:
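A minimal sketch of that step (the CI wiring and the TEST_STATUS variable are hypothetical, only the kubectl part comes from this thread):

# run at the end of the e2e job; only keep the log when the test failed
if [ "${TEST_STATUS}" = "failure" ]; then
  kubectl logs --selector=app=elemental-operator -n cattle-elemental-system > elemental-operator.log
fi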
Looks like the log is not very big...
time="2022-10-11T15:06:05Z" level=error msg="error syncing 'fleet-default/nc-myelementalcluster-pool1-mts96-b4bc4': handler machine-inventory-selector-bootstrap: cannot find machine, requeuing"
time="2022-10-11T15:06:05Z" level=error msg="error syncing 'fleet-default/nc-myelementalcluster-pool1-mts96-b4bc4': handler machine-inventory-selector-bootstrap: cannot find machine, requeuing"
time="2022-10-11T15:06:05Z" level=error msg="error syncing 'fleet-default/nc-myelementalcluster-pool1-mts96-b4bc4': handler machine-inventory-selector-bootstrap: cannot find machine, requeuing"
time="2022-10-11T15:06:05Z" level=error msg="error syncing 'fleet-default/nc-myelementalcluster-pool1-mts96-b4bc4': handler machine-inventory-selector-bootstrap: cannot find machine, requeuing"
time="2022-10-11T15:06:06Z" level=error msg="error syncing 'fleet-default/nc-myelementalcluster-pool1-mts96-b4bc4': handler machine-inventory-selector-bootstrap: cannot find machine, requeuing"
time="2022-10-11T15:06:06Z" level=error msg="error syncing 'fleet-default/nc-myelementalcluster-pool1-mts96-b4bc4': handler machine-inventory-selector-bootstrap: cannot find machine, requeuing"
time="2022-10-11T15:10:15Z" level=info msg="Creating event broadcaster for mos-sync"
time="2022-10-11T15:10:15Z" level=info msg="Requeueing sync from controller"
time="2022-10-11T15:10:20Z" level=info msg="Creating event broadcaster for mos-sync"
time="2022-10-11T15:10:20Z" level=info msg="Syncing 'fleet-default/os-versions' (JSON)"
I gathered it with this command:
kubectl logs --selector=app=elemental-operator -n cattle-elemental-system > elemental-operator.log
@Itxaka do you know which logs could be useful to debug this please?
Rancher logs maybe, but on debug; operator logs also on debug. No idea what else, I would need to actually be able to reproduce it to understand, because it doesn't make much sense.
Have we discarded it being a UI/cache issue, i.e. not being properly refreshed? Like, does the issue happen if the page is reloaded?
Tests should probably launch elemental-support on the nodes on failure and get the resulting artifact on the job, so we can explore the whole system on failures.
UI/cache issue: I don't think so because each time, the environment is a fresh one.
elemental-support? Interesting! What's this? Kind of a supportconfig, I guess.
A test just passed and the operator's log is exactly the same. https://github.com/rancher/elemental/actions/runs/3228180049
Logs generated by elemental-support node-ed73e312-a98d-4a67-8620-e86869cab2d3-2022-10-12T09:50:40Z.tar.gz
Ok, finally, the upgrade started 20 minutes later... Looks like something delayed it.
I'm attaching the elemental-support bundle taken after the upgrade.
node-ed73e312-a98d-4a67-8620-e86869cab2d3-2022-10-12T10:16:39Z.tar.gz
PS: I'm suspecting etcd
I've tried disabling etcd snapshots; a test is running.
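For reference, one way to express that on the provisioning cluster object (the rkeConfig.etcd.disableSnapshots field path is from memory, so treat it as an assumption):

# disable scheduled etcd snapshots on the downstream cluster
kubectl patch clusters.provisioning.cattle.io myelementalcluster -n fleet-default \
  --type merge -p '{"spec":{"rkeConfig":{"etcd":{"disableSnapshots":true}}}}'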
ok the test is green! https://github.com/rancher/elemental/actions/runs/3234574017/jobs/5297857273 Luck or real fix? let's trigger it one more time. https://github.com/rancher/elemental/actions/runs/3234832989/jobs/5298445291
I can see an error in the attached logs, in k3s.log, about the os-upgrader job failing:
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: I1012 10:12:38.290473    1552 reconciler.go:221] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-bj7gl\" (UniqueName: \"kubernetes.io/projected/5daf17df-4789-4f6d-a887-777fea38d10f-kube-api-access-bj7gl\") pod \"apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-dqdnz\" (UID: \"5daf17df-4789-4f6d-a887-777fea38d10f\") " pod="cattle-system/apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-dqdnz"
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: I1012 10:12:38.290536    1552 reconciler.go:221] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-root\" (UniqueName: \"kubernetes.io/host-path/5daf17df-4789-4f6d-a887-777fea38d10f-host-root\") pod \"apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-dqdnz\" (UID: \"5daf17df-4789-4f6d-a887-777fea38d10f\") " pod="cattle-system/apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-dqdnz"
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: I1012 10:12:38.290568    1552 reconciler.go:221] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pod-info\" (UniqueName: \"kubernetes.io/downward-api/5daf17df-4789-4f6d-a887-777fea38d10f-pod-info\") pod \"apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-dqdnz\" (UID: \"5daf17df-4789-4f6d-a887-777fea38d10f\") " pod="cattle-system/apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-dqdnz"
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: I1012 10:12:38.290630    1552 reconciler.go:221] "operationExecutor.VerifyControllerAttachedVolume started for volume \"secret-os-upgrader-data\" (UniqueName: \"kubernetes.io/secret/5daf17df-4789-4f6d-a887-777fea38d10f-secret-os-upgrader-data\") pod \"apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-dqdnz\" (UID: \"5daf17df-4789-4f6d-a887-777fea38d10f\") " pod="cattle-system/apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-dqdnz"
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: I1012 10:12:38.290667    1552 shared_informer.go:247] Caches are synced for PVC protection
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: W1012 10:12:38.292906    1552 actual_state_of_world.go:539] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="node-ed73e312-a98d-4a67-8620-e86869cab2d3" does not exist
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: I1012 10:12:38.296206    1552 shared_informer.go:247] Caches are synced for HPA
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: E1012 10:12:38.300840    1552 job_controller.go:533] syncing job: failed pod(s) detected for job key "cattle-system/apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-7f760"
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: I1012 10:12:38.309274    1552 job_controller.go:498] enqueueing job cattle-system/apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-7f760
2022-10-12T10:12:38+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: E1012 10:12:38.310650    1552 job_controller.go:533] syncing job: tracking status: adding uncounted pods to status: Operation cannot be fulfilled on jobs.batch "apply-os-upgrader-on-node-ed73e312-a98d-4a67-8620-e86869c-7f760": the object has been modified; please apply your changes to the latest version and try again
Can't check further (in a car currently) but it seems something to have a look at, maybe?
Also, was there a k3s restart in the first log? So the upgrade worked?
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: Stopping Lightweight Kubernetes...
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.984218    1818 health_controller.go:182] Shutting down HealthController RunCheck
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.987456    1818 secure_serving.go:311] Stopped listening on 127.0.0.1:10259
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.988182    1818 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.988467    1818 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.989078    1818 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.989432    1818 dynamic_serving_content.go:145] "Shutting down controller" name="serving-cert::/var/lib/rancher/k3s/server/tls/kube-scheduler/kube-scheduler.crt::/var/lib/rancher/k3s/server/tls/kube-scheduler/kube-scheduler.key"
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.989590    1818 network_policy_controller.go:176] Shutting down network policies full sync goroutine
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.989853    1818 network_policy_controller.go:191] Shutting down network policies controller
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.990210    1818 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: W1012 10:11:18.991021    1818 reflector.go:442] k8s.io/[email protected]/tools/cache/reflector.go:167: watch of *v1.Endpoints ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: W1012 10:11:18.991234    1818 reflector.go:442] k8s.io/[email protected]/tools/cache/reflector.go:167: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.992564    1818 watch.go:39] context canceled, close receiver chan
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: I1012 10:11:18.996728    1818 vxlan_network.go:75] evts chan closed
2022-10-12T10:11:18+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1818]: time="2022-10-12T10:11:18Z" level=info msg="Shutting down k3s.cattle.io/v1, Kind=Addon workers"
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Deactivated successfully.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Unit process 2607 (containerd-shim) remains running after unit stopped.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Unit process 2637 (containerd-shim) remains running after unit stopped.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Unit process 2685 (containerd-shim) remains running after unit stopped.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Unit process 3584 (containerd-shim) remains running after unit stopped.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Unit process 3740 (containerd-shim) remains running after unit stopped.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Unit process 4431 (containerd-shim) remains running after unit stopped.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Unit process 5579 (containerd-shim) remains running after unit stopped.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Unit process 7200 (containerd-shim) remains running after unit stopped.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: k3s.service: Unit process 7325 (containerd-shim) remains running after unit stopped.
2022-10-12T10:11:19+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: Stopped Lightweight Kubernetes.
-- Boot 3e2343872bc84137b79bdcc50ba6d4a3 --
2022-10-12T10:11:53+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 systemd[1]: Starting Lightweight Kubernetes...
2022-10-12T10:11:53+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 sh[1522]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
2022-10-12T10:11:53+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 sh[1523]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
2022-10-12T10:11:54+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: time="2022-10-12T10:11:54Z" level=info msg="Starting k3s v1.23.10+k3s1 (826b9498)"
2022-10-12T10:11:54+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: time="2022-10-12T10:11:54Z" level=info msg="Managed etcd cluster bootstrap already complete and initialized"
2022-10-12T10:11:54+0000 node-ed73e312-a98d-4a67-8620-e86869cab2d3 k3s[1552]: time="2022-10-12T10:11:54Z" level=info msg="Starting temporary etcd to reconcile with datastore"
Also something to check: why did k3s restart? Was it part of the upgrade?
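One way to answer that from the node itself, assuming journald kept the previous boot around (the support bundle suggests it did):

# list boots, then look at the k3s unit around the restart window
journalctl --list-boots
journalctl -b -1 -u k3s --since "2022-10-12 10:10" --until "2022-10-12 10:13"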
My assumption about etcd snapshot was wrong, the test fails with it disabled.
@Itxaka Anyway, I should be able to give you access to the runner tomorrow, I just have to prepare it 20 minutes before.
I see this message in the log and I don't understand what it means...
elemental-operator.version:time="2022-10-12T09:50:40Z" level=info msg="Starting controller at namespace fake. Upgrade sync interval at: 1h0m0s"
I don't like the Upgrade sync interval message
The upgrade sync should only affect the OsVersionChannel syncer. The message is a bit wrong indeed, it should say syncer or os version syncer, as that really doesn't have anything to do with the upgrade, other than checking for versions.
Let's do some debug steps first:
- Install Rancher with debug=on
- Install the operator with debug=on

Then let's check when the upgrade was created (timestamp) and wait for it to either fail or succeed (20 minutes later?); then, once that is done, gather all logs, including the extra ones from the rancher pod. Maybe the node was not ready, or there is something Rancher does in the background that prevents the upgrade from happening.
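For the Rancher pod logs, something like this should do (label and namespace are the usual Rancher chart defaults, so an assumption):

# collect the debug-level Rancher logs once the run is finished
kubectl logs -n cattle-system -l app=rancher --tail=-1 > rancher-debug.log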
In any case, this seems to be more of a Rancher issue, as once the upgrade is created it's out of the operator's hands and the Rancher workflow is the one that deals with the upgrade. But we may be missing some stuff on our side? Labels or taints or something like that.
Also we need:
- k8s version
- k3s/rke2 version (UI tests are k3s only, if I'm reading this correctly?)
Also, I'm wondering what elemental-support version you used? It should have tried to pick up the resources from k8s, but those were empty... Maybe it's broken, so it would be helpful to see why. Could you also post the output from elemental-support?
diff --git a/tests/e2e/install_test.go b/tests/e2e/install_test.go
index 29c7c93..50331a5 100644
--- a/tests/e2e/install_test.go
+++ b/tests/e2e/install_test.go
@@ -120,6 +120,7 @@ var _ = Describe("E2E - Install Rancher", Label("install"), func() {
                                "--set", "extraEnv[2].value=https://releases.rancher.com/dashboard/"+uiVersion+"/index.html",
                                "--set", "extraEnv[3].name=CATTLE_UI_OFFLINE_PREFERRED",
                                "--set", "extraEnv[3].value=Remote",
+                               "--set", "debug=true",
                        )
                        Expect(err).To(Not(HaveOccurred()))
 
@@ -142,6 +143,7 @@ var _ = Describe("E2E - Install Rancher", Label("install"), func() {
 
                        err = kubectl.RunHelmBinaryWithCustomErr("install", "elemental-operator", "elemental-operator/elemental-operator",
                                "--namespace", "cattle-elemental-system",
+                               "--set", "debug=true",
                                "--create-namespace",
                        )
                        Expect(err).To(Not(HaveOccurred()))
Maybe we should set k3s to debug as well, to see what is going on during that 20-minute gap?
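If we go that way, a sketch of how it could be done on the node (k3s reads /etc/rancher/k3s/config.yaml, and debug: true there should mirror the --debug flag; restarting the service is assumed to be acceptable for the test):

# enable debug logging for k3s on the Elemental node
mkdir -p /etc/rancher/k3s
echo 'debug: true' >> /etc/rancher/k3s/config.yaml
systemctl restart k3s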
yes thanks, on my way! And yes, UI tests are k3s only, UI dev is working to add RKE.
About the elemental-support question:
rancher-2458:~ # elemental-support 
INFO[0000] Support version v0.6.1, commit 9c1d506fa944fcd4f7afe96db2dd2853f0427b61, commit date 2022-10-11T15:26:10Z 
INFO[0000] Getting elemental-support version            
INFO[0000] Copying /etc/os-release                      
INFO[0000] Copying /etc/resolv.conf                     
INFO[0000] Copying /etc/hostname                        
INFO[0000] Copying /etc/rancher/agent/config.yaml       
INFO[0000] Copying /etc/rancher/agent/config.yaml       
INFO[0000] Copying dir /var/lib/elemental/agent/applied/ 
INFO[0000] Copying dir /var/lib/rancher/agent/applied/  
INFO[0000] Copying dir /system/oem                      
INFO[0000] Copying dir /oem/                            
INFO[0000] Getting service log elemental-system-agent   
INFO[0000] Getting service log rancher-system-agent     
INFO[0000] Getting service log k3s                      
INFO[0000] Getting service log rke2                     
INFO[0000] Getting service log cos-setup-boot           
INFO[0000] Getting service log cos-setup-fs             
INFO[0000] Getting service log cos-setup-initramfs      
INFO[0000] Getting service log cos-setup-network        
INFO[0000] Getting service log cos-setup-reconcile      
INFO[0000] Getting service log cos-setup-rootfs         
INFO[0000] Getting service log cos-immutable-rootfs     
INFO[0000] Getting service log elemental                
INFO[0000] Getting elemental-cli version                
INFO[0000] Getting elemental-operator version           
INFO[0000] Getting elemental-register version           
INFO[0000] Getting k8s info for pods                    
INFO[0000] Getting k8s info for secrets                 
INFO[0000] Getting k8s info for nodes                   
INFO[0000] Getting k8s info for services                
INFO[0000] Getting k8s info for deployments             
INFO[0000] Getting k8s logs for namespace cattle-system 
INFO[0000] Getting k8s logs for namespace kube-system   
INFO[0000] Getting k8s logs for namespace ingress-nginx 
INFO[0000] Creating final file                          
INFO[0000] All done. File created at node-2b643c72-a5c9-47f2-9ed9-b96c2a620486-2022-10-13T11:07:16Z.tar.gz
For information, I did a test with a powerful machine and I still face the issue (16 GB RAM + 8 vCPUs, the KVM host is an n2-standard-16). I'm collecting all logs except the k3s debug one (I forgot that one...).
@Itxaka This is what you asked for: elemental-operator.log rancher-debug.log rancher-system-agent.service.log elemental-system-agent.service.log node-2b643c72-a5c9-47f2-9ed9-b96c2a620486-2022-10-13T11:07:16Z.tar.gz
The upgrade was triggered at 11:01:16 and the Elemental node rebooted at 11:30, right after this message in rancher-system-agent.service:
Oct 13 11:30:22 node-2b643c72-a5c9-47f2-9ed9-b96c2a620486 rancher-system-agent[7820]: time="2022-10-13T11:30:22Z" level=info msg="[Applyinator] Command sh [-c k3s etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
K3s version where Rancher is installed: v1.23.10+k3s1. K3s version for the Elemental node: v1.23.10+k3s1.
I'm giving it a try with the latest k3s.
Umm, on the Rancher side at 11:05:46 you can see a PATCH call to the elemental cluster that sets Ready to false:
{
  "kind": "Cluster",
  "apiVersion": "cluster.x-k8s.io/v1beta1",
  "metadata": {
    "name": "myelementalcluster",
    "namespace": "fleet-default",
    "creationTimestamp": null,
    "labels": {
      "objectset.rio.cattle.io/hash": "b526ebb2a1f8f10b659228d09b2a7a50b5671076"
    },
    "annotations": {
      "objectset.rio.cattle.io/applied": "H4sIAAAAAAAA/5ySQY/TMBCF/wqas1OSlDRdS5zKnvYAqhAXxGFsT7Ymjh3Zk6JVlf+OHC2iu6il2mMm82be+zwnGIjRICPIE6D3gZFt8Cl/BvWTNCfiVbRhpZHZ0cqG99aAhNhTod2UmCKIi63hl6dYPB57kDDGcLTJBm/941nLsRLvHqw3H3c3TvM4EEgYnsjRQJ7R3eojK9OIOss7R8SFoQ4nxzAL0JGW6F/tQIlxGEH6yTkBDhW5q0AOmA4gQTX1hpSqseq2XVWqTXNX11tT3qkaW2xK1Wzaqmw3edu1EJdtCliC7KmjSF5TAvn9BDjabxQz2WuUQYByQfef84RP5IgXAceJBOjgOQbnKP6p9NbnZ969cHXJ8bScRL0uW2ywLdpmTcWHqjKFak1ZmA1t1+uq2TblFuYfs4A0ks5An9d+cejp3psxWM+5fgiJQQIIGENkkOUsXvTuqVvu9Tx47Ol13ucM+4f73Zn4P1muHon1XcTEcdI8xTfYuIHmnPkw8pReE9oTmieQHbpE/1r5+2+efwcAAP//fgTIOtcDAAA",
      "objectset.rio.cattle.io/id": "rke-cluster",
      "objectset.rio.cattle.io/owner-gvk": "provisioning.cattle.io/v1, Kind=Cluster",
      "objectset.rio.cattle.io/owner-name": "myelementalcluster",
      "objectset.rio.cattle.io/owner-namespace": "fleet-default"
    },
    "ownerReferences": [
      {
        "apiVersion": "provisioning.cattle.io/v1",
        "kind": "Cluster",
        "name": "myelementalcluster",
        "uid": "2307a5a7-753e-411d-b7d0-d6e833158508",
        "controller": true,
        "blockOwnerDeletion": true
      }
    ]
  },
  "spec": {
    "controlPlaneEndpoint": {
      "host": "",
      "port": 0
    },
    "controlPlaneRef": {
      "kind": "RKEControlPlane",
      "namespace": "fleet-default",
      "name": "myelementalcluster",
      "apiVersion": "rke.cattle.io/v1"
    },
    "infrastructureRef": {
      "kind": "RKECluster",
      "name": "myelementalcluster",
      "apiVersion": "rke.cattle.io/v1"
    }
  },
  "status": {
    "infrastructureReady": false,
    "controlPlaneReady": false
  }
}
Which leads me to believe that at this point the upgrade has started for sure, or the cluster is down for a different reason.
The k3s.log from elemental-support only has logs up to 11:00? Also, could you get the logs from the system-upgrade-controller? Somehow the support binary doesn't get the pod logs :/
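Something like this, run against the downstream cluster, should grab them (deployment name and namespace are the usual Rancher defaults, so an assumption):

# full logs of the system-upgrade-controller that runs the os-upgrader plans
kubectl logs -n cattle-system deployment/system-upgrade-controller --tail=-1 > system-upgrade-controller.log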
I did a test with the latest k3s version (1.24.4+k3s1) and it passed, but as it's sporadic, I need to run it 3 or 4 times... https://github.com/rancher/elemental/actions/runs/3242250673/jobs/5315262321
The upgrade has been triggered at 11:01:16 and the elemental node rebooted at 11:30
The upgrade trigger happened at the same time as elemental-system-agent got stopped and rancher-system-agent took over. I wouldn't be surprised if this handover is the problem.
What happens if you wait 5 or 10 minutes (until after elemental-system-agent and rancher-system-agent have all their ducks in a row) before you trigger the update?
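In the test that could be approximated by waiting for the handover to settle before creating the ManagedOSImage; a rough sketch (unit name taken from the support bundle, the settle time is arbitrary):

# wait for rancher-system-agent to be active, then give it a few minutes to settle
until systemctl is-active --quiet rancher-system-agent; do sleep 10; done
sleep 300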
OK, it also failed with the latest k3s version (v1.24.4+k3s1): https://github.com/rancher/elemental/actions/runs/3242457916/jobs/5315757980