fleet-agent-register crashes, possibly due to missing leases
Is there an existing issue for this?
- [x] I have searched the existing issues
Current Behavior
With Fleet 0.12.0, fleet-agent-0 is stuck in Init:CrashLoopBackOff within the fleet-agent-register init container.
$ kubectl get pods -n cattle-fleet-local-system
NAME READY STATUS RESTARTS AGE
fleet-agent-0 0/2 Init:CrashLoopBackOff 5 (2m29s ago) 5m33s
Expected Behavior
No response
Steps To Reproduce
Upgrade Rancher from 2.10.3 to 2.11.0.
Environment
- Architecture: arm64
- Fleet Version: 0.12.0
- Cluster:
- Provider: k3s
- Options: Deployed by rancher-2.11.0
- Kubernetes Version: v1.31.3+k3s1
Logs
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2042b7c]
goroutine 1 [running]:
github.com/rancher/fleet/internal/cmd/agent.(*FleetAgent).Run(0x4000b2fa40, 0x4000b42008, {0x0?, 0x0?, 0x0?})
/home/runner/_work/fleet/fleet/internal/cmd/agent/root.go:123 +0x33c
github.com/rancher/fleet/internal/cmd.Command.bind.func4(0x4000b42008, {0x40003c18b0, 0x1, 0x1})
/home/runner/_work/fleet/fleet/internal/cmd/builder.go:272 +0xf8
github.com/spf13/cobra.(*Command).execute(0x4000b42008, {0x400004c0b0, 0x1, 0x1})
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1015 +0x82c
github.com/spf13/cobra.(*Command).ExecuteC(0x4000b42008)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1148 +0x350
github.com/spf13/cobra.(*Command).Execute(...)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1071
github.com/spf13/cobra.(*Command).ExecuteContext(0x4000078738?, {0x2d804e0?, 0x4000b2f9f0?})
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1064 +0x48
main.main()
/home/runner/_work/fleet/fleet/cmd/fleetagent/main.go:16 +0x34
Anything else?
No response
I have the same issue after upgrading Rancher from 2.10.3 to 2.11.0, on amd64 with v1.31.7+rke2r1.
We had a similar problem.
In our case, the gitjob container also didn't start (it crash-looped). The logs pointed us in the direction that some GitRepo resources were faulty after the update.
In the status of some of them there was
perClusterState:
- {}
I've changed them all to
perClusterState: {}
important: to update the status, you need to use the --subresource=status flag.
After that, everything worked fine again for us. It might be a completely different problem at your end, but maybe this helps.
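For reference, a minimal sketch of how a single GitRepo can be inspected and fixed by hand, assuming a recent kubectl that supports --subresource on edit; the resource name is a placeholder:
# Minimal sketch: inspect and fix one GitRepo status in place.
# Assumes kubectl >= 1.27 for --subresource on edit; "my-gitrepo" is a placeholder name.
kubectl get gitrepo -n fleet-default
kubectl edit gitrepo my-gitrepo -n fleet-default --subresource=status
# In the editor, change every occurrence of
#   perClusterState:
#   - {}
# into
#   perClusterState: {}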
We also have the same issue since the upgrade.
The perClusterState fix worked for me as well. For those struggling with how to apply it, check for faulty resources:
for i in $(kubectl get GitRepo -n fleet-default --no-headers | awk '{ print $1 }'); do kubectl get GitRepo -n fleet-default "$i" -o yaml | grep -A2 perClusterState && echo "$i"; done
if you see it as
perClusterState:
- {}
get the resource: kubectl get GitRepo -n fleet-default [resource] -o yaml > cr.yaml, edit the status and:
- fix perClusterState: - {} to perClusterState: {}
- ADD perClusterState: {} to each of the other status resource entries
apply it: kubectl replace --subresource=status -f cr.yaml
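If many GitRepos are affected, a scripted variant of the same get/fix/replace workflow could look like the sketch below. It assumes mikefarah's yq v4 is installed and that the faulty field sits at .status.resources[].perClusterState, as described above; verify on a single resource before looping over all of them.
# Hedged sketch: apply the fix above to every GitRepo in fleet-default.
# Assumes yq v4 and the .status.resources[].perClusterState layout.
for repo in $(kubectl get GitRepo -n fleet-default --no-headers | awk '{ print $1 }'); do
  kubectl get GitRepo -n fleet-default "$repo" -o yaml > cr.yaml
  yq -i '.status.resources[].perClusterState = {}' cr.yaml   # force the map form on every entry
  kubectl replace --subresource=status -f cr.yaml
done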
I applied this solution (perClusterState), but afterwards I am getting the following error in one of the fleet bundles:
ErrApplied(1) [Cluster fleet-local/local: content does not match hash got .... expected ....
Even though I do "Force Update" in the UI or delete all fleet pods, it doesn't disappear. All the pods from the fleet-infra jobs fail with context cancelled, and all the jobs fail after processing the same bundle. If someone has ideas on how to solve this, it would be appreciated. I think I can't delete the gitrepo because all the bundles and helm releases would be deleted too.
I can confirm that this fixed our issues as well. Great catch!
We have the same issue, but the fix does not work for us as we do not have any resources of type GitRepo
error: the server doesn't have a resource type "GitRepo"
Also, I couldn't find any gitjob container. We upgraded from rancher 2.9 to 2.11.0 and are running RKE2 v1.31.6+rke2r1
It also seems that this version broke repos that contained inlined helm packages, i.e. where Chart.yaml, templates, and values.yaml are directly in the repo. It worked before.
We have also faced the same issue after the upgrade from 2.10.3 to 2.11.0.
Regarding the broken inlined helm packages: this can be fixed by adding chart: . to the helm section of fleet.yaml.
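A minimal sketch of what that could look like for a repo whose chart files sit in the repo root; everything except the chart: . line is a hypothetical example, not taken from this thread:
# Hedged sketch of a fleet.yaml for an inlined chart (Chart.yaml, templates/,
# values.yaml in the repo root). Only "chart: ." comes from this thread; the
# rest is a hypothetical example.
cat > fleet.yaml <<'EOF'
defaultNamespace: example-app   # hypothetical namespace
helm:
  chart: .                      # point Fleet at the chart inlined in the repo root
EOF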
The inlined chart breakage would be a different issue. Can you create one and provide more information about your fleet.yaml and repo layout? I was under the impression that https://github.com/rancher/fleet-examples/tree/master/single-cluster/helm still works.
This is a sample repo that now needs chart: .
This issue now discusses three different bugs:
- missing leader election options lead to a panic in the agent: https://github.com/rancher/fleet/pull/3534
- we may have accidentally broken the API: GitRepo.Status.Resources.PerClusterState changed from list to map 😱
- issue with chart: . in fleet.yaml, maybe when used together with target customization, cannot reproduce yet: https://github.com/manno/fleet-experiments/blob/main/internal-chart/fleet.yaml
Let's keep this issue about the agent panic, please.
Was the upgrade done with --reuse-values? Normally the environment variables should be set in the fleet-agent deployment.
Force redeploying the agent should also work, if the values are updated: https://fleet.rancher.io/troubleshooting#agent-is-no-longer-registered
When upgrading manually with --reuse-values, --set can be added.
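A hedged sketch of what such a manual upgrade could look like; the release name, chart reference, namespace and the leaderElection value keys are assumptions based on the fleet chart's values.yaml, not confirmed in this thread:
# Hedged sketch: manual fleet chart upgrade that keeps existing values but
# explicitly sets the leader election options the agent expects.
# Release name, chart reference, namespace and value keys are assumptions.
helm upgrade fleet rancher-charts/fleet -n cattle-fleet-system \
  --reuse-values \
  --set leaderElection.leaseDuration=30s \
  --set leaderElection.renewDeadline=25s \
  --set leaderElection.retryPeriod=10s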
If you meant the whole Rancher upgrade, this is my upgrade command:
helm upgrade rancher rancher-latest/rancher --namespace cattle-system \
  --set hostname=rancher.cloud.e-infra.cz \
  --set ingress.tls.source=secret \
  --set ingress.tls.secretName=rancher-cloud-e-infra-cz \
  --set bootstrapPassword=xxx \
  --set ingress.extraAnnotations.'kubernetes\.io/ingress\.class'=nginx \
  --set ingress.extraAnnotations.'kubernetes\.io/tls-acme'=\"true\" \
  --set ingress.extraAnnotations.'cert-manager\.io/cluster-issuer'=letsencrypt-prod \
  --set customLogos.enabled=true \
  --set customLogos.volumeName=pvc-rancher \
  --set antiAffinity=preferred \
  --set auditLog.level=3 \
  --set global.cattle.psp.enabled=false \
  --version 2.11.0 \
  --set 'proxy=http://proxy.ics.muni.cz:3128' \
  --set 'noProxy=127.0.0.0/8\,10.0.0.0/8\,cattle-system.svc\,.svc\,.cluster.local\,147.251.0.0/16\,2001:718:801::/48\,10.43.0.1\,10.43.0.0/16'
I am not using --reuse-values in my helm upgrade command either, and I already tried force redeploying the agent; the kubectl patch command returns no change (so it doesn't fix anything).
I've tested with 106.0.1+up0.12.1-rc.1 and it looks good to me right now.
Should I close the issue now or wait for rancher-2.11.1?
@marthydavid How do you specify the fleet chart version while installing Rancher? I tried --set fleet.chartVersion=106.0.1+up0.12.1-rc.1 and --set fleet.chartVersion=0.12.1-rc.1 and also with fleet.version, but it didn't work.
I did it manually: bumped the gitrepo to dev-2.11 in the Applications tab in the Rancher UI, waited for the repo to sync, and upgraded the fleet charts by hand.
The issues mentioned here which could be reproduced have been fixed. Closing.
I can confirm 2.11.1 fixed my issues