
fleet-agent-register crashes, possibly due to missing leases

Open marthydavid opened this issue 9 months ago • 19 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

Fleet 0.12.0: fleet-agent-0 is stuck in Init:CrashLoopBackOff in the fleet-agent-register init container

$ kubectl get pods -n cattle-fleet-local-system
NAME            READY   STATUS                  RESTARTS        AGE
fleet-agent-0   0/2     Init:CrashLoopBackOff   5 (2m29s ago)   5m33s

Expected Behavior

No response

Steps To Reproduce

Upgrade Rancher from 2.10.3 to 2.11.0

Environment

- Architecture: arm64
- Fleet Version: 0.12.0
- Cluster:
  - Provider: k3s
  - Options: Deployed by rancher-2.11.0
  - Kubernetes Version: v1.31.3+k3s1

Logs

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2042b7c]
goroutine 1 [running]:
github.com/rancher/fleet/internal/cmd/agent.(*FleetAgent).Run(0x4000b2fa40, 0x4000b42008, {0x0?, 0x0?, 0x0?})
	/home/runner/_work/fleet/fleet/internal/cmd/agent/root.go:123 +0x33c
github.com/rancher/fleet/internal/cmd.Command.bind.func4(0x4000b42008, {0x40003c18b0, 0x1, 0x1})
	/home/runner/_work/fleet/fleet/internal/cmd/builder.go:272 +0xf8
github.com/spf13/cobra.(*Command).execute(0x4000b42008, {0x400004c0b0, 0x1, 0x1})
	/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1015 +0x82c
github.com/spf13/cobra.(*Command).ExecuteC(0x4000b42008)
	/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1148 +0x350
github.com/spf13/cobra.(*Command).Execute(...)
	/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1071
github.com/spf13/cobra.(*Command).ExecuteContext(0x4000078738?, {0x2d804e0?, 0x4000b2f9f0?})
	/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1064 +0x48
main.main()
	/home/runner/_work/fleet/fleet/cmd/fleetagent/main.go:16 +0x34
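
For reference, a trace like the one above can be pulled from the failing init container with a command along these lines (the container name is taken from the issue title and may differ in your deployment):

$ kubectl logs fleet-agent-0 -n cattle-fleet-local-system -c fleet-agent-register --previous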

Anything else?

No response

marthydavid avatar Apr 03 '25 05:04 marthydavid

I have the same issue after upgrading Rancher from 2.10.3 to 2.11.0, on amd64 with v1.31.7+rke2r1.

berkerol avatar Apr 03 '25 11:04 berkerol

We had a similar problem.

In our case, the gitjob container also didn't start (or was crash-looping). The logs pointed us in the direction that some GitRepo resources had become faulty after the update.

In the status of some of them there was

perClusterState:
- {}

I've changed them all to

perClusterState: {}

Important: to update the status, you need to use the --subresource=status flag.

After that, everything worked fine again for us. It might be a completely different problem on your end, but maybe this helps.
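
As a concrete way to make that edit, recent kubectl versions can open the status subresource directly (the resource name and namespace here are placeholders; I believe the --subresource flag needs kubectl v1.24 or newer):

kubectl edit gitrepo my-gitrepo -n fleet-default --subresource=status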

BiggA94 avatar Apr 03 '25 13:04 BiggA94

We also have a same issue since the upgrade.

amaxi avatar Apr 03 '25 13:04 amaxi

> We had a similar problem. [...] I've changed them all to perClusterState: {} [...] (see BiggA94's comment above)

Worked for me as well. For those wondering how to fix it: check for faulty resources:

for i in $(kubectl get GitRepo -n fleet-default --no-headers | awk '{ print $1 }'); do kubectl get GitRepo -n fleet-default $i -o yaml | grep -A2 perClusterState && echo $i; done

If you see it as

perClusterState:
 - {}

get the resource with kubectl get GitRepo -n fleet-default [resource] -o yaml > cr.yaml, edit the status, and:

  1. fix perClusterState to perClusterState: {}
  2. add perClusterState: {} to every other resource entry in the status.

Then apply it: kubectl replace --subresource status -f cr.yaml
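
After replacing everything, a quick sanity check that no resource still carries the list form could look like this (same idea as the loop above, just inverted; adjust the namespace if your GitRepos live elsewhere):

for i in $(kubectl get GitRepo -n fleet-default --no-headers | awk '{ print $1 }'); do kubectl get GitRepo -n fleet-default $i -o yaml | grep -A1 perClusterState | grep -q -- '- {}' && echo "still broken: $i"; done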

xhejtman avatar Apr 05 '25 10:04 xhejtman

I applied this solution (perClusterState), but afterwards I am getting the following error in one of the fleet bundles:

ErrApplied(1) [Cluster fleet-local/local: content does not match hash got .... expected ....

Even though I do "Force Update" in the UI or delete all fleet pods, it doesn't disappear. All the pods from fleet-infra jobs fail with context cancelled, and all the jobs fail after processing the same bundle. If someone has ideas on how to solve this, it would be appreciated. I think I can't delete the gitrepo, because all the bundles and Helm releases would be deleted too.

berkerol avatar Apr 08 '25 08:04 berkerol

I can confirm that this fixed our issues as well. Great catch!

tzalistar avatar Apr 09 '25 06:04 tzalistar

We have the same issue, but the fix does not work for us as we do not have any resources of type GitRepo

error: the server doesn't have a resource type "GitRepo"

Also, I couldn't find any gitjob container. We upgraded from Rancher 2.9 to 2.11.0 and are running RKE2 v1.31.6+rke2r1.

rikvb avatar Apr 09 '25 13:04 rikvb

It also seems that this version broke repos that contained inlined Helm charts, i.e. Chart.yaml, templates and values.yaml files are directly in the repo. It worked before.

xhejtman avatar Apr 09 '25 14:04 xhejtman

We have also faced the same issue after the upgrade from 2.10.3 to 2.11.0.

gersangreal avatar Apr 10 '25 11:04 gersangreal

> It also seems that this version broke repos that contained inlined Helm charts [...]

This can be fixed by adding chart: . to the helm section of fleet.yaml.
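
For illustration, a minimal fleet.yaml with that workaround might look like this (written as a heredoc so it can be pasted as a command; everything beyond the helm.chart line will of course depend on your repo):

cat > fleet.yaml <<'EOF'
# Chart.yaml, templates/ and values.yaml sit next to this file,
# so point the helm section at the current directory explicitly.
helm:
  chart: .
EOF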

xhejtman avatar Apr 10 '25 11:04 xhejtman

> It also seems that this version broke repos that contained inlined Helm charts [...]

> This can be fixed by adding chart: . to the helm section.

This would be a different issue. Can you create one and provide more information about your fleet.yaml and repo layout? I was under the impression that https://github.com/rancher/fleet-examples/tree/master/single-cluster/helm still works.

manno avatar Apr 10 '25 12:04 manno

> This would be a different issue. Can you create one and provide more information about your fleet.yaml and repo layout? [...]

Here is a sample repo that now needs chart: . in its fleet.yaml:

tesk-nauth.zip

xhejtman avatar Apr 10 '25 13:04 xhejtman

This issue now discusses three different bugs:

  • missing leader election options lead to a panic in the agent: https://github.com/rancher/fleet/pull/3534
  • we may have accidentally broken the API. GitRepo.Status.Resources.PerClusterState changed from list to map 😱
  • issue with chart: . in fleet.yaml, maybe when used together with target customization, cannot reproduce yet: https://github.com/manno/fleet-experiments/blob/main/internal-chart/fleet.yaml

Let's keep this issue about the agent panic, please.

manno avatar Apr 10 '25 13:04 manno

Was the upgrade done with --reuse-values? Normally the environment variables should be set in the fleet-agent deployment. Force redeploying the agent should also work if the values are updated: https://fleet.rancher.io/troubleshooting#agent-is-no-longer-registered

When upgrading manually with --reuse-values, --set can be added.
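
For example, something along these lines (the release and chart names mirror a typical Rancher install, and some.key=some-value is just a placeholder for whichever value needs updating):

helm upgrade rancher rancher-latest/rancher -n cattle-system --reuse-values \
  --set some.key=some-value --version 2.11.0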

manno avatar Apr 10 '25 15:04 manno

Was the upgrade done with --reuse-values? Normally the environment variables should be set in the fleet-agent deployment. Force redeploying the agent should also work, if the values are updated: https://fleet.rancher.io/troubleshooting#agent-is-no-longer-registered

When upgrading manually with --reuse-values, --set can be added.

If you meant the whole Rancher upgrade, this is my upgrade command:

 helm upgrade rancher rancher-latest/rancher --namespace cattle-system \
   --set hostname=rancher.cloud.e-infra.cz \
   --set ingress.tls.source=secret \
   --set ingress.tls.secretName=rancher-cloud-e-infra-cz \
   --set bootstrapPassword=xxx \
   --set ingress.extraAnnotations.'kubernetes\.io/ingress\.class'=nginx \
   --set ingress.extraAnnotations.'kubernetes\.io/tls-acme'=\"true\" \
   --set ingress.extraAnnotations.'cert-manager\.io/cluster-issuer'=letsencrypt-prod \
   --set customLogos.enabled=true \
   --set customLogos.volumeName=pvc-rancher \
   --set antiAffinity=preferred \
   --set auditLog.level=3 \
   --set global.cattle.psp.enabled=false \
   --version 2.11.0 \
   --set 'proxy=http://proxy.ics.muni.cz:3128' \
   --set 'noProxy=127.0.0.0/8\,10.0.0.0/8\,cattle-system.svc\,.svc\,.cluster.local\,147.251.0.0/16\,2001:718:801::/48\,10.43.0.1\,10.43.0.0/16'

xhejtman avatar Apr 10 '25 15:04 xhejtman

I am not using --reuse-values in my helm upgrade command either, and I have already tried force-redeploying the agent. The kubectl patch command returns no change, so it doesn't fix it.

berkerol avatar Apr 10 '25 15:04 berkerol

I've tested with 106.0.1+up0.12.1-rc.1 and it looks good to me right now.

Should I close the issue now or wait for rancher-2.11.1?

marthydavid avatar Apr 12 '25 12:04 marthydavid

@marthydavid How do you specify the fleet chart version while installing Rancher? I tried --set fleet.chartVersion=106.0.1+up0.12.1-rc.1 and --set fleet.chartVersion=0.12.1-rc.1, and also fleet.version, but it didn't work.

berkerol avatar Apr 14 '25 10:04 berkerol

I did it manually: I bumped the gitrepo to dev-2.11 in the Applications tab of the Rancher UI.

Then I waited for the repo to sync and upgraded the fleet charts by hand.

marthydavid avatar Apr 14 '25 12:04 marthydavid

The issues mentioned here which could be reproduced have been fixed. Closing.

weyfonk avatar Apr 30 '25 14:04 weyfonk

I can confirm 2.11.1 fixed my issues

marthydavid avatar Apr 30 '25 16:04 marthydavid