New job for GitRepo is created and terminated every 3 seconds
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
We have a Rancher installation (version 2.9.1); this problem likely started after upgrading from 2.8.x. We have 3 GitRepos, but only one of them is experiencing this problem. All point to the same Git repository in Bitbucket, but with different paths. We currently run on EKS 1.28 and plan to upgrade to EKS 1.29 soon.
For one of these GitRepos a job/pod is created roughly every 3 seconds and then terminated (usually), but sometimes they get stuck and we run out of IP addresses in the subnet. The other GitRepos only see new jobs occasionally, or when changes are pushed to the backing Git repository.
The problematic GitRepo also shows this warning/error, which we don't understand:
User "system:serviceaccount:cattle-fleet-system:fleet-controller" cannot create resource "gitjobs" in API group "gitjob.cattle.io" in the namespace ""
Expected Behavior
Pods should not be created every 3 seconds.
Steps To Reproduce
No response
Environment
- Architecture: amd64
- Fleet Version: v0.10.1
- Cluster:
- Provider: EKS
- Options: 4 nodes; the master node running fleet-controller is a c6i.12xlarge to accommodate the number of clusters. Ingress Nginx, AWS LB.
- Kubernetes Version: 1.28
Logs
stream logs failed container "fleet" in pod "<gitjob>-f0e21-g5cjr" is waiting to start: PodInitializing for <namespace>/<gitjob>-f0e21-g5cjr (fleet)
gitcloner-initializer time="2024-09-16T14:03:58Z" level=warning msg="signal received: \"terminated\", canceling context..."
Stream closed EOF for <namespace>/<gitjob>-f0e21-g5cjr (gitcloner-initializer)
Anything else?
We see a lot of logs like this even though no changes are made to the backing Git repo in Bitbucket.
{"level":"info","ts":"2024-09-16T14:07:30Z","logger":"clustergroup-cluster-handler","msg":"Cluster changed, enqueue matching cluster groups","namespace":"<namespace>","name":"cluster-8cf77d5971e8"}
The problem is that the GitRepo keeps the latest commit hash from the backing Git repository, but that hash is wrong and isn't updated correctly. Initially it was blank; after re-creating the GitRepo it worked at first, but then got stuck again soon afterwards.
Since the commit hash is wrong, Rancher Fleet thinks there are changes all the time and keeps triggering updates.
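One way to confirm such a mismatch is to compare the hash Fleet stored on the GitRepo with the actual remote HEAD. A minimal sketch; the GitRepo name, namespace, and repository URL in the usage comment are placeholders, not the reporter's real values:

```shell
# Sketch: compare the commit Fleet recorded on a GitRepo with the remote HEAD.
check_sync() {
  stored="$1"   # e.g. from: kubectl get gitrepo <name> -n <ns> -o jsonpath='{.status.commit}'
  actual="$2"   # e.g. from: git ls-remote <repo-url> HEAD | cut -f1
  if [ "$stored" = "$actual" ] && [ -n "$stored" ]; then
    echo "in sync"
  else
    echo "out of sync (stored=$stored, actual=$actual)"
  fi
}

# Usage (hypothetical names, run against the management cluster):
#   check_sync \
#     "$(kubectl get gitrepo main -n fleet-default -o jsonpath='{.status.commit}')" \
#     "$(git ls-remote https://bitbucket.example.org/scm/proj/repo.git HEAD | cut -f1)"
```

Note the extra non-empty check: a blank stored hash (as initially observed here) is treated as out of sync.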
It looks like there may be an issue with the gitjob pod using an older Fleet image, as per:
User "system:serviceaccount:cattle-fleet-system:fleet-controller" cannot create resource "gitjobs" in API group "gitjob.cattle.io" in the namespace ""
While the gitjob pod still exists as part of Fleet controller deployments, the gitjob resource (CRD) itself has been removed in Fleet 0.10, and is no longer needed to create jobs for GitRepos.
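A quick way to check whether a stale gitjobs CRD is still registered on the cluster (assuming `kubectl` access to the management cluster; the helper name is made up for illustration):

```shell
# Sketch: detect a leftover gitjobs CRD, which Fleet >= 0.10 no longer uses.
# Reads CRD names on stdin, e.g. from: kubectl get crd -o name
has_stale_gitjob_crd() {
  if grep -q 'gitjobs\.gitjob\.cattle\.io'; then
    echo "stale gitjob CRD present"
  else
    echo "no gitjob CRD found (expected on Fleet >= 0.10)"
  fi
}

# Usage:
#   kubectl get crd -o name | has_stale_gitjob_crd
```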
That doesn't explain why this issue would happen for only one GitRepo though... Do the other GitRepos live in the same management cluster as the failing one?
Which fleet container image version is in use in the gitjob pod?
What does `helm list -A` output on the management cluster(s)?
Yes, all GitRepos are in the same cluster and namespace. The fleet-controller is using `rancher/fleet:v0.10.2`, and the gitjob pods use the same version.
```
% helm list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
aws-load-balancer-controller kube-system 4 2024-05-15 15:48:06.58813 +0200 CEST deployed aws-load-balancer-controller-1.7.2 v2.7.2
external-dns utilities 6 2024-09-05 16:20:05.551489313 +0200 CEST deployed external-dns-7.3.2 0.14.1
fleet cattle-fleet-system 20 2024-09-27 10:18:06.896019517 +0000 UTC deployed fleet-104.0.2+up0.10.2 0.10.2
fleet-agent-local cattle-fleet-local-system 2423 2024-09-27 10:22:57.650569942 +0000 UTC deployed fleet-agent-local-v0.0.0+s-766b73b65b86b4bc4c0dffcec2736a376793eda8e9de6434b95f17156588e
fleet-crd cattle-fleet-system 16 2024-09-27 10:18:00.100705846 +0000 UTC deployed fleet-crd-104.0.2+up0.10.2 0.10.2
ingress-nginx utilities 14 2024-05-20 07:42:54.078454 +0200 CEST deployed ingress-nginx-4.10.1 1.10.1
prometheus monitoring 5 2024-09-05 16:23:58.746117659 +0200 CEST deployed prometheus-25.8.2 v2.48.1
rancher cattle-system 10 2024-09-27 12:16:45.760551105 +0200 CEST deployed rancher-2.9.2 v2.9.2
rancher-backup cattle-resources-system 1 2022-06-08 06:32:32.270630933 +0000 UTC deployed rancher-backup-2.1.2 2.1.2
rancher-backup-crd cattle-resources-system 1 2022-06-08 06:32:29.534008213 +0000 UTC deployed rancher-backup-crd-2.1.2 2.1.2
rancher-provisioning-capi cattle-provisioning-capi-system 4 2024-09-11 10:20:47.400371031 +0000 UTC deployed rancher-provisioning-capi-104.0.0+up0.3.0 1.7.3
rancher-webhook cattle-system 13 2024-09-27 10:18:25.028320785 +0000 UTC deployed rancher-webhook-104.0.2+up0.5.2 0.5.2
```
Installed using the Rancher 2.9.2 Helm chart.
We have since upgraded to 2.9.2, but the problem still exists; the main GitRepo shows the wrong Git commit hash.
Cleaning up the backlog; we can't reproduce this.