New job for GitRepo is created and terminated every 3rd second

Open Marza opened this issue 1 year ago • 1 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

We have a Rancher installation (version 2.9.1); this problem likely started after upgrading from 2.8.x. We have 3 GitRepos, but only one of them is experiencing this problem. All point to the same Git repository in Bitbucket but with different paths. We currently run on EKS 1.28 but plan to upgrade to EKS 1.29 soon.

For one of these GitRepos, a job/pod is created roughly every 3 seconds and is then (usually) terminated. Sometimes the pods get stuck, and we run out of IP addresses in the subnet. The other GitRepos only see new jobs occasionally, or when changes are made to the backing Git repository.

The affected GitRepo also shows this warning/error, and we don't understand why:

User "system:serviceaccount:cattle-fleet-system:fleet-controller" cannot create resource "gitjobs" in API group "gitjob.cattle.io" in the namespace ""

Expected Behavior

Pods are not created every 3rd second.

Steps To Reproduce

No response

Environment

- Architecture: amd64
- Fleet Version: v0.10.1
- Cluster:
  - Provider: EKS
  - Options: 4 nodes, master node running fleet-controller is c6i.12xlarge to accommodate the number of clusters. Ingress Nginx, AWS LB.
  - Kubernetes Version: 1.28

Logs

stream logs failed container "fleet" in pod "<gitjob>-f0e21-g5cjr" is waiting to start: PodInitializing for <namespace>/<gitjob>-f0e21-g5cjr (fleet)
gitcloner-initializer time="2024-09-16T14:03:58Z" level=warning msg="signal received: \"terminated\", canceling context..."
Stream closed EOF for <namespace>/<gitjob>-f0e21-g5cjr (gitcloner-initializer)

Anything else?

We see a lot of logs like this even though no changes are made to the backing Git repo in Bitbucket.

{"level":"info","ts":"2024-09-16T14:07:30Z","logger":"clustergroup-cluster-handler","msg":"Cluster changed, enqueue matching cluster groups","namespace":"<namespace>","name":"cluster-8cf77d5971e8"}

Marza avatar Sep 16 '24 14:09 Marza

The problem is that the GitRepo keeps the latest commit hash from the backing Git repository, but that commit hash is wrong and isn't updated correctly. Initially it was blank; after re-creating the GitRepo it worked at first, but got stuck again soon afterwards.

Since the commit hash is wrong, Rancher Fleet thinks there are changes all the time and tries to trigger updates.
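
To illustrate the suspected mechanism, here is a minimal sketch of a polling loop that creates a job whenever the recorded hash differs from the remote one. It is not Fleet's actual code: `GitRepoState`, `reconcile`, `simulate` and the hash `abc123` are hypothetical names chosen for illustration, and the failure is modelled simply as the recorded hash never being updated.

```python
from dataclasses import dataclass


@dataclass
class GitRepoState:
    """Hypothetical, simplified stand-in for the state kept on a GitRepo."""
    recorded_commit: str   # commit hash stored on the GitRepo
    jobs_created: int = 0  # how many jobs/pods have been started so far


def reconcile(state: GitRepoState, remote_commit: str,
              status_update_succeeds: bool = True) -> bool:
    """Run one polling pass; return True if a job was created."""
    if state.recorded_commit == remote_commit:
        return False  # hashes match, nothing to do
    state.jobs_created += 1  # a new job/pod would be created here
    if status_update_succeeds:
        # Assumption: in the healthy case the new hash is recorded.
        state.recorded_commit = remote_commit
    return True


def simulate(passes: int, status_update_succeeds: bool) -> int:
    """Poll an unchanged remote repeatedly and count the jobs created."""
    state = GitRepoState(recorded_commit="")  # blank hash, as first observed
    for _ in range(passes):
        reconcile(state, "abc123", status_update_succeeds)
    return state.jobs_created
```

With a working status update, `simulate(5, True)` creates a single job and then goes quiet. With the hash stuck, `simulate(5, False)` creates one job per pass, which matches the steady stream of pods described above.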

Marza avatar Sep 27 '24 08:09 Marza

It looks like there may be an issue with the gitjob pod using an older Fleet image, as per:

User "system:serviceaccount:cattle-fleet-system:fleet-controller" cannot create resource "gitjobs" in API group "gitjob.cattle.io" in the namespace ""

While the gitjob pod still exists as part of Fleet controller deployments, the gitjob resource (CRD) itself has been removed in Fleet 0.10, and is no longer needed to create jobs for GitRepos.

That doesn't explain why this issue would happen for only one GitRepo, though... Do the other GitRepos live in the same management cluster as the failing one? Which Fleet container image version is in use in the gitjob pod? What does helm list -A output on the management cluster(s)?

weyfonk avatar Oct 07 '24 11:10 weyfonk

Yes, all GitRepos are in the same cluster and namespace. Fleet-controller is using rancher/fleet:v0.10.2, same version for the gitjob pods.

% helm list -A
NAME                        	NAMESPACE                      	REVISION	UPDATED                                 	STATUS  	CHART                                                                                   	APP VERSION
aws-load-balancer-controller	kube-system                    	4       	2024-05-15 15:48:06.58813 +0200 CEST    	deployed	aws-load-balancer-controller-1.7.2                                                      	v2.7.2     
external-dns                	utilities                      	6       	2024-09-05 16:20:05.551489313 +0200 CEST	deployed	external-dns-7.3.2                                                                      	0.14.1     
fleet                       	cattle-fleet-system            	20      	2024-09-27 10:18:06.896019517 +0000 UTC 	deployed	fleet-104.0.2+up0.10.2                                                                  	0.10.2     
fleet-agent-local           	cattle-fleet-local-system      	2423    	2024-09-27 10:22:57.650569942 +0000 UTC 	deployed	fleet-agent-local-v0.0.0+s-766b73b65b86b4bc4c0dffcec2736a376793eda8e9de6434b95f17156588e	           
fleet-crd                   	cattle-fleet-system            	16      	2024-09-27 10:18:00.100705846 +0000 UTC 	deployed	fleet-crd-104.0.2+up0.10.2                                                              	0.10.2     
ingress-nginx               	utilities                      	14      	2024-05-20 07:42:54.078454 +0200 CEST   	deployed	ingress-nginx-4.10.1                                                                    	1.10.1     
prometheus                  	monitoring                     	5       	2024-09-05 16:23:58.746117659 +0200 CEST	deployed	prometheus-25.8.2                                                                       	v2.48.1    
rancher                     	cattle-system                  	10      	2024-09-27 12:16:45.760551105 +0200 CEST	deployed	rancher-2.9.2                                                                           	v2.9.2     
rancher-backup              	cattle-resources-system        	1       	2022-06-08 06:32:32.270630933 +0000 UTC 	deployed	rancher-backup-2.1.2                                                                    	2.1.2      
rancher-backup-crd          	cattle-resources-system        	1       	2022-06-08 06:32:29.534008213 +0000 UTC 	deployed	rancher-backup-crd-2.1.2                                                                	2.1.2      
rancher-provisioning-capi   	cattle-provisioning-capi-system	4       	2024-09-11 10:20:47.400371031 +0000 UTC 	deployed	rancher-provisioning-capi-104.0.0+up0.3.0                                               	1.7.3      
rancher-webhook             	cattle-system                  	13      	2024-09-27 10:18:25.028320785 +0000 UTC 	deployed	rancher-webhook-104.0.2+up0.5.2                                                         	0.5.2      

Installed using Rancher 2.9.2 helm chart.

Marza avatar Oct 07 '24 12:10 Marza

We have upgraded to 2.9.2 since raising this issue, but the problem still exists: the main GitRepo shows the wrong Git commit hash.

Marza avatar Oct 07 '24 12:10 Marza

Cleaning up the backlog, we can't reproduce this.

manno avatar Oct 23 '24 13:10 manno