[SURE-7783] Fleet 0.9.0 having trouble with bitbucket syncing
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
The gitjob controller in the cattle-fleet-system namespace reports the following error for all gitjobs after upgrading Rancher to 2.8.2:
time="2024-02-28T11:00:35Z" level=error msg="Error fetching latest commit: Get "https://xxx/xposer.git/info/refs?service=git-upload-pack": context deadline exceeded"
If I force a resync, the repo becomes active, but after some time it goes back to the failed state with the error above.
Expected Behavior
The repo should sync without any errors.
Steps To Reproduce
No response
Environment
- Architecture:
- Fleet Version: 0.9.0
- Cluster:
- Provider: EKS
- Options:
- Kubernetes Version: 1.26
Logs
No response
Anything else?
No response
This could be related: for a long time we have been getting this log entry on the Bitbucket side: [auth_basic:error] [pid xxxxxx] [client xxx.xxx.xxx.xxx:yyyyy] AH01617: user xxxxxx: authentication failure for "/bitbucket/xxxxx/xxxx/some-repo.git/info/refs": Password Mismatch
This happens quite often and is polluting the server logs considerably.
I was not able to reproduce it. I tried like this:
- Deployed Rancher 2.7.9 on k3s v1.26.10+k3s2
- Deployed private Bitbucket repos both locally and on an RKE2 downstream cluster
- Upgraded later to 2.8.2
No issues found.
OK, after approximately 75 minutes I was able to see some logs similar to the ones described. For a brief moment the UI also displayed the errors, but after a while they were gone.
Pasting the logs found here:
time="2024-03-07T15:18:32Z" level=debug msg="Enqueueing gitjob fleet-local/bitbucket-local in 15 seconds"
time="2024-03-07T15:18:33Z" level=error msg="Error fetching latest commit: Get \"https://bitbucket.org/fleet-test-bitbucket/bitbucket-fleet-test/info/refs?service=git-upload-pack\": context deadline exceeded"
time="2024-03-07T15:18:33Z" level=debug msg="Enqueueing gitjob fleet-default/bit-butcket-local in 15 seconds"
E0307 15:18:44.117346 7 leaderelection.go:327] error retrieving resource lock cattle-fleet-system/gitjob: Get "https://10.43.0.1:443/api/v1/namespaces/cattle-fleet-system/configmaps/gitjob": context deadline exceeded
I0307 15:18:44.117363 7 leaderelection.go:280] failed to renew lease cattle-fleet-system/gitjob: timed out waiting for the condition
W0307 15:18:57.116748 7 reflector.go:456] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0307 15:18:57.116804 7 reflector.go:456] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: watch of *v1.Job ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0307 15:18:57.116807 7 reflector.go:456] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0307 15:18:57.116807 7 reflector.go:456] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: watch of *v1.GitJob ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
E0307 15:18:57.116841 7 leaderelection.go:303] Failed to release lock: Put "https://10.43.0.1:443/api/v1/namespaces/cattle-fleet-system/configmaps/gitjob": http2: client connection lost
time="2024-03-07T15:18:57Z" level=fatal msg="leaderelection lost for gitjob"
We have over 150 gitjobs running on about 15 downstream clusters. This issue arises in just a few minutes and persists.
Thanks @rajeshneo. Just a couple of questions: is the error gone if you force-update the gitjobs? Could you please increase the polling interval per git repo?
It becomes active for about 5 minutes and then returns to the error state with the same message. I also tried changing the polling using the FLEET_CLUSTER_ENQUEUE_DELAY variable, but there was no improvement.
Sorry, but I was not able to reproduce this issue consistently using k3s as the main cluster and RKE2 as the downstream one. I was able to see the UI error message after disconnecting and reconnecting the clusters, but there were no logs pointing to an actual disconnection from the repos, and the errors were gone either when forcing an update or when extending the pollingInterval to over 45 seconds.
@mmartin24 Not sure if there are other users as well who are facing this. for now, I have downgraded my fleet to 0.8.0 version and everything is healthy again.
To my knowledge, this is the only ticket so far. I can see it has an internal Jira issue and it has been already queued to be addressed.
Maybe this timeout is too small? https://github.com/rancher/fleet/blob/main/pkg/git/lsremote.go#L140C1-L140C31 Let's make it configurable.
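For reference, here is a minimal sketch of what a fixed-timeout ls-remote-style call looks like and how the timeout could be read from configuration instead. The `GIT_CLIENT_TIMEOUT` variable and the function names are illustrative only, not Fleet's actual implementation:

```go
package git

import (
	"context"
	"io"
	"net/http"
	"os"
	"time"
)

// defaultTimeout mirrors the hard-coded 30s value referenced above.
const defaultTimeout = 30 * time.Second

// clientTimeout returns the timeout for the info/refs request, falling back
// to the default when the (hypothetical) GIT_CLIENT_TIMEOUT variable is unset
// or invalid.
func clientTimeout() time.Duration {
	if raw := os.Getenv("GIT_CLIENT_TIMEOUT"); raw != "" {
		if d, err := time.ParseDuration(raw); err == nil && d > 0 {
			return d
		}
	}
	return defaultTimeout
}

// fetchInfoRefs performs the ls-remote-style request that fails with
// "context deadline exceeded" when the server does not answer in time.
func fetchInfoRefs(repoURL string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), clientTimeout())
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		repoURL+"/info/refs?service=git-upload-pack", nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err // surfaces as "context deadline exceeded" on timeout
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}
```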
Related:
- https://github.com/rancher/fleet/issues/2224
Seeing the same thing here with git (GitLab cloud) on 2.8.2, so it's not specific to Bitbucket. We have another cluster on 2.8.3 that we will evaluate shortly.
Running RKE2 1.28.8 and Rancher 2.8.3. This problem started appearing with Rancher 2.8.2, I believe. The Fleet version is fleet:103.1.4+up0.9.4.
The pollingInterval on most of our Fleet repos has been set to 5 minutes from the beginning; it does not help avoid this problem.
A configurable Timeout might actually help here.
We've made the git client timeout configurable. You can find the value in the values.yaml file of the fleet chart.
We need to figure out if an empty value in the config map shows the same behavior as in older versions.
I don't think it behaves the same as in previous versions. However, I'm also not sure how to get there, since any helm install or helm upgrade would completely replace the ConfigMap, and we have a default value of 30s in the chart's values.yaml.
That said, having 0 as the value for gitClientTimeout appears to be dangerous, which is why I've added a PR to set it to 30s instead.
It shouldn't be possible to end up with an empty value unintentionally, since an upgrade of Fleet that includes the functionality added in this issue would also update the ConfigMap used to configure this behavior. However, if the ConfigMap is edited directly and the value removed, it will be empty and treated as 0. The same is true if the value is set to 0 directly (either in the values.yaml of the Fleet Helm chart or in the ConfigMap). A value of 0 can cause issues, as it effectively disables the timeout and makes the client wait forever. To remedy this, another PR was created which sets the timeout to 30s for every zero value it finds (an empty value is treated as zero).
The git client timeout is the amount of time to wait for a response from the server before canceling the request. It is used when retrieving the latest commit of the configured git repositories. A zero or missing value is treated as 30s.
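As a rough illustration of that fallback behavior, here is a minimal sketch assuming the value arrives from the ConfigMap as a duration string; the function and constant names are hypothetical, not Fleet's actual code:

```go
package config

import "time"

const defaultGitClientTimeout = 30 * time.Second

// parseGitClientTimeout turns the raw ConfigMap value into a usable timeout.
// An empty string and any zero, negative, or unparsable duration fall back to
// the 30s default so the git client never waits forever.
func parseGitClientTimeout(raw string) time.Duration {
	if raw == "" {
		return defaultGitClientTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultGitClientTimeout
	}
	return d
}
```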