`source-controller` appears to be hanging when checking a git repo over ssh

Open maargenton opened this issue 2 years ago • 5 comments

I have the source-controller configured to watch a single git repo over ssh, with an interval of 1 minute and no explicit timeout (should default to 60s). After a little while (about 10 minutes since reboot in my latest case), the source-controller stops checking the repo, stops logging anything (log level raised to debug to investigate), and never recovers from that state.

The kustomize-controller, configured to reconcile every 10 minutes, keeps working and logging properly, but never sees any update after that point.

$ flux version
flux: v2.0.0-rc.5
helm-controller: v0.34.1
kustomize-controller: v1.0.0-rc.4
notification-controller: v1.0.0-rc.4
source-controller: v1.0.0-rc.5

from http://...:8080/metrics:

# HELP workqueue_unfinished_work_seconds How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
# TYPE workqueue_unfinished_work_seconds gauge
workqueue_unfinished_work_seconds{name="bucket"} 0
workqueue_unfinished_work_seconds{name="gitrepository"} 4992.46690147
workqueue_unfinished_work_seconds{name="helmchart"} 0
workqueue_unfinished_work_seconds{name="helmrepository"} 0
workqueue_unfinished_work_seconds{name="ocirepository"} 0
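
For anyone wanting to check for the same symptom, here is a minimal Python sketch that scans metrics text like the above for work queues with long-running unfinished work. The function name `stuck_workqueues` and the 300-second threshold are illustrative assumptions, not part of Flux, and the parser assumes `name` is the only label on the metric, as in the output shown here.

```python
def stuck_workqueues(metrics_text, threshold_seconds=300.0):
    """Return {queue_name: seconds} for queues whose unfinished work
    exceeds threshold_seconds (an arbitrary, illustrative cutoff)."""
    # Assumes `name` is the only label, as in the metrics output above.
    prefix = 'workqueue_unfinished_work_seconds{name="'
    stuck = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skips HELP/TYPE comment lines and unrelated metrics
        name, _, rest = line[len(prefix):].partition('"}')
        fields = rest.split()
        if not fields:
            continue  # malformed line without a value
        seconds = float(fields[0])  # first field after the labels is the value
        if seconds > threshold_seconds:
            stuck[name] = seconds
    return stuck


# Sample taken from the metrics reported in this issue.
SAMPLE = """\
# TYPE workqueue_unfinished_work_seconds gauge
workqueue_unfinished_work_seconds{name="bucket"} 0
workqueue_unfinished_work_seconds{name="gitrepository"} 4992.46690147
workqueue_unfinished_work_seconds{name="helmchart"} 0
workqueue_unfinished_work_seconds{name="helmrepository"} 0
workqueue_unfinished_work_seconds{name="ocirepository"} 0
"""

if __name__ == "__main__":
    for queue, seconds in stuck_workqueues(SAMPLE).items():
        print(f"{queue}: {seconds:.0f}s of unfinished work")
```

With the values above, only the `gitrepository` queue is flagged, which matches the observation that the reconciliation has been stuck for well over an hour.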

Additional context:

  • This is running on an Intel MacBook Pro, using vagrant to run an Ubuntu 22.04 VM, itself running a single-node k3s Kubernetes that I use for development and experimentation.
  • I initially thought this could be caused by a clock synchronization issue when the host goes to sleep, but in the latest instance, no sleep has occurred since `vagrant up`.
  • The repository is a private repo on GitHub. Synchronizations and reconciliations are working fine most of the time.
  • I experienced some upstream connectivity issues earlier today, with some instances of failed reconciliations and timeouts. This could be related, with some code paths handling connectivity issues hanging instead of timing out. I had this setup (these versions) running for a couple of weeks, and it was working properly until recently, as far as I can tell.

I'll be happy to provide any further details if needed. Please let me know how I can help resolve this issue.

Thanks

maargenton, Jul 05 '23 07:07