longhorn [BUG] LH manager reboots due to the webhook is not ready

Describe the bug

The following log was observed on a newly installed Harvester cluster. The LH manager restarts more often than expected because the webhook is not ready, even though the webhook is embedded in the LH manager itself.

$ cat /var/log/pods/longhorn-system_longhorn-manager-bf65b_667f293e-71d3-4492-9bcc-909d33967f00/longhorn-manager/2.log 
2024-02-15T10:37:39.311610684Z stderr F W0215 10:37:39.311432       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-02-15T10:37:39.312405631Z stderr F I0215 10:37:39.312333       1 shared_informer.go:311] Waiting for caches to sync for longhorn datastore
2024-02-15T10:37:39.412915357Z stderr F I0215 10:37:39.412802       1 shared_informer.go:318] Caches are synced for longhorn datastore
2024-02-15T10:37:39.412928944Z stderr F time="2024-02-15T10:37:39Z" level=info msg="Starting longhorn conversion webhook server" func=webhook.StartWebhook file="webhook.go:23"
2024-02-15T10:37:39.412934454Z stderr F time="2024-02-15T10:37:39Z" level=info msg="Waiting for conversion webhook to become ready" func=webhook.StartWebhook file="webhook.go:42"
2024-02-15T10:37:39.413590648Z stderr F time="2024-02-15T10:37:39Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9501/v1/healthz" func=webhook.StartWebhook file="webhook.go:54" error="Get \"https://localhost:9501/v1/healthz\": dial tcp [::1]:9501: connect: connection refused"
...
2024-02-15T10:37:43.441196092Z stderr F E0215 10:37:43.441132       1 leaderelection.go:308] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "longhorn-manager-upgrade-lock": the object has been modified; please apply your changes to the latest version and try again
2024-02-15T10:37:43.441208768Z stderr F time="2024-02-15T10:37:43Z" level=info msg="Upgrade leader lost: harv41" func=upgrade.upgrade.func2 file="upgrade.go:147"
2024-02-15T10:37:43.441273584Z stderr F time="2024-02-15T10:37:43Z" level=fatal msg="Error starting manager: upgrade API version failed: cannot create CRDAPIVersionSetting: Internal error occurred: failed calling webhook \"validator.longhorn.io\": failed to call webhook: Post \"https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/webhook/validaton?timeout=10s\": no endpoints available for service \"longhorn-admission-webhook\"" func=main.main.DaemonCmd.func3 file="daemon.go:92"


$ grep "Starting longhorn" /var/log/pods/longhorn-system_longhorn-manager-bf65b_667f293e-71d3-4492-9bcc-909d33967f00/longhorn-manager/2.log 
2024-02-15T10:37:39.412928944Z stderr F time="2024-02-15T10:37:39Z" level=info msg="Starting longhorn conversion webhook server" func=webhook.StartWebhook file="webhook.go:23"
2024-02-15T10:37:41.415249851Z stderr F time="2024-02-15T10:37:41Z" level=info msg="Starting longhorn admission webhook server" func=webhook.StartWebhook file="webhook.go:23"
2024-02-15T10:37:43.417662108Z stderr F time="2024-02-15T10:37:43Z" level=info msg="Starting longhorn recovery-backend server" func=recovery_backend.StartRecoveryBackend file="recovery_backend.go:13"

To Reproduce

Install a Harvester v1.3.0 cluster

Expected behavior

Avoid these restarts so that the LH service becomes ready quickly.

Support bundle for troubleshooting

Environment

Longhorn version: v1.6.0
Impacted volume (PV):
Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
- Number of control plane nodes in the cluster:
- Number of worker nodes in the cluster:
Node config
- OS type and version:
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
Number of Longhorn volumes in the cluster:

Additional context

Observed when debugging: https://github.com/harvester/harvester/issues/5109

Feb 22 '24 09:02 w13915984028

The root cause is that longhorn-manager now hosts both the conversion-webhook (:9501) and the admission-webhook (:9502). We set the readiness port to 9501. The startup order is as follows:

1. conversion-webhook
2. admission-webhook
3. upgrade
4. ...

So there is a chance that the pod is not ready yet (the readiness check on 9501 has not succeeded) when longhorn-manager starts the upgrade. The call to the webhook through the service endpoint then fails because no longhorn-manager pod is ready:

Post \"https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/webhook/validaton?timeout=10s\"

I will fix it by validating the endpoint works before doing the upgrade.

Feb 26 '24 09:02 ChanYiLin

Pre Ready-For-Testing Checklist

[ ] Where are the reproduction steps/test steps documented? The reproduction steps/test steps are at:
- Freshly install Longhorn and make sure that no longhorn-manager pod restarts because it failed to access the webhook services.

PRs:

https://github.com/longhorn/longhorn-manager/pull/2649

Feb 26 '24 10:02 longhorn-io-github-bot

Verified pass on Longhorn master (longhorn-manager ad7420).

Installed Longhorn master several times and did not observe any longhorn-manager restart caused by the webhook.

May 29 '24 08:05 chriscchien
