[BUG] LH manager restarts because the webhook is not ready
Describe the bug
The log below was observed on a newly installed Harvester cluster. The LH manager restarts extra times because the webhook is not ready, even though the webhook is embedded in the LH manager itself.
$ cat /var/log/pods/longhorn-system_longhorn-manager-bf65b_667f293e-71d3-4492-9bcc-909d33967f00/longhorn-manager/2.log
2024-02-15T10:37:39.311610684Z stderr F W0215 10:37:39.311432 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2024-02-15T10:37:39.312405631Z stderr F I0215 10:37:39.312333 1 shared_informer.go:311] Waiting for caches to sync for longhorn datastore
2024-02-15T10:37:39.412915357Z stderr F I0215 10:37:39.412802 1 shared_informer.go:318] Caches are synced for longhorn datastore
2024-02-15T10:37:39.412928944Z stderr F time="2024-02-15T10:37:39Z" level=info msg="Starting longhorn conversion webhook server" func=webhook.StartWebhook file="webhook.go:23"
2024-02-15T10:37:39.412934454Z stderr F time="2024-02-15T10:37:39Z" level=info msg="Waiting for conversion webhook to become ready" func=webhook.StartWebhook file="webhook.go:42"
2024-02-15T10:37:39.413590648Z stderr F time="2024-02-15T10:37:39Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9501/v1/healthz" func=webhook.StartWebhook file="webhook.go:54" error="Get \"https://localhost:9501/v1/healthz\": dial tcp [::1]:9501: connect: connection refused"
...
2024-02-15T10:37:43.441196092Z stderr F E0215 10:37:43.441132 1 leaderelection.go:308] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "longhorn-manager-upgrade-lock": the object has been modified; please apply your changes to the latest version and try again
2024-02-15T10:37:43.441208768Z stderr F time="2024-02-15T10:37:43Z" level=info msg="Upgrade leader lost: harv41" func=upgrade.upgrade.func2 file="upgrade.go:147"
2024-02-15T10:37:43.441273584Z stderr F time="2024-02-15T10:37:43Z" level=fatal msg="Error starting manager: upgrade API version failed: cannot create CRDAPIVersionSetting: Internal error occurred: failed calling webhook \"validator.longhorn.io\": failed to call webhook: Post \"https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/webhook/validaton?timeout=10s\": no endpoints available for service \"longhorn-admission-webhook\"" func=main.main.DaemonCmd.func3 file="daemon.go:92"
$ grep "Starting longhorn" /var/log/pods/longhorn-system_longhorn-manager-bf65b_667f293e-71d3-4492-9bcc-909d33967f00/longhorn-manager/2.log
2024-02-15T10:37:39.412928944Z stderr F time="2024-02-15T10:37:39Z" level=info msg="Starting longhorn conversion webhook server" func=webhook.StartWebhook file="webhook.go:23"
2024-02-15T10:37:41.415249851Z stderr F time="2024-02-15T10:37:41Z" level=info msg="Starting longhorn admission webhook server" func=webhook.StartWebhook file="webhook.go:23"
2024-02-15T10:37:43.417662108Z stderr F time="2024-02-15T10:37:43Z" level=info msg="Starting longhorn recovery-backend server" func=recovery_backend.StartRecoveryBackend file="recovery_backend.go:13"
To Reproduce
- Install a Harvester v1.3.0 cluster
Expected behavior
Avoid these restarts so that the LH service becomes ready quickly.
Support bundle for troubleshooting
Environment
- Longhorn version: v1.6.0
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
- Number of control plane nodes in the cluster:
- Number of worker nodes in the cluster:
- Node config
- OS type and version:
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Additional context
Observed when debugging: https://github.com/harvester/harvester/issues/5109
The root cause is that longhorn-manager now hosts both the conversion-webhook (:9501) and the admission-webhook (:9502).
We set the readiness port to 9501.
The startup order is as follows:
- conversion-webhook
- admission-webhook
- upgrade
- .....
So there is a chance that the pod is not ready yet (the readiness check on 9501 hasn't succeeded) when longhorn-manager starts the upgrade.
The call to the webhook through the service endpoint then fails because no longhorn-manager pod is ready:
Post \"https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/webhook/validaton?timeout=10s\"
I will fix it by validating that the endpoint works before doing the upgrade.
Pre Ready-For-Testing Checklist
- [ ] Where are the reproduce steps/test steps documented?
The reproduce steps/test steps are at:
- Freshly install Longhorn and make sure that no longhorn-manager restarts because it failed to access the webhook services.
PRs:
- https://github.com/longhorn/longhorn-manager/pull/2649
Verified pass on longhorn master (longhorn-manager ad7420).
Installed longhorn master several times and did not observe any longhorn-manager restart caused by the webhook.