cluster-api-provider-vsphere
CAPV 1.2.0 causes high number of sessions and 503 errors on vsphere
/kind bug
What steps did you take and what happened:
We have a Cluster API management cluster using CAPV, currently managing 5 clusters and approximately 50 machines. We used to follow CAPV releases and encountered no errors with the 1.0.x and 1.1.x releases.
Since updating CAPV to 1.2.0, the number of open sessions on vSphere jumped from 2 to 20, and our vSphere instance seems to have difficulty handling the new session behavior: some services crash (such as the SSO service or the session service), resulting in 503 errors on CAPV calls:
E0617 14:57:59.360514 1 controller.go:317] controller/vspherecluster "msg"="Reconciler error" "error"="unexpected error while probing vcenter for infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereCluster cluster-testinfra/cluster-testinfra: unable to create tags manager: POST https://sazvmk.sii24.pole-emploi.intra/rest/com/vmware/cis/session: 503 Service Unavailable" "name"="cluster-testinfra" "namespace"= "cluster-testinfra" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="VSphereCluster"
E0617 14:57:59.391434 1 controller.go:317] controller/vspherevm "msg"="Reconciler error" "error"="unable to create tags manager: POST https://sazvmk.sii24.pole-emploi.intra/rest/com/vmware/cis/session: 503 Service Unavailable" "name"="cluster-tdc-worker-md-0-6964f4bdfd-twxkm" "namespace"="cluster-tdc" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="VSphereVM"
What did you expect to happen:
No change in the number of open sessions and no added pressure on vSphere services.
Anything else you would like to add:
It seems the session keep-alive feature has been enabled by default. On v1.1.1 we did not activate it, and in 1.2.0 we are unable to deactivate it to verify whether it is the culprit.
Environment:
- Cluster-api-provider-vsphere version: 1.2.0
- Cluster API version: 1.1.4
- vSphere Client version: 6.7.0.51000
- Kubernetes version:
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:52:18Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release):
NAME="AlmaLinux"
VERSION="9.0 (Emerald Puma)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.0"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.0 (Emerald Puma)"
ANSI_COLOR="0;34"
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
CAPV versions 1.3.x and above set the --enable-keep-alive flag by default, so this should no longer be the case.
For 1.2.x CAPV, you could enable this flag by adding --enable-keep-alive to the CAPV deployment, which should mitigate these errors as well.
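For reference, a minimal sketch of what that change could look like in the controller Deployment manifest. The deployment and namespace names below assume a default install (`capv-controller-manager` in `capv-system`); verify them in your own cluster before patching, and keep any args already present:

```yaml
# Sketch: add --enable-keep-alive to the CAPV controller args
# (assumed default names: Deployment capv-controller-manager,
# namespace capv-system -- check with `kubectl get deploy -A`).
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --leader-elect        # keep existing args as-is
            - --enable-keep-alive   # added: turns on session keep-alive
```

Editing the args in place (e.g. via `kubectl -n capv-system edit deployment capv-controller-manager`) triggers a rollout of the controller pod with the flag applied.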
/lifecycle frozen
/remove-lifecycle stale
/priority awaiting-more-evidence