cluster-api-provider-vsphere
CAPV 1.2.0 causes high number of sessions and 503 errors on vsphere
/kind bug
What steps did you take and what happened:
We have a Cluster API management cluster using CAPV, currently managing 5 clusters and approximately 50 machines. We used to follow CAPV releases and encountered no errors with the 1.0.x and 1.1.x releases.
Since updating CAPV to 1.2.0, the number of open sessions on vSphere jumped from 2 to 20, and our vSphere instance seems to have difficulty handling the new session behavior: some services crash (such as the SSO service or the session service), resulting in 503 errors on CAPV calls:
E0617 14:57:59.360514 1 controller.go:317] controller/vspherecluster "msg"="Reconciler error" "error"="unexpected error while probing vcenter for infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereCluster cluster-testinfra/cluster-testinfra: unable to create tags manager: POST https://sazvmk.sii24.pole-emploi.intra/rest/com/vmware/cis/session: 503 Service Unavailable" "name"="cluster-testinfra" "namespace"= "cluster-testinfra" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="VSphereCluster"
E0617 14:57:59.391434 1 controller.go:317] controller/vspherevm "msg"="Reconciler error" "error"="unable to create tags manager: POST https://sazvmk.sii24.pole-emploi.intra/rest/com/vmware/cis/session: 503 Service Unavailable" "name"="cluster-tdc-worker-md-0-6964f4bdfd-twxkm" "namespace"="cluster-tdc" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="VSphereVM"
What did you expect to happen:
No change in the number of open sessions and no added pressure on vSphere services.
Anything else you would like to add:
It seems the session keep-alive feature has been enabled by default. On v1.1.1 we did not activate it, and in 1.2.0 we are unable to deactivate it to verify whether it is the culprit.
Environment:
- Cluster-api-provider-vsphere version: 1.2.0
- Cluster API version: 1.1.4
- vSphere Client version: 6.7.0.51000
- Kubernetes version:
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:52:18Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release):
NAME="AlmaLinux"
VERSION="9.0 (Emerald Puma)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.0"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.0 (Emerald Puma)"
ANSI_COLOR="0;34"
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
CAPV versions 1.3.x and above set the --enable-keep-alive flag by default, so this should no longer be the case.
For 1.2.x CAPV, you could enable this flag by adding --enable-keep-alive to the CAPV deployment, which should mitigate these errors as well.
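For reference, a minimal sketch of what that change could look like in the controller Deployment manifest. The deployment and namespace names below assume a default install (`capv-controller-manager` in `capv-system`); verify them in your own cluster before patching, and keep any args already present:

```yaml
# Sketch: add --enable-keep-alive to the CAPV controller args
# (assumed default names: Deployment capv-controller-manager,
# namespace capv-system -- check with `kubectl get deploy -A`).
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --leader-elect        # keep existing args as-is
            - --enable-keep-alive   # added: turns on session keep-alive
```

Editing the args in place (e.g. via `kubectl -n capv-system edit deployment capv-controller-manager`) triggers a rollout of the controller pod with the flag applied.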
/lifecycle frozen
/remove-lifecycle stale
/priority awaiting-more-evidence