cluster-api-provider-openstack

Creation of health monitors for Octavia backends that don't support this feature

Open · jfcavalcante opened this issue 2 years ago · 3 comments

/kind bug

What steps did you take and what happened:

Provision a cluster with OVN as the backend for Octavia.

A load balancer with the OVN provider was created for the API server, and a health monitor was provisioned for it, pushing the load balancer's provisioning status to ERROR.

What did you expect to happen:

CAPO should skip monitor creation for OVN provider.

Anything else you would like to add:

The OVN provider doesn't support the creation of health monitors (see the docs for further info). The current implementation of the getOrCreateMonitor method sends a request to create a monitor and then checks whether monitors are implemented by the provider, which causes unwanted behavior on the OpenStack side.

The LBaaS API health monitor endpoint doesn't return the status code that CAPO checks for in IsNotImplementedError(), which is 501.

I couldn't find a way to check in the API itself whether an Octavia backend supports health monitors. I think a way to go is to enhance this check, or to check against a list of known providers and skip monitor creation when the provider is "ovn", for example (see the sketch after the code excerpt below).

func (s *Service) getOrCreateMonitor(openStackCluster *infrav1.OpenStackCluster, monitorName, poolID, lbID string) error {

...
	monitor, err = s.loadbalancerClient.CreateMonitor(monitorCreateOpts)
	// Skip creating monitor if it is not supported by Octavia provider
	if capoerrors.IsNotImplementedError(err) {
		record.Warnf(openStackCluster, "SkippedCreateMonitor", "Health Monitor is not created as it's not implemented with the current Octavia provider.")
		return nil
	}
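
For illustration, here is a minimal, self-contained sketch of the provider-list approach suggested above. The list contents and the shouldSkipMonitor helper are hypothetical and not part of CAPO; where the provider name would come from (e.g. the cluster's load balancer configuration) is deliberately left out:

package main

import "fmt"

// Hypothetical list of Octavia providers assumed not to support health
// monitors. "ovn" is the example from this issue; newer OVN releases may
// actually support monitors, so such a list would need maintenance.
var monitorUnsupportedProviders = map[string]bool{
	"ovn": true,
}

// shouldSkipMonitor reports whether monitor creation should be skipped
// for the given Octavia provider name.
func shouldSkipMonitor(provider string) bool {
	return monitorUnsupportedProviders[provider]
}

func main() {
	for _, p := range []string{"amphora", "ovn"} {
		fmt.Printf("provider %q: skip monitor creation = %v\n", p, shouldSkipMonitor(p))
	}
}

The obvious downside of a hard-coded list is that it goes stale (as noted below, newer OVN versions do support health monitors), so a check based on the 501 response is probably the more robust option.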


Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): v0.7.3
  • Cluster-API version: v1.4.4
  • Minikube/KIND version: N/A
  • Kubernetes version (use kubectl version): v1.23.10

jfcavalcante · Aug 14 '23 03:08

Let me unpack this a bit; see inline:

What did you expect to happen:

CAPO should skip monitor creation for OVN provider.

True. The real bug here is that, on 501 Not Implemented, CAPO should just skip creating the health monitor and only log a warning.

Anything else you would like to add:

The OVN provider doesn't support the creation of health monitors (see the docs for further info). The current implementation of the getOrCreateMonitor method sends a request to create a monitor and then checks whether monitors are implemented by the provider, which causes unwanted behavior on the OpenStack side.

It doesn't invalidate your issue, but newer OVN versions do support health monitors. The docs you link to are pretty old.

The LBaaS API health monitor endpoint doesn't return the status code that CAPO checks for in IsNotImplementedError(), which is 501.

I've just tested that it does:

mdulko:openshift-clusters/ $ curl -g -i --cacert "/home/mdulko/.config/openstack/standalone-ca.crt" -X POST https://10.1.8.104:13876/v2.0/lbaas/healthmonitors -H "Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: openstacksdk/0.61.0 keystoneauth1/4.5.0 python-requests/2.28.1 CPython/3.10.12" -H "X-Auth-Token: <snip>" -d '{"healthmonitor": {"pool_id": "dfc549d0-1e0d-4777-a7f4-84393707cab7", "delay": 1, "timeout": 2, "max_retries": 3, "type": "TCP", "admin_state_up": true}}'
HTTP/1.1 501 Not Implemented
Date: Thu, 17 Aug 2023 14:58:03 GMT
Server: Apache
Content-Length: 169
x-openstack-request-id: req-3ee09fff-30fb-40bd-962c-7b4d56d1e82a
Content-Type: application/json

{"faultcode": "Server", "faultstring": "Provider 'ovn' does not support a requested action: This provider does not support creating health monitors.", "debuginfo": null}%   

I think I'm using the Train version of the OVN Octavia provider. Let's make sure we're on the same page here; maybe it's just that the code comparing the exception is wrong. Have you checked that?

I couldn't find a way to check in the API itself whether an Octavia backend supports health monitors. I think a way to go is to enhance this check, or to check against a list of known providers and skip monitor creation when the provider is "ovn", for example.

Yep, there's no safe way to check that other than just making a call and checking if it's 501.
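
For completeness, here is a hedged sketch of what such a 501-based check could look like. It assumes gophercloud's ErrUnexpectedResponseCode wrapping of unexpected status codes; CAPO's real IsNotImplementedError may be implemented differently:

package main

import (
	"errors"
	"fmt"
	"net/http"

	"github.com/gophercloud/gophercloud"
)

// isNotImplemented reports whether err carries an HTTP 501 from the
// OpenStack API. gophercloud wraps unexpected status codes in
// ErrUnexpectedResponseCode, which exposes the actual code.
func isNotImplemented(err error) bool {
	var unexpected gophercloud.ErrUnexpectedResponseCode
	if errors.As(err, &unexpected) {
		return unexpected.Actual == http.StatusNotImplemented
	}
	return false
}

func main() {
	// Simulated error, as if CreateMonitor had returned it against an OVN backend.
	err := gophercloud.ErrUnexpectedResponseCode{Actual: http.StatusNotImplemented}
	fmt.Println("treat as not implemented:", isNotImplemented(err))
}

The point is simply that a 501 surfaces as an unexpected response code whose Actual field can be compared against http.StatusNotImplemented; if CAPO's current check compares against something else, that would explain the behavior reported above.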

dulek · Aug 17 '23 15:08

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 26 '24 14:01

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Feb 25 '24 15:02

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Mar 26 '24 16:03

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Mar 26 '24 16:03