[Feature Request] Implement LBaaS in Yaook
Yaook, as a further implementation of the SCS standards, does not yet support a standards-conformant load balancer. We have to provide one. Here, the only requirement is to provide an OpenStack-conformant endpoint to the user; the behavior behind the scenes does not matter.
Tasks:
- [ ] Evaluate options for LBaaS
- [ ] Extend Yaook to support a standards-conformant LBaaS
This issue is related to #587, which standardizes mandatory and recommended IaaS services; LBaaS should be part of it.
Evaluate options for LBaaS
FTR, here is one of the main problems that prevented integration of Octavia in Yaook so far: https://storyboard.openstack.org/#!/story/2007370#comment-153426
In Yaook, all database instances run behind HAProxy instances. According to the linked issue, this seems to lead to severe problems with Octavia in production.
We should at least consider having a look at improving Octavia and/or its integration, as re-implementing the whole Octavia LBaaS v2 API on top of a different LB framework would be no easy feat either.
We should get in touch with @horazont and check if there were other issues observed with Octavia than the one mentioned above that would need to be addressed as well.
I had a discussion with @horazont about this:
- the upstream issue report^1 seems to suggest a race condition between a) the Octavia API instructing its workers via RPC and b) MariaDB syncing the Octavia API's database write to the other replicas: a worker may attempt to read the entry while HAProxy schedules its query to a different DB replica that has not yet received the sync (a toy simulation of this pattern follows the list)
- however, @horazont said he is not convinced that this is actually the problem, since in Yaook the HAProxy instances are configured to always schedule DB queries to the first DB replica
- we should reproduce and analyze the issue
- there seems to be an OVN backend driver^2 for Octavia, we should have a look at that one too
- creating an Octavia alternative with full API compatibility is a huge task and we should try all other options of getting Octavia working correctly in Yaook first, I think
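To make the suspected race easier to discuss, here is a toy Python simulation of the pattern (not Yaook or Octavia code; all names are made up): the API writes to the primary and fires the RPC immediately, while replication lags behind.

```python
import threading
import time

# Toy stand-ins for the DB primary and a lagging replica.
primary = {}
replica = {}

def replicate_with_lag(key, delay):
    """Simulate asynchronous replication: the replica sees the write late."""
    time.sleep(delay)
    replica[key] = primary[key]

def api_create_lb():
    """Octavia-API analog: write the LB record, then notify a worker via RPC."""
    primary["lb-1"] = "PENDING_CREATE"
    threading.Thread(target=replicate_with_lag, args=("lb-1", 0.5)).start()
    worker_handle_rpc("lb-1")  # RPC arrives before replication completes

def worker_handle_rpc(lb_id):
    """Worker analog: its query lands on the replica that is still behind."""
    record = replica.get(lb_id)  # -> None: the record is not there yet
    print(f"worker read from replica: {record!r}")

api_create_lb()
```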
While discussing the topic in a small topic kickoff with @kgube, @josephineSei and @kitsudaiki we identified the following tasks:
- Get in touch with the relevant CSPs and check if the SCS reference implementation ever experienced issues like the one mentioned above^1.
- Research which subset of the Octavia API is actually used and strictly needed by the KaaS part of the SCS reference implementation.
- Identify all possible use cases that the Octavia API offers and how each can be tested.
- Implement an Octavia operator prototype to integrate Octavia into Yaook.
- Test the Octavia integration in Yaook, try to reproduce the original issue^1 and find a fix for it.
Note that, aside from the last point, most of these tasks are independent and can be addressed in parallel.
@markus-hentsch In Kolla (used by OSISM by default to deploy OpenStack) there are two ways to access the MariaDB Galera cluster: HAProxy or ProxySQL. In both cases, all nodes in a cluster access the database through the same node, namely the one holding the primary IP address managed by Keepalived. If Keepalived is not used and the database is accessed in some other way, possibly not via a single node, I think Galera still ensures that the information is identical on all nodes, because Galera implements a multi-master cluster.
Started with a first prototypical octavia-operator for YAOOK. Reactivated the old issue on GitLab regarding the Octavia integration ( https://gitlab.com/yaook/operator/-/issues/186 ), and created a new branch for an octavia-operator ( https://gitlab.com/yaook/operator/-/tree/feature/add-octavia-operator ) as well as an Octavia docker image for YAOOK ( https://gitlab.com/yaook/images/octavia/-/tree/feature/initial-version ) for the implementation.
For Amphora images you can use https://github.com/osism/openstack-octavia-amphora-image. I will add 2024.2 images later today.
I was looking through the Octavia documentation today and tried to identify use cases and features.
Different drivers / possible other choices:
First, it should be noted that the VM or container running the load balancer itself (the amphora) can be substituted by other provider drivers (see: https://docs.openstack.org/octavia/latest/admin/providers/index.html):
- Amphora: the reference driver from the Octavia project
- A10 Networks OpenStack Octavia Driver: for Thunder, vThunder and AX Series Appliances
- F5 Networks Provider Driver for OpenStack Octavia by SAP SE
- OVN Octavia Provider Driver
- Radware Provider Driver for OpenStack Octavia
- VMware NSX
Of all these drivers, only OVN is compared to Amphora in the feature matrices of the documentation linked above. Looking through those matrices, there is a huge gap, mainly around everything needed for Layer 7 load balancing, which OVN does not support.
Example Use Cases
The Octavia guides for basic and Layer 7 load balancing give many examples of use cases.
These can be very coarsely divided into the following (a minimal sketch of the basic case follows the list):
- Load-balancers for each: UDP, TCP, HTTP, HTTPS
- Applying health monitors to your load balancer
- Applying TLS-termination
- Applying Layer 7 load balancing rules, including authentication, redirecting requests with invalid certificates, and using cookies
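To make the basic use case concrete, here is a minimal sketch using openstacksdk; the cloud name `yaook-test`, the subnet ID, and the member addresses are hypothetical placeholders, not values from this deployment.

```python
import openstack

# Hypothetical cloud name from clouds.yaml; subnet and member values are
# placeholders. In real use, wait for the LB to return to ACTIVE between
# each of these steps.
conn = openstack.connect(cloud="yaook-test")

lb = conn.load_balancer.create_load_balancer(
    name="demo-lb", vip_subnet_id="SUBNET_ID")
conn.load_balancer.wait_for_load_balancer(lb.id)

listener = conn.load_balancer.create_listener(
    name="demo-http", protocol="HTTP", protocol_port=80,
    load_balancer_id=lb.id)

pool = conn.load_balancer.create_pool(
    name="demo-pool", protocol="HTTP", lb_algorithm="ROUND_ROBIN",
    listener_id=listener.id)

# Two backend members, balanced round-robin.
for address in ("192.0.2.10", "192.0.2.11"):
    conn.load_balancer.create_member(
        pool.id, address=address, protocol_port=8080, subnet_id="SUBNET_ID")

# Health monitor so unhealthy members are taken out of rotation.
conn.load_balancer.create_health_monitor(
    pool_id=pool.id, type="HTTP", delay=5, timeout=3, max_retries=3)
```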
Problem: TLS
To make use of TLS termination or re-encryption, a deployment with a working key manager is needed. For the SCS project this means we cannot make use of these features, because we do not mandate a key manager within a deployment.
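To illustrate the dependency: terminating TLS at the listener means handing Octavia a certificate bundle stored in the key manager (Barbican). A minimal sketch, with placeholder IDs and a hypothetical cloud name:

```python
import openstack

conn = openstack.connect(cloud="yaook-test")  # hypothetical cloud name

# TLS termination requires the certificate bundle to live in Barbican,
# which is why a working key manager is a hard prerequisite here.
listener = conn.load_balancer.create_listener(
    name="demo-https",
    protocol="TERMINATED_HTTPS",
    protocol_port=443,
    load_balancer_id="LB_ID",  # placeholder for an existing load balancer
    default_tls_container_ref=(
        "https://barbican.example.org/v1/secrets/SECRET_ID"),  # placeholder ref
)
```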
Further Features
There are other features that come with Octavia that are mostly useful for operators:
- Amphora Log Offloading (via syslog over the lbaas-management network)
- API Auditing (using Oslo messaging notifier -> could be routed to e.g. a log file)
- API Health Monitoring
- Octavia Flavors (predefined sets of provider configuration options, defined per provider driver; see the sketch after this list)
- Amphora Failover Circuit Breaker (threshold for failovers to prevent mass failovers)
- using SR-IOV ports for Amphorae (increasing performance)
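As an example for the flavors feature, a minimal sketch of defining an amphora flavor via openstacksdk; the flavor data key shown is an amphora driver capability, and all names are illustrative.

```python
import json
import openstack

conn = openstack.connect(cloud="yaook-test")  # hypothetical cloud name

# Admin side: a flavor profile binds provider-specific settings...
profile = conn.load_balancer.create_flavor_profile(
    name="amphora-ha",
    provider_name="amphora",
    flavor_data=json.dumps({"loadbalancer_topology": "ACTIVE_STANDBY"}),
)

# ...and a flavor exposes them to users under a simple name.
conn.load_balancer.create_flavor(
    name="ha",
    flavor_profile_id=profile.id,
    is_enabled=True,
    description="Active/standby amphora pair",
)
```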
This might be important concerning Octavia with OVN backend: https://github.com/osism/issues/issues/959
To make our Cluster Stacks work (or other Cluster API solutions) without special hacks, we need LBaaSv2 load balancers in two places: (1) in front of kube-api (created by capo) and (2) in front of a deployed ingress controller (or gateway) (created by OCCM). Neither requires TLS termination to work. (TLS termination is a feature often desired by users of VM-based workloads, so you may still want to consider offering it.)
I have been trying to use the OVN provider instead of amphorae in Cluster-API-Provider (KaaS-v1), because this makes things much more resource-efficient and reliable, and also allows seeing the client IPs. (In general, I'm more convinced of the design of doing L3 load balancing right at the network level and leaving the L7 complexity to some other place.)
Historic information is here, updated by https://github.com/SovereignCloudStack/k8s-cluster-api-provider/blob/main/Release-Notes-R6.md#ovn-lb I don't remember the status of configuring OVN provider LB for cluster-stacks; I know it was on the list of things to do. Maybe @jschoone or @chess-knight or @lindenb1 can comment on it.
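For reference, requesting the OVN provider explicitly only takes the `provider` field at load balancer creation; a minimal sketch, assuming the ovn-octavia-provider is enabled in the deployment (cloud name and subnet ID are placeholders):

```python
import openstack

conn = openstack.connect(cloud="yaook-test")  # hypothetical cloud name

# L4-only load balancer handled by OVN instead of an amphora VM;
# requires the ovn-octavia-provider to be enabled in the deployment.
lb = conn.load_balancer.create_load_balancer(
    name="ovn-demo",
    vip_subnet_id="SUBNET_ID",  # placeholder
    provider="ovn",
)
```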
Current state of the implementation of the octavia-operator for YAOOK:
- database and message queue for Octavia are up and running
- all Octavia services are up and running, each within its own pod
- octavia-api is reachable and responds to openstack-client requests
- configuration of the load balancer management network and the Octavia-specific certificate configuration are still in progress, so the Octavia configuration is not complete at the moment
Status update for the octavia-operator in YAOOK:
In addition to the network configuration and certificates that were still missing last week, some other problems appeared during testing; these were fixed thanks to debugging support from @markus-hentsch. The current prototype of the octavia-operator with Amphora works. The setup was successfully tested by creating a load balancer with two VMs behind it and accessing both VMs round-robin via the floating IP bound to the load balancer.
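For reference, the round-robin check can be reproduced with a few repeated requests against the floating IP; a minimal sketch, assuming the hypothetical floating IP 203.0.113.10 and backend VMs that each return an identifying response:

```python
from urllib.request import urlopen

# Hypothetical floating IP bound to the load balancer; each backend VM
# serves a page identifying itself, so alternating responses indicate
# round-robin distribution.
FLOATING_IP = "203.0.113.10"

for _ in range(6):
    with urlopen(f"http://{FLOATING_IP}/", timeout=5) as resp:
        print(resp.read().decode().strip())
```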
The current state is only a first prototype and not ready to be merged into the main branch. Still open tasks:
- cleanup code (remove debug stuff and so on)
- write documentation
- automate certificate creation (this was done manually for the test, because Octavia requires a passphrase for the keys; see the sketch after this list)
- cleaner solution for the traffic between the amphora load-balancer-VM and the health-manager
- write unit- and integration-tests
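Regarding the certificate point above: a minimal sketch of generating a passphrase-protected CA key with the `cryptography` library, as Octavia expects for its certificate setup; the file name and passphrase handling are illustrative only.

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# Generate a CA key and store it encrypted; Octavia expects its CA keys
# to be passphrase-protected, which is why manual creation was needed.
key = rsa.generate_private_key(public_exponent=65537, key_size=4096)

pem = key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.TraditionalOpenSSL,
    encryption_algorithm=serialization.BestAvailableEncryption(b"demo-only"),
)

with open("server_ca.key.pem", "wb") as f:  # illustrative file name
    f.write(pem)
```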
Regarding the SCS KaaS requirements for load balancing: there is a draft for a DR (though it might be changed to a standard) that requires the Service type LoadBalancer. This is only L3/L4 load balancing, which could well be provided by Octavia's OVN backend.
There is also an octavia ingress controller, which uses the LBaaS L7 capabilities to implement the Ingress API. We currently don't require an Ingress controller to be created in KaaS clusters, but even if that changes, there are Pod-based ingress controllers that work behind a Service of type LoadBalancer and do not require external L7 support.
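To make the requirement concrete, this is the Service type in question, created here via the official Kubernetes Python client; in an OpenStack-backed cluster, OCCM materializes it as an Octavia load balancer. Names and selectors are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()

# A Service of type LoadBalancer; in an OpenStack-backed cluster the
# cloud controller manager (OCCM) backs it with an Octavia LB.
service = client.V1Service(
    metadata=client.V1ObjectMeta(name="demo-lb"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "demo"},  # placeholder selector
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)

client.CoreV1Api().create_namespaced_service(namespace="default", body=service)
```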
Restructured the code and added basic documentation. Created a merge request, in draft state so far, to get the CI pipeline checks green: https://gitlab.com/yaook/operator/-/merge_requests/2679 ; unit and integration tests still have to be done.
While trying to add a second provider network to my test deployment, which should act as the load balancer management network for further tests, I accidentally broke my OVN setup with an invalid configuration, which took me some time to fix.
I removed the Nova and Neutron layer from the test deployment again to completely wipe the networking stuff, cleaned up any remnants and redeployed both services. It should work again now.
Added unit and integration tests for the new octavia-operator in YAOOK, and after also testing with a dedicated load-balancer provider network, I marked the merge requests as ready for review:
- https://gitlab.com/yaook/operator/-/merge_requests/2679
- https://gitlab.com/yaook/images/octavia/-/merge_requests/1
(the Octavia image has to be merged first in order to set valid image tags, so the operator MR is still in draft mode at the moment)
A basic version of the octavia-operator with the Amphora provider was merged in YAOOK. A manual for the current implementation is also in the YAOOK documentation: https://docs.yaook.cloud/handbook/octavia-operator.html
@kitsudaiki this is awesome.
Further updates on the octavia-operator in YAOOK will be done in follow-up issues like:
- https://gitlab.com/yaook/operator/-/issues/525
- https://gitlab.com/yaook/operator/-/issues/524
- https://gitlab.com/yaook/operator/-/issues/520
Because the basic requirement of this issue is fulfilled, it will be closed.