Cannot maximize UPLINK bandwidth as a cluster without the aid of special hardware
In the current architecture of MicroCloud (as far as I understand), the network bandwidth to/from the UPLINK is limited when a VM is not running on the same hypervisor as the active OVN router. With 3 VMs on 3 hypervisors respectively, each assigned an OVN network forward, I can get ~47 Gbps out of one VM but only ~8 Gbps out of each of the two remaining VMs.
$ for i in {1..3}; do iperf3 -c 10.0.1.20$i -i 0 -P 4 -R; done | grep -wE 'host|SUM'
Connecting to host 10.0.1.201, port 5201
Reverse mode, remote host 10.0.1.201 is sending
[SUM] 0.00-10.01 sec 9.15 GBytes 7.86 Gbits/sec
[SUM] 0.00-10.01 sec 9.17 GBytes 7.87 Gbits/sec 19323 sender
[SUM] 0.00-10.01 sec 9.15 GBytes 7.86 Gbits/sec receiver
Connecting to host 10.0.1.202, port 5201
Reverse mode, remote host 10.0.1.202 is sending
[SUM] 0.00-10.01 sec 54.7 GBytes 47.0 Gbits/sec
[SUM] 0.00-10.01 sec 54.7 GBytes 47.0 Gbits/sec 68415 sender
[SUM] 0.00-10.01 sec 54.7 GBytes 47.0 Gbits/sec receiver
Connecting to host 10.0.1.203, port 5201
Reverse mode, remote host 10.0.1.203 is sending
[SUM] 0.00-10.01 sec 9.42 GBytes 8.08 Gbits/sec
[SUM] 0.00-10.01 sec 9.43 GBytes 8.10 Gbits/sec 17662 sender
[SUM] 0.00-10.01 sec 9.42 GBytes 8.08 Gbits/sec receiver
In my understanding, this is because only the active router is used even after assigning the forwards, and the traffic from/to the remaining VMs must go through OVN Geneve tunneling to reach the active router. The traffic is slow because of the overhead of the UDP-based tunneling, etc.
[Green is the current flow, and the blue is the ideal route]
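For what it's worth, which hypervisor currently holds the active gateway can be checked from the OVN databases. A rough sketch, assuming MicroOVN's snap command wrappers; the router port name is illustrative (LXD usually names it like lxd-netN-lr-lrp-ext), so adjust it to your deployment:
# Which chassis is the chassisredirect (active gateway) port bound to?
microovn.ovn-sbctl find Port_Binding type=chassisredirect
# Gateway chassis priorities for the uplink-facing router port (name is illustrative):
microovn.ovn-nbctl lrp-get-gateway-chassis lxd-net2-lr-lrp-ext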
There are some ideas to overcome this challenge and maximize the UPLINK bandwidth of the cluster as a whole (let's assume HTTP content servers in an edge cloud use case).
- SR-IOV: VM traffic goes out/in directly without UDP tunneling (downside: requires specific hardware, and private and public networks cannot be mixed)
- hw-tc-offload: requires special hardware
- Bridge: it's not really the "MicroCloud" way even though it's possible with plain LXD, since MicroCloud sets up OVN by default
Possible solution 1 - Distributed floating IP equivalent, i.e. distributed LXD forward
With OpenStack + OVN, there is a configuration option called ovn.enable_distributed_floating_ip. When it is enabled and a VM has a floating IP (the equivalent of an LXD forward), the traffic goes out/in directly at the local hypervisor.
https://docs.openstack.org/neutron/latest/admin/ovn/routing.html#distributed-floating-ip
It would be nice if MicroCloud (LXD) supported that kind of opt-in configuration, so that ~141 Gbps (47+47+47) would be possible for an edge cloud instead of the current ~63 Gbps (47+8+8).
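For reference, the OVN mechanism behind this: a dnat_and_snat (1:1 NAT) entry becomes distributed once it also carries the logical port and an external MAC, so the chassis hosting that port performs the NAT locally instead of hairpinning through the active gateway chassis. A rough northbound-level sketch with placeholder names/addresses (not something LXD exposes today):
# Hedged sketch only; the router, port, internal IP and MAC below are placeholders.
microovn.ovn-nbctl lr-nat-add lxd-net2-lr dnat_and_snat \
  10.0.1.201 <internal-ip-of-test-vm-1> <logical-port-of-test-vm-1> <external-mac>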
Possible solution 2 - Flat/VLAN provider network equivalent, i.e. a VM behind OVN can take an IP address directly from the UPLINK network
In OpenStack, that's supported: users can choose whether to attach a VM network port to a private network or directly to the external network, as long as they have permission to do so. Unlike an additional LXD bridge, it doesn't require any additional physical network port; everything is managed by the OVN layer. https://docs.openstack.org/neutron/latest/admin/deploy-ovs-provider.html#network-traffic-flow
Pros: no NAT and no OVN router in between, i.e. ideal performance
Cons: not so cloud-like usage
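For reference, the OpenStack workflow this corresponds to is simply attaching the server's port directly to the provider (external) network instead of a tenant network, roughly like the following (names are placeholders):
# Hedged sketch of the OpenStack-side equivalent (placeholder names):
openstack port create --network <provider-network> vm1-ext-port
openstack server create --image <image> --flavor <flavor> --port vm1-ext-port vm1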
Possible solution 3 - Announcing /32 via BGP from the local hypervisor
As far as I understand, the current LXD/MicroCloud implementation announces a subnet through the active router. If we could announce the exact IP address of a VM (i.e. a /32) through a local OVN router on the local hypervisor, that would be ideal for a BGP use case.
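For comparison, this is roughly how BGP announcement is enabled in LXD today, with the peers configured on the uplink network (peer address/ASN below are placeholders). As far as I understand, the external subnets and forwards are then announced from the member holding the active gateway chassis, whereas the idea here is to announce each /32 from the member that actually hosts the VM:
# Hedged sketch of LXD's existing BGP settings (placeholder peer/ASN values):
lxc config set core.bgp_address=10.0.1.11:179
lxc config set core.bgp_asn=65000
lxc config set core.bgp_routerid=10.0.1.11
lxc network set UPLINK bgp.peers.tor.address=10.0.1.1
lxc network set UPLINK bgp.peers.tor.asn=65001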
The complete testing steps on a single machine, just for the record
Define testing networks
lxc network create test-mc --type bridge \
  dns.mode=none \
  ipv4.address=10.0.1.1/24 \
  ipv4.dhcp=false \
  ipv4.nat=true \
  ipv6.address=none
lxc network create test-mc-tunnel --type bridge \
  dns.mode=none \
  ipv4.address=10.0.2.1/24 \
  ipv4.dhcp=false \
  ipv4.nat=false \
  ipv6.address=none
Define a testing profile
lxc profile create test-mc-profile <<EOF
devices:
  main:
    name: enp5s0
    network: test-mc
    type: nic
  ovs:
    name: enp6s0
    network: test-mc
    type: nic
  tunnel:
    name: enp7s0
    network: test-mc-tunnel
    type: nic
  root:
    path: /
    pool: default
    type: disk
EOF
Create 3 VMs to be a MicroCloud cluster
for i in {1..3}; do
cat <<EOF | lxc launch ubuntu:noble --vm test-mc-$i \
--profile test-mc-profile \
-c limits.cpu=6 -c limits.memory=6GiB \
-c user.network-config="$(cat -)"
version: 2
ethernets:
  enp5s0:
    addresses:
      - 10.0.1.1${i}/24
    routes:
      - to: default
        via: 10.0.1.1
    nameservers:
      addresses:
        - 10.0.1.1
  enp6s0:
    dhcp4: false
    dhcp6: false
    accept-ra: false
  enp7s0:
    addresses:
      - 10.0.2.1${i}/24
EOF
done
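(Optional sanity check, not strictly required: before installing the snaps, confirm the three VMs are running with the expected static addresses.)
# Optional: the three test-mc-* VMs should be RUNNING with 10.0.1.11-13.
lxc list test-mc- -c ns4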
Install snaps
for i in {1..3}; do
lxc exec test-mc-$i -- bash -c '
snap install lxd --cohort="+"
snap install microovn --cohort="+"
snap install microcloud --cohort="+"
'
done
Bootstrap MicroCloud
for i in {1..3}; do
cat <<EOF | lxc exec test-mc-$i -- bash -c 'microcloud preseed' &
lookup_subnet: 10.0.1.0/24
initiator: test-mc-1
session_passphrase: test
systems:
- name: test-mc-1
  ovn_uplink_interface: enp6s0
  ovn_underlay_ip: 10.0.2.11
- name: test-mc-2
  ovn_uplink_interface: enp6s0
  ovn_underlay_ip: 10.0.2.12
- name: test-mc-3
  ovn_uplink_interface: enp6s0
  ovn_underlay_ip: 10.0.2.13
ovn:
  ipv4_gateway: 10.0.1.1/24
  ipv4_range: 10.0.1.101-10.0.1.150
  dns_servers: 10.0.1.1
EOF
done
wait
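(Optional sanity check: confirm the cluster formed before adding storage.)
# Optional: all three members should be listed as online.
lxc exec test-mc-1 -- lxc cluster list
lxc exec test-mc-1 -- microovn cluster list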
Add local storage for testing
lxc exec test-mc-1 -- bash -c '
for i in {1..3}; do
lxc storage create default dir --target test-mc-$i
done
lxc storage create default dir
lxc profile device add default root disk pool=default path=/
'
Create 3 VMs on 3 MicroCloud hosts respectively
lxc exec test-mc-1 -- bash -c '
for i in {1..3}; do
lxc launch ubuntu:noble --vm test-vm-$i --target test-mc-$i \
-c limits.cpu=4 -c limits.memory=4GiB
done
'
Install iperf3 server in the 3 VMs
lxc exec test-mc-1 -- bash -c '
for i in {1..3}; do
lxc exec test-vm-$i -- bash -c "
env DEBIAN_FRONTEND=noninteractive apt install -Uy iperf3
systemctl enable --now iperf3.service
"
done
'
Set up OVN forwards to the 3 VMs
lxc exec test-mc-1 -- bash -c '
lxc network set UPLINK ipv4.routes=10.0.1.201/32,10.0.1.202/32,10.0.1.203/32
for i in {1..3}; do
lxc network forward create default 10.0.1.20$i target_address=$(lxc list test-vm-$i -c 4 -f csv | cut -d" " -f1)
done
'
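(Optional: verify the forwards picked up the right listen and target addresses.)
# Optional: each forward should show its listen address and the VM's internal address.
lxc exec test-mc-1 -- lxc network forward list default
lxc exec test-mc-1 -- lxc network forward show default 10.0.1.201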
Run iperf3
$ for i in {1..3}; do iperf3 -c 10.0.1.20$i -i 0 -P 4 -R; done | grep -wE 'host|SUM'
Connecting to host 10.0.1.201, port 5201
Reverse mode, remote host 10.0.1.201 is sending
[SUM] 0.00-10.01 sec 9.15 GBytes 7.86 Gbits/sec
[SUM] 0.00-10.01 sec 9.17 GBytes 7.87 Gbits/sec 19323 sender
[SUM] 0.00-10.01 sec 9.15 GBytes 7.86 Gbits/sec receiver
Connecting to host 10.0.1.202, port 5201
Reverse mode, remote host 10.0.1.202 is sending
[SUM] 0.00-10.01 sec 54.7 GBytes 47.0 Gbits/sec
[SUM] 0.00-10.01 sec 54.7 GBytes 47.0 Gbits/sec 68415 sender
[SUM] 0.00-10.01 sec 54.7 GBytes 47.0 Gbits/sec receiver
Connecting to host 10.0.1.203, port 5201
Reverse mode, remote host 10.0.1.203 is sending
[SUM] 0.00-10.01 sec 9.42 GBytes 8.08 Gbits/sec
[SUM] 0.00-10.01 sec 9.43 GBytes 8.10 Gbits/sec 17662 sender
[SUM] 0.00-10.01 sec 9.42 GBytes 8.08 Gbits/sec receiver
Hey, based on recent discussions we are going to evaluate "solution 1". @tomponline please jump in if I am under the wrong impression here.
As this is something related to code changes in LXD, I'll transfer the issue. MicroCloud will surely benefit from it but likely doesn't require any code changes to accommodate the solution.
@roosterfish based on our discussions with OVN team it'll most likely be option 3 as that enables the use of cluster-wide overlay networking, whilst still allowing both 1:1 NAT and non-NAT setups to announce routes from the instance-local hypervisor, and also allows anycast based load balancing.
Yes, you are right. I was thinking more about the use case where you don't use BGP. Maybe both scenarios (with/without BGP) could be accommodated when we switch to a new approach?
My understanding is the focus is on BGP at this time.