
Cannot maximize the bandwidth with UPLINK as a cluster without the aid of special hardware

Open nobuto-m opened this issue 1 month ago • 4 comments

In the current architecture of MicroCloud (as far as I understand), the network bandwidth to/from the UPLINK is limited when a VM is not running on the same hypervisor as the active OVN router. If there are 3 VMs on 3 hypervisors respectively, each with an OVN forward assigned, I can get ~47 Gbps out of one VM but only ~8 Gbps from each of the two remaining VMs.

$ for i in {1..3}; do iperf3 -c 10.0.1.20$i -i 0 -P 4 -R; done | grep -wE 'host|SUM'
Connecting to host 10.0.1.201, port 5201
Reverse mode, remote host 10.0.1.201 is sending
[SUM]   0.00-10.01  sec  9.15 GBytes  7.86 Gbits/sec                  
[SUM]   0.00-10.01  sec  9.17 GBytes  7.87 Gbits/sec  19323             sender
[SUM]   0.00-10.01  sec  9.15 GBytes  7.86 Gbits/sec                  receiver
Connecting to host 10.0.1.202, port 5201
Reverse mode, remote host 10.0.1.202 is sending
[SUM]   0.00-10.01  sec  54.7 GBytes  47.0 Gbits/sec                  
[SUM]   0.00-10.01  sec  54.7 GBytes  47.0 Gbits/sec  68415             sender
[SUM]   0.00-10.01  sec  54.7 GBytes  47.0 Gbits/sec                  receiver
Connecting to host 10.0.1.203, port 5201
Reverse mode, remote host 10.0.1.203 is sending
[SUM]   0.00-10.01  sec  9.42 GBytes  8.08 Gbits/sec                  
[SUM]   0.00-10.01  sec  9.43 GBytes  8.10 Gbits/sec  17662             sender
[SUM]   0.00-10.01  sec  9.42 GBytes  8.08 Gbits/sec                  receiver

In my understanding, this is because only the active router is used even after assigning the forwards, and the traffic from/to the remaining VMs must go through OVN Geneve tunneling to reach the active router. The traffic is slow because of the nature of UDP tunneling, etc.

[Figure: green is the current flow, blue is the ideal route]
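For what it's worth, which hypervisor currently hosts the active gateway can be confirmed directly from OVN. A rough sketch using the MicroOVN snap wrappers (the router and port names are deployment-specific placeholders and have to be looked up first):

# List the logical routers and their ports to find the gateway router port name.
lxc exec test-mc-1 -- microovn.ovn-nbctl lr-list
lxc exec test-mc-1 -- microovn.ovn-nbctl lrp-list <router-name>

# Show the gateway chassis scheduled for that router port.
lxc exec test-mc-1 -- microovn.ovn-nbctl lrp-get-gateway-chassis <lrp-name>

# The southbound database shows which chassis currently binds the
# chassis-redirect port (cr-<lrp-name>), i.e. where the active router lives.
lxc exec test-mc-1 -- microovn.ovn-sbctl show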

There are some ideas to overcome this challenge and maximize the UPLINK bandwidth of the whole cluster (let's assume HTTP content servers in an edge cloud use case).

  • SR-IOV: the VM's traffic goes directly out/in without UDP tunneling (downside: requires particular hardware, and there is no mixing of private and public networks); see the sketch after this list
  • hw-tc-offload: requires special hardware
  • Bridge: not really the "MicroCloud" way even though it's possible with plain LXD, since MicroCloud sets up OVN by default
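For comparison, the SR-IOV option above would look roughly like the following in LXD (the parent NIC name is a placeholder for a real SR-IOV capable port): the VF is handed straight to the VM, so there is no Geneve/OVN in the datapath, but the port is also outside the OVN-managed private network.

# Attach an SR-IOV VF from the physical NIC to a VM (parent name is illustrative).
lxc config device add test-vm-1 eth1 nic nictype=sriov parent=enp129s0f0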

Possible solution 1 - Distributed floating IP equivalent, i.e. distributed LXD forward

Meanwhile, OpenStack + OVN has a configuration option called ovn.enable_distributed_floating_ip. When it's enabled and a VM has a floating IP (the LXD forward equivalent), the traffic goes directly out/in at the local hypervisor. https://docs.openstack.org/neutron/latest/admin/ovn/routing.html#distributed-floating-ip

It would be nice if MicroCloud (LXD) supported that kind of opt-in configuration so that ~141 Gbps (47+47+47) would be possible for an edge cloud instead of the current ~63 Gbps (47+8+8).
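For reference, Neutron implements distributed floating IPs at the OVN level by creating dnat_and_snat entries that also carry the VM's logical port and an external MAC, which is what lets the NAT happen on the chassis where the VM runs instead of on the gateway chassis. A hedged sketch of the underlying call (router, addresses, port name, and MAC below are made up for illustration; LXD does not expose this today):

# lr-nat-add ROUTER dnat_and_snat EXTERNAL_IP LOGICAL_IP [LOGICAL_PORT EXTERNAL_MAC]
# Without the last two arguments the NAT is centralized on the gateway chassis;
# with them, OVN applies the NAT on the chassis that binds the VM's port.
microovn.ovn-nbctl lr-nat-add lxd-net2-lr dnat_and_snat \
    10.0.1.201 10.240.0.10 \
    lxd-net2-instance-aaaa-eth0 00:16:3e:aa:bb:cc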

Possible solution 2 - Flat/VLAN provider network equivalent, i.e. a VM behind OVN can take an IP address directly from the UPLINK network

In OpenStack this is supported, meaning users can choose to attach a VM network port either to a private network or directly to the external network, as long as they have permission to do so. It doesn't require any additional physical network port (unlike an additional LXD bridge); everything is managed at the OVN layer. https://docs.openstack.org/neutron/latest/admin/deploy-ovs-provider.html#network-traffic-flow

Pros: no NAT, no OVN router in between, i.e. ideal performance.
Cons: not so cloud-like usage.
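At the OVN level this maps to a logical switch with a localnet port patched to the provider (UPLINK) segment, so a VM port on that switch is bridged straight onto the physical network with no router or NAT in the path. A minimal sketch of the plumbing with made-up names, just to illustrate the construct rather than any existing LXD API:

# Logical switch representing the flat provider network (names are illustrative).
microovn.ovn-nbctl ls-add provider-uplink

# localnet port: patches the switch to the physical network named in the
# chassis' ovn-bridge-mappings (e.g. physnet1:br-uplink).
microovn.ovn-nbctl lsp-add provider-uplink provider-uplink-localnet
microovn.ovn-nbctl lsp-set-type provider-uplink-localnet localnet
microovn.ovn-nbctl lsp-set-addresses provider-uplink-localnet unknown
microovn.ovn-nbctl lsp-set-options provider-uplink-localnet network_name=physnet1

# A VM port attached to this switch would then use an UPLINK address directly,
# with traffic leaving via the local chassis' provider bridge (no Geneve, no NAT).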

Possible solution 3 - Announcing /32 routes via BGP from the local hypervisor

As far as I understand, the current LXD/MicroCloud implementation announces a subnet through the active router. If we could announce the exact IP address of a VM (i.e. a /32) through a local OVN router on the local hypervisor, that would be ideal for a BGP use case.
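For context, this is roughly how the BGP side is configured in LXD today (key names as per the LXD documentation; the addresses and ASNs are placeholders): the listener is enabled per server and peers are defined on the uplink network, and LXD then announces the OVN network's external subnets and forward addresses, currently via the active gateway as described above.

# Enable the built-in BGP listener on a cluster member (placeholder values).
lxc config set core.bgp_asn 64512
lxc config set core.bgp_address 10.0.1.11:179
lxc config set core.bgp_routerid 10.0.1.11

# Define the upstream peer on the uplink network.
lxc network set UPLINK bgp.peers.tor.address 10.0.1.1
lxc network set UPLINK bgp.peers.tor.asn 64513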

The complete testing steps on a single machine, just for the record

Define testing networks

lxc network create test-mc --type bridge \
    dns.mode=none \
    ipv4.address=10.0.1.1/24 \
    ipv4.dhcp=false \
    ipv4.nat=true \
    ipv6.address=none

lxc network create test-mc-tunnel --type bridge \
    dns.mode=none \
    ipv4.address=10.0.2.1/24 \
    ipv4.dhcp=false \
    ipv4.nat=false \
    ipv6.address=none

Define a testing profile

lxc profile create test-mc-profile <<EOF
devices:
  main:
    name: enp5s0
    network: test-mc
    type: nic
  ovs:
    name: enp6s0
    network: test-mc
    type: nic
  tunnel:
    name: enp7s0
    network: test-mc-tunnel
    type: nic
  root:
    path: /
    pool: default
    type: disk
EOF

Create 3 VMs to be a MicroCloud cluster

for i in {1..3}; do
    cat <<EOF | lxc launch ubuntu:noble --vm test-mc-$i \
        --profile test-mc-profile \
        -c limits.cpu=6 -c limits.memory=6GiB \
        -c user.network-config="$(cat -)"
version: 2
ethernets:
  enp5s0:
    addresses:
      - 10.0.1.1${i}/24
    routes:
      - to: default
        via: 10.0.1.1
    nameservers:
      addresses:
        - 10.0.1.1
  enp6s0:
    dhcp4: false
    dhcp6: false
    accept-ra: false
  enp7s0:
    addresses:
      - 10.0.2.1${i}/24
EOF

done
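Once cloud-init has applied the static addresses, it's worth sanity-checking connectivity from the host before going further; a quick check (filtering lxc list by the name prefix):

# Confirm the three VMs picked up 10.0.1.11-13 and can reach the gateway.
lxc list test-mc -c ns4
lxc exec test-mc-1 -- ping -c 1 10.0.1.1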

Install snaps

for i in {1..3}; do 
    lxc exec test-mc-$i -- bash -c '
        snap install lxd --cohort="+"
        snap install microovn --cohort="+"
        snap install microcloud --cohort="+"
    '
done

Bootstrap MicroCloud

for i in {1..3}; do
    cat <<EOF | lxc exec test-mc-$i -- bash -c 'microcloud preseed' &
lookup_subnet: 10.0.1.0/24
initiator: test-mc-1
session_passphrase: test

systems:
- name: test-mc-1
  ovn_uplink_interface: enp6s0
  ovn_underlay_ip: 10.0.2.11
- name: test-mc-2
  ovn_uplink_interface: enp6s0
  ovn_underlay_ip: 10.0.2.12
- name: test-mc-3
  ovn_uplink_interface: enp6s0
  ovn_underlay_ip: 10.0.2.13

ovn:
  ipv4_gateway: 10.0.1.1/24
  ipv4_range: 10.0.1.101-10.0.1.150
  dns_servers: 10.0.1.1
EOF

done

wait
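After the preseed finishes on all three members, the cluster state can be verified from the initiator (assuming the commands shipped in the current MicroCloud/LXD snaps):

# Confirm all three members joined the LXD cluster and MicroCloud itself.
lxc exec test-mc-1 -- lxc cluster list
lxc exec test-mc-1 -- microcloud cluster list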

Add local storage for testing

lxc exec test-mc-1 -- bash -c '
    for i in {1..3}; do
        lxc storage create default dir --target test-mc-$i
    done

    lxc storage create default dir

    lxc profile device add default root disk pool=default path=/
'

Create 3 VMs on 3 MicroCloud hosts respectively

lxc exec test-mc-1 -- bash -c '
    for i in {1..3}; do
        lxc launch ubuntu:noble --vm test-vm-$i --target test-mc-$i \
            -c limits.cpu=4 -c limits.memory=4GiB
    done
'

Install the iperf3 server in the 3 VMs

lxc exec test-mc-1 -- bash -c '
    for i in {1..3}; do
        lxc exec test-vm-$i -- bash -c "
            env DEBIAN_FRONTEND=noninteractive apt install -Uy iperf3
            systemctl enable --now iperf3.service
        "
    done
'

Set up OVN forwards to the 3 VMs

lxc exec test-mc-1 -- bash -c '
    lxc network set UPLINK ipv4.routes=10.0.1.201/32,10.0.1.202/32,10.0.1.203/32
    for i in {1..3}; do
        lxc network forward create default 10.0.1.20$i target_address=$(lxc list test-vm-$i -c 4 -f csv | cut -d" " -f1)
    done
'
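Before running iperf3, the forwards and their target addresses can be double-checked with the standard LXD commands:

# List the forwards on the OVN network and inspect one of them.
lxc exec test-mc-1 -- lxc network forward list default
lxc exec test-mc-1 -- lxc network forward show default 10.0.1.201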

Run iperf3

$ for i in {1..3}; do iperf3 -c 10.0.1.20$i -i 0 -P 4 -R; done | grep -wE 'host|SUM'
Connecting to host 10.0.1.201, port 5201
Reverse mode, remote host 10.0.1.201 is sending
[SUM]   0.00-10.01  sec  9.15 GBytes  7.86 Gbits/sec                  
[SUM]   0.00-10.01  sec  9.17 GBytes  7.87 Gbits/sec  19323             sender
[SUM]   0.00-10.01  sec  9.15 GBytes  7.86 Gbits/sec                  receiver
Connecting to host 10.0.1.202, port 5201
Reverse mode, remote host 10.0.1.202 is sending
[SUM]   0.00-10.01  sec  54.7 GBytes  47.0 Gbits/sec                  
[SUM]   0.00-10.01  sec  54.7 GBytes  47.0 Gbits/sec  68415             sender
[SUM]   0.00-10.01  sec  54.7 GBytes  47.0 Gbits/sec                  receiver
Connecting to host 10.0.1.203, port 5201
Reverse mode, remote host 10.0.1.203 is sending
[SUM]   0.00-10.01  sec  9.42 GBytes  8.08 Gbits/sec                  
[SUM]   0.00-10.01  sec  9.43 GBytes  8.10 Gbits/sec  17662             sender
[SUM]   0.00-10.01  sec  9.42 GBytes  8.08 Gbits/sec                  receiver

nobuto-m (Oct 29 '25)

Hey, based on recent discussions we are going to evaluate "solution 1". @tomponline please jump in if I am under the wrong impression here.

As this is something related to code changes in LXD, I'll transfer the issue. MicroCloud will surely benefit from it but likely doesn't require any code changes to accommodate the solution.

roosterfish (Nov 13 '25)

@roosterfish based on our discussions with OVN team it'll most likely be option 3 as that enables the use of cluster-wide overlay networking, whilst still allowing both 1:1 NAT and non-NAT setups to announce routes from the instance-local hypervisor, and also allows anycast based load balancing.

tomponline (Nov 13 '25)

> @roosterfish based on our discussions with OVN team it'll most likely be option 3 as that enables the use of cluster-wide overlay networking, whilst still allowing both 1:1 NAT and non-NAT setups to announce routes from the instance-local hypervisor, and also allows anycast based load balancing.

Yes, you are right, I was more thinking about the use case where you don't use BGP. Maybe both scenarios (with/without BGP) could be accommodated when we switch to a new approach?

roosterfish (Nov 13 '25)

My understanding is the focus is on BGP at this time.

tomponline (Nov 13 '25)