network: L3 routed networking without shared L2 domain
Networking without Layer 2
This proposal is to add a new networking feature to CloudStack where Instances are directly assigned /32 (IPv4) and/or /128 (IPv6) addresses without a shared Layer 2 domain.
A shared Layer 2 domain in this case would be a VLAN or VXLAN VNI where Instances share the same Broadcast/Multicast domain and where they use a shared IP-gateway for their routing.
Layer 3
By leveraging various features of the Linux kernel (making this a KVM-only feature), we can directly route an IPv4 and/or IPv6 address to a virtual machine by using a dynamic routing protocol like BGP, but this could also work with OSPF(v3).
By eliminating the need for Layer 2 we can create a routed network where no Instance has a "network relationship" with another Instance. Every Instance has one or more routes installed in the routing table of the network and can be routed to any host at any time.
In the examples below I will use two IP-addresses:
- 2.57.57.30
- 2001:678:3a4:100::80
Hypervisor host as gateway
On the hypervisor the cloudbr0 bridge will be created and assigned an IPv4 and IPv6 address:
auto cloudbr0
iface cloudbr0 inet static
    address 169.254.0.1/32
    address fe80::1/64
    bridge-ports none
    bridge-stp off
    bridge-fd 0
All Instances will be connected to this bridge and they will be configured to use the following IP-gateways:
- 169.254.0.1
- fe80::1
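One prerequisite this implies but does not spell out: since the hypervisor itself routes the traffic for its Instances, IP forwarding has to be enabled on the host. A minimal sketch (the sysctl file name is just an example):
# Assumed prerequisite: the hypervisor must forward IPv4/IPv6 packets
# between cloudbr0 and its uplink for this design to work.
cat <<'EOF' > /etc/sysctl.d/99-routed-instances.conf
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
EOF
sysctl --system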
Inside the Instance
As there is no Layer 2 available, the IP-configuration within the VM has to be done using ConfigDrive for cloud-init; a Virtual Router handing out DHCP and cloud-init data is not possible in this design. The VM also has no way of detecting the cloud-init source over the network, as our current CloudStack datasource in cloud-init relies on the DHCP server as its source.
After the Instance has used cloud-init to fetch the networking information from ConfigDrive the Netplan (Ubuntu Linux) configuration would look like this:
network:
  ethernets:
    ens18:
      accept-ra: no
      nameservers:
        addresses:
          - 2620:fe::fe
          - 2620:fe::9
      addresses:
        - 2.57.57.30/32
        - 2001:678:3a4:100::80/128
      routes:
        - to: default
          via: fe80::1
        - to: default
          via: 169.254.0.1
          on-link: true
  version: 2
In this configuration the Instance uses the addresses configured on cloudbr0 as its gateways, meaning that the hypervisor will act as the gateway and route the IP-traffic.
This results in the interface being configured as follows:
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:02:45:76:d2:35 brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    inet 2.57.57.30/32 scope global ens18
       valid_lft forever preferred_lft forever
    inet6 2001:678:3a4:100::80/128 scope global
       valid_lft forever preferred_lft forever
All hypervisors will use an identical configuration for cloudbr0; this allows all Instances to have the same routes in their routing table:
root@web01:~# ip -6 route show
::1 dev lo proto kernel metric 256 pref medium
2001:678:3a4:100::80 dev ens18 proto kernel metric 256 pref medium
fe80::/64 dev ens18 proto kernel metric 256 pref medium
default via fe80::1 dev ens18 proto static metric 1024 pref medium
root@web01:~# ip -4 route show
default via 169.254.0.1 dev ens18 proto static onlink
root@web01:~#
ARP and NDP neighbor configuration
CloudStack is aware of the IPv4 and/or IPv6 addresses assigned to an Instance as well as the MAC address. On the hypervisor these entries have to be installed into the kernel's routing table and neighbor table. In this example the commands would be:
ip -6 route add 2001:678:3a4:100::80/128 dev cloudbr0
ip -6 neigh add 2001:678:3a4:100::80 lladdr 52:02:45:76:d2:35 dev cloudbr0 nud permanent
ip -4 route add 2.57.57.30/32 dev cloudbr0
ip -4 neigh add 2.57.57.30 lladdr 52:02:45:76:d2:35 dev cloudbr0 nud permanent
These entries would need to be added upon Instance start on that host and removed on Instance stop/migrate. The KVM Agent should handle the orchestration of these entries.
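To illustrate the orchestration (a hypothetical shell helper, purely to sketch what the Agent would run on the host; the script name, arguments and defaults are made up):
#!/bin/sh
# Hypothetical sketch of the plumbing the KVM Agent would perform.
# Usage: routed-nic.sh add|del <ipv4> <ipv6> <mac> [bridge]
ACTION="$1"; IPV4="$2"; IPV6="$3"; MAC="$4"; BRIDGE="${5:-cloudbr0}"

case "$ACTION" in
  add)
    # Install the host routes and pin the neighbor entries on Instance start
    ip -4 route replace "${IPV4}/32" dev "$BRIDGE"
    ip -4 neigh replace "$IPV4" lladdr "$MAC" dev "$BRIDGE" nud permanent
    ip -6 route replace "${IPV6}/128" dev "$BRIDGE"
    ip -6 neigh replace "$IPV6" lladdr "$MAC" dev "$BRIDGE" nud permanent
    ;;
  del)
    # Clean up on Instance stop or migration away from this host
    ip -4 neigh del "$IPV4" dev "$BRIDGE"
    ip -4 route del "${IPV4}/32" dev "$BRIDGE"
    ip -6 neigh del "$IPV6" dev "$BRIDGE"
    ip -6 route del "${IPV6}/128" dev "$BRIDGE"
    ;;
esac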
Dynamic Routing
Configuring these entries in the routing table is not sufficient; they also need to be advertised to the upstream network. For this the hypervisor host needs to run some form of dynamic routing. BGP is the most commonly used, while others would like to use OSPF(v3).
In both cases the hypervisor will announce these /32 (IPv4) and /128 (IPv6) addresses to the upstream network while receiving a default route (0.0.0.0/0 and ::/0) from the network to be able to route traffic.
A very simple piece of configuration for FRRouting (BGP or OSPF) could be:
BGP
router bgp <ASN>
 redistribute kernel route-map only-cloud
!
route-map only-cloud permit 10
 match interface cloudbr0
OSPF
router ospf
 redistribute kernel route-map only-cloud
 network YOUR_NETWORK/XX area 0.0.0.0
!
route-map only-cloud permit 10
 match interface cloudbr0
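A slightly fuller (hedged) sketch of an frr.conf for the BGP case; the private ASNs and the upstream peer at 198.51.100.1 are placeholders, and the actual peering design depends on your fabric:
! Sketch only: placeholder ASNs/addresses. A comparable IPv6 session (or
! RFC 5549 extended next-hops) would carry the /128 routes as well.
router bgp 4200001000
 bgp router-id 198.51.100.10
 neighbor 198.51.100.1 remote-as 65000
 !
 address-family ipv4 unicast
  redistribute kernel route-map only-cloud
  neighbor 198.51.100.1 activate
 exit-address-family
!
route-map only-cloud permit 10
 match interface cloudbr0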
IP address pools
As each Instance is assigned an IPv4 and/or IPv6 address, there is no need to create a "network" inside CloudStack. The concept would be that CloudStack simply has a pool of addresses to choose from and allocates them to an Instance.
A pool could be:
- 2.57.57.80
- 145.31.53.21
- 90.78.37.15
- 88.17.11.53
- 2001:db8::100
- 2001:678:3a4:100::80
- 2a00:f10:415:27::100
These addresses have no relationship with each other, but they don't have to as each individual address is assigned to a VM.
This networking setup also allows for very easy single stack IPv6-only Virtual Machines where IPv4 can be added or removed when needed. There is no dependency on either of the two protocols.
Summary
This networking design completely eliminates the use of Layer 2 broadcast/multicast domains. Each Instance becomes a fully L3-routed part of the network where CloudStack's orchestration will make sure the addresses are routed towards the host the Instance runs on.
Using this setup it's very easy to create a massively scalable and reliable network spanning multiple datacenters as there is no shared L2 or VXLAN overlay.
The most common use-case for this feature will probably be public cloud providers which need to assign public IPv4/IPv6 addresses to Instances and want to share nothing between the VMs.
On my personal blog I also wrote an article about this which could be of help: https://blog.widodh.nl/2025/12/linux-bridging-with-virtual-machines-and-pure-l3-routing-and-bgp/
This fashion of IP assignment can simplify network management for large public clouds. However, in some datacenters, IP ranges are delivered to servers through VLANs which would not be compatible with this setup.
That is correct, but this feature would not be aimed at those use-cases. This feature requires you to have full control over the networking fabric and to be able to use dynamic routing protocols (BGP) everywhere.
I believe that if your use-case requires this feature, you have already designed such a network or are able to deploy it.
This is aimed at large-scale deployments.
Also, user data is nowadays crucial for cloud application deployment. I believe, this should be implemented as a separate networking option to keep everything working smoothly as before.
As I wrote, we could and should use ConfigDrive for this. Using ConfigDrive we can provide the VM with the required networking configuration.
One benefit for this setup would be that there would be no need to allocate a full IP subnet which sometimes costs 2 IPs to be reserved for the subneting. So individual IPs can be used for IP assignment.
Yes, exactly! You can allocate all 256 addresses of an IPv4 /24 subnet, for example. With IPv6 this is no problem as there are more than enough addresses in a subnet, but with IPv4 it can be. And if you run out, you can just allocate an additional /26 (for example) to CloudStack and use all of its addresses.
You get the same/similar result with an anycast gateway and EVPN type-2 (MAC+IP) routes... EVPN provides host routes for each IP, and each hypervisor can have an anycast gateway for public networks, so the VMs get routed out by their hypervisors.
This proposal would, however, be a simpler network setup (no EVPN and VXLAN needed), require less training and be easier to troubleshoot. How does the VM connect to the outside world, or to 169.254.0.1/32 rather? What type of interface/driver would a VM use?
I have extensive experience with VXLAN+EVPN and I love it, truly do! I have given many presentations and talks about how it works and I use it in production.
But it's complex and you have to deal with an overlay and underlay network, and you need equipment which can offload and handle it. It's not that easy to operate. This proposal is indeed much simpler and more straightforward routing.
If you look at the example I posted above, these are the default routes for the VM:
root@web01:~# ip -6 route show
::1 dev lo proto kernel metric 256 pref medium
2001:678:3a4:100::80 dev ens18 proto kernel metric 256 pref medium
fe80::/64 dev ens18 proto kernel metric 256 pref medium
default via fe80::1 dev ens18 proto static metric 1024 pref medium
root@web01:~# ip -4 route show
default via 169.254.0.1 dev ens18 proto static onlink
root@web01:~#
I did a tracepath on this VM (running on Proxmox as a PoC) to Quad9 DNS:
root@web01:~# tracepath 9.9.9.9 -n
1?: [LOCALHOST] pmtu 1500
1: 169.254.0.1 0.231ms
1: 169.254.0.1 0.087ms
2: 185.187.12.3 6.733ms
3: 185.187.12.170 0.682ms
4: 193.239.116.123 1.603ms !H
Resume: pmtu 1500
root@web01:~#
root@web01:~# tracepath 2620:fe::fe -n
1?: [LOCALHOST] 0.018ms pmtu 1500
1: 2001:678:3a4:100::1 0.133ms
1: 2001:678:3a4:100::1 0.096ms
2: 2001:678:3a4:1::2 0.421ms asymm 3
3: 2a0b:8f80::ae 2.988ms asymm 4
4: 2a05:1500:ff00:50::a 0.508ms asymm 3
5: no reply
6: no reply
7: no reply
8: no reply
^C
root@web01:~#
Yes, I'm just wondering how the hypervisor and VM communicate. Which network driver are you using?
Nothing special. Just virtio as a NIC, just like in any other situation. On the hypervisor there is a Linux bridge which is the IPv4/IPv6 gateway for the VM. No special driver required.
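Purely as an illustration of how ordinary this is (CloudStack generates the equivalent domain XML itself; the domain name web01 is just the example VM from this thread), attaching such a NIC by hand with libvirt would be:
# Hypothetical example: a plain virtio NIC on the cloudbr0 Linux bridge,
# using the MAC address from the examples above. No special driver involved.
virsh attach-interface --domain web01 --type bridge --source cloudbr0 \
  --model virtio --mac 52:02:45:76:d2:35 --config --live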
Ok, then I don't get it. How does the VM communicate with the gw on the bridge SVI? They're on separate subnets; you install a default gw to 169.254.0.1 (/32) on the VM but it's unreachable, right? Is there some magic going on in the driver?
See this part in the netplan config:
- to: default
  via: 169.254.0.1
  on-link: true
The magic is the 'on-link' here: you tell the Linux kernel that this IP is reachable directly on the link and that ARP has to be used to resolve it.
root@web01:~# ip -4 route show
default via 169.254.0.1 dev ens18 proto static onlink
root@web01:~#
For IPv6 this is not required as fe80::1 is part of the link-local spec.
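As an aside (not from the thread itself): the same on-link behaviour can be configured by hand with iproute2, which makes the mechanism easy to test outside of netplan:
# Manual equivalent of the netplan routes above
ip route add default via 169.254.0.1 dev ens18 onlink
ip -6 route add default via fe80::1 dev ens18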
Oooooh! That's interesting! I didn't know about that knob. That is true magic! Hahah. I'll definitely try it out in a lab.
This is really a good idea, thanks @wido
Thanks! I must say I'm excited about this. Full L3, I love it. True routing!
we could introduce a new type of zone (similar to Edge zone), let's call it "Routed zone".
Yes, sounds like a plan.
what's your thoughts on the following topics ?
* is CPVM needed ? we are thinking of moving CPVM from VM to a package like `cloudstack-console-proxy` which can be installed anywhere.
We will need a Console Proxy as users will need to be able to connect to their console, right?
* is SSVM needed ? ACS supports direct-download templates for kvm zones. This could be also moved to a package in the future.
I would say yes. We just need to support this network config on the VR; the VR gets this data via 'cmdline' and we can configure the NIC that way.
* is SG supported ? It is required for public cloud providers I think. we could consider new implementation using libvirt nwfilter (related to [[Draft] KVM: enable no-mac-spoofing on virtual nics #8951](https://github.com/apache/cloudstack/pull/8951))
I don't see why SG wouldn't work; it's a regular bridge. No modification required.
nwfilter is much better in Libvirt, but our current code would work just fine.
Q: This does not segregate bad actors from each other on the same hypervisor, does it? Reason for Q: The assumption here is that the Client instances are semi under your control, to not use the root user to assign secondary IPs on the guest interfaces, correct? Also, you can't use anything like privacy extensions
They can assign additional IPs if they want, but those will never reach the routing table of the hypervisor. These IPs will simply never work.
I would, for the sake of security, also enforce source guest-MAC/guest-IP and destination guest-MAC/guest-IP filter rules on the guest interfaces inside the hypervisor.
The CloudStack Security Grouping already does that. Packets coming out of the VM are matched on the source IP address and MAC address; if they don't match what CloudStack expects they are dropped by the hypervisor. This is already existing functionality.
I come from a setup where I have users sharing L2 VLANs between guests, and then segregating their L2 VLANs with a firewall to the rest of the world.
Yes, but as mentioned, CS already does this.
Agree, security groups should always be enabled in the zone/network.
Yes, don't disagree there. But this network model doesn't allow for spoofing MACs or hijacking IPs. Without security groups you could send out IP packets with a spoofed source address, which can be used for DDoS amplification attacks.
We should always filter in this network setup to prevent such traffic being sent from the VM.
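Purely as an illustration of that kind of filtering (this is not CloudStack's actual security group code; the tap device name vnet0 is assumed, and the MAC and IPs are the example values from this thread), host-side anti-spoofing can be expressed with ebtables on the Instance's interface:
# Illustration only: drop frames leaving the Instance's tap device that do
# not carry the MAC/IP CloudStack assigned to it.
ebtables -A FORWARD -i vnet0 -s ! 52:02:45:76:d2:35 -j DROP
ebtables -A FORWARD -i vnet0 -p IPv4 --ip-source ! 2.57.57.30 -j DROP
ebtables -A FORWARD -i vnet0 -p IPv6 --ip6-source ! 2001:678:3a4:100::80 -j DROP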
We use something similar for our VMs. The "onlink" thing in Linux is very flexible: it allows the host to ARP for the MAC of a gateway that you route to the interface but that is not on the same subnet as the host IP (in fact there is no "subnet", as it's just a /32 or /128). It turns Ethernet into a pure data-link really, just a shim to allow IP packets to flow. Very neat.
Yes, it does. Super nice!
Windows can do this as well by the way, it can also route to an "on-link" gateway.
route add <gateway-ip> mask 255.255.255.255 <interface-ip>
That should work