WIP: use netlink instead of calling iproute2 commands
See #1978
This is still WIP; at this time the PR is mainly for running the CI. Feedback is very welcome though, especially on the non-trivial TODOs.
@stgraber we currently don't have any tests specifically for this package; if any of the functions are tested, it's only incidentally. Unit tests don't really work since this code interacts with the system and there's no point in mocking that. I'm thinking maybe a test program that can run any of these functions via reflection, plus a test script that calls that program and uses the iproute2 tools to check that it does the right thing. Ideally I'd also like to be able to run these tests locally without messing up my network setup, so maybe in a VM. I'm not sure whether that fits into our current testing system somehow? Some of these might not even be testable in the CI; for example, vrf is a kernel module that may or may not be loaded.
Yeah, it's not the kind of code that's particularly directly testable; it makes more sense to have it be indirectly tested through our normal system tests and the daily test runs we have on Jenkins too.
Hmm, can we generate coverage data from the existing tests to see if they hit all the code paths in this package?
Hmm, not particularly easily, as we typically run our tests across a LOT of incusd instances, so we can't centrally keep track of everything that was hit. And some of the more complex bits (SR-IOV and the like) are tested externally, some manually, some automatically through Jenkins.
Hmm, I see. The tests already caught some bugs that I'm fixing now, but without knowing whether everything is tested it's hard to have confidence in the end result.
@gwenya yeah, the best we can do is land this one early in the release to maximize the time we have for all our various tests to catch issues.
The vlan link type is blocked on https://github.com/vishvananda/netlink/pull/1078, since the netlink library currently does not support setting the gvrp flag on vlan links.
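For context, this is roughly what creating a vlan link through vishvananda/netlink looks like today; the parent interface name and vlan ID are placeholder values, and the point is that the `Vlan` struct has nowhere to put the gvrp flag yet:

```go
package example

import "github.com/vishvananda/netlink"

// createVlan is a sketch of vlan link creation via netlink.
func createVlan() error {
	parent, err := netlink.LinkByName("eth0") // placeholder parent interface
	if err != nil {
		return err
	}

	attrs := netlink.NewLinkAttrs()
	attrs.Name = "eth0.100"
	attrs.ParentIndex = parent.Attrs().Index

	vlan := &netlink.Vlan{
		LinkAttrs: attrs,
		VlanId:    100,
		// No field for the gvrp flag here, hence the dependency on
		// https://github.com/vishvananda/netlink/pull/1078.
	}

	return netlink.LinkAdd(vlan)
}
```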
The veth link type is semi-blocked on https://github.com/vishvananda/netlink/pull/1079 since the netlink library currently does not support setting more than peer name and address on veth create, but we could probably work around that by adjusting the other settings after creating it.
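A rough sketch of that workaround, assuming we go through vishvananda/netlink directly: create the pair with only the peer name and hardware address (what the library accepts at creation time today), then look the peer up and apply the remaining settings afterwards. Interface names, MAC and MTU below are placeholders:

```go
package example

import (
	"net"

	"github.com/vishvananda/netlink"
)

// createVethPair creates a veth pair, then fixes up settings the library
// can't set at creation time.
func createVethPair() error {
	peerHwAddr, err := net.ParseMAC("02:00:00:aa:bb:cc")
	if err != nil {
		return err
	}

	attrs := netlink.NewLinkAttrs()
	attrs.Name = "veth-host"

	veth := &netlink.Veth{
		LinkAttrs:        attrs,
		PeerName:         "veth-peer",
		PeerHardwareAddr: peerHwAddr,
	}

	if err := netlink.LinkAdd(veth); err != nil {
		return err
	}

	// Settings not supported on create get applied to the peer after the fact.
	peer, err := netlink.LinkByName("veth-peer")
	if err != nil {
		return err
	}

	return netlink.LinkSetMTU(peer, 1400)
}
```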
@stgraber can you check if there's something wrong with the cluster tests at the moment or if my changes broke something?
We've seen cluster failures on other PRs too, no idea what changed.
@gwenya since you cleared the draft status, I'm assuming this is now good to go?
@gwenya this work looks great, thanks!
On top of the comments above about addressing the TODO left in this PR, the only other thing that caught my eye is the error messages. Incus' pattern for error messages is to start with a capital letter.
(Yes, this goes counter to Go recommendations but we've generally found this to be more readable and we're trying to keep things consistent. We've been testing Go's recommendation in some other projects like Incus OS and may roll that out to Incus eventually if found to work well enough)
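For illustration (the function names here are made up), the difference is just:

```go
package example

import "fmt"

// Incus style: error messages start with a capital letter.
func incusStyle(err error) error {
	return fmt.Errorf("Failed to replace route: %w", err)
}

// Standard Go recommendation (not what Incus uses): lowercase, no trailing punctuation.
func goStyle(err error) error {
	return fmt.Errorf("failed to replace route: %w", err)
}
```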
@gwenya ping
Oh sorry I completely missed your review, thanks for pinging me! I’ll take a closer look at it later today.
I just noticed that we parse the routing table and various other network information from procfs in various places like here: https://github.com/lxc/incus/blob/28452f93d607f07bca0aa1c01de50bc4e2ec5b13/internal/server/network/network_utils.go#L650
Should we see if we can use the ip package there as well? Might be best in a separate PR though.
Yeah, that'd make sense to move that over to the ip package and see if there's a more optimized way to get that out of netlink.
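For reference, a rough sketch of what reading the routing table from netlink (rather than parsing /proc/net/route) could look like, using vishvananda/netlink directly; the exact shape of the helper that would end up in the ip package is an assumption:

```go
package example

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// printRoutes dumps IPv4 routes via netlink instead of parsing procfs.
func printRoutes() error {
	// A nil link returns routes across all interfaces.
	routes, err := netlink.RouteList(nil, netlink.FAMILY_V4)
	if err != nil {
		return fmt.Errorf("Failed to list routes: %w", err)
	}

	for _, r := range routes {
		fmt.Printf("dst=%v gw=%v dev-index=%d table=%d\n", r.Dst, r.Gw, r.LinkIndex, r.Table)
	}

	return nil
}
```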
@gwenya branch looks good to me.
Do you want to do the net.IP/net.IPNet switchover and remove that last TODO before we merge this one?
@stgraber will do!
@stgraber I made the change to use net.IP and net.IPNet, both in the file that had the TODO and in the other places as well. Let me know if it looks good and I'll rebase it into the existing commits.
Remaining issue to track down:
2025-06-22T23:25:22.0304357Z time="2025-06-22T23:25:22Z" level=error msg="Failed to restore route" driver=bridge err="Failed to replace route {Ifindex: 3 Dst: 0.0.0.0/0 Src: <nil> Gw: 100.64.1.1 Flags: [] Table: 254 Realm: 0}: network is unreachable: Nexthop has invalid gateway" network=inc27523x project=default
That hits in the cluster tests. It's non-fatal since the backup/restore of routes is best-effort, but the failure is unexpected.
I added the scope to our route struct so we can add it back with the same scope; I think that should fix it.
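A minimal sketch of what carrying the scope through backup/restore could look like with vishvananda/netlink; the helper name and its parameters are illustrative, not the actual code in the PR:

```go
package example

import (
	"net"

	"github.com/vishvananda/netlink"
)

// restoreRoute re-adds a previously backed-up route, keeping the scope it
// originally had rather than whatever the kernel would default to.
func restoreRoute(ifindex int, dst *net.IPNet, gw net.IP, table int, scope netlink.Scope) error {
	route := &netlink.Route{
		LinkIndex: ifindex,
		Dst:       dst,
		Gw:        gw,
		Table:     table,
		Scope:     scope,
	}

	return netlink.RouteReplace(route)
}
```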
Sounds good. I'm planning on doing another quick review/rebase pass on this one and then merging it shortly after 6.14 is out. That will give us a full dev cycle to catch any extra regressions.
@gwenya thanks for the great work!
@gwenya FYI, I'm seeing a couple of potentially related CI failures in Jenkins today. One is a network I/O limit validation failure, which suggests that the limits are no longer being applied for some reason.
Another is a macvlan related network test which is now failing to give us an IPv6 address.
I'll look at these two a bit closer this week. It shouldn't be too difficult to figure out what's going on and fix it, and that's exactly why I wanted this merged right after a stable release, so we can catch those weird edge cases ;)
@stgraber At a quick glance, I think I'm not setting the all-multicast flag properly; could that be the cause of the IPv6 issue?
Quite likely. IPv6 actively relies on multicast for router advertisements.
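A minimal sketch of what turning that flag on via vishvananda/netlink might look like, assuming the all-multicast flag really is the culprit; the interface name is a placeholder:

```go
package example

import "github.com/vishvananda/netlink"

// enableAllMulticast is the netlink equivalent of
// "ip link set <name> allmulticast on".
func enableAllMulticast(name string) error {
	link, err := netlink.LinkByName(name)
	if err != nil {
		return err
	}

	// Without this flag the interface can miss multicast traffic that
	// IPv6 router advertisements rely on.
	return netlink.LinkSetAllmulticastOn(link)
}
```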
We also got this failure that showed up a few hours ago: https://jenkins.linuxcontainers.org/job/incus-test-usb/651/console
Ah yeah, we should probably ignore that error in Flush.
Should I push fixes for these things in here or in a new PR?
In a new PR since we've merged this one. Can still come out of the same branch though.