lxd-ci: Debug why `tests/network-ovn` peering test fails on GHA runners but succeeds locally
Running `PURGE_LXD=1 ./bin/local-run tests/network-ovn latest/edge`, the peering test works locally even when installing the 6.5 Azure kernel in a LXD VM. However, it consistently fails on GHA runners at this point:
echo "==> Test that pinging external addresses between networks does worth without peering (goes via uplink)"
lxc exec ovn2 --project=ovn2 -- ping -nc1 -4 -w5 198.51.100.2
lxc exec ovn2 --project=ovn2 -- ping -nc1 -6 -w5 2001:db8:1:2::2
I captured some information from a tmate debug session:
root@fv-az520-983:~# lxc ls --all-projects
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| PROJECT | NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| ovn1 | ovn1 | RUNNING | 198.51.100.2 (eth0) | fd42:a663:6118:2961:216:3eff:fed0:54b9 (eth0) | CONTAINER | 0 |
| | | | 198.51.100.1 (eth0) | 2001:db8:1:2::2 (eth0) | | |
| | | | 10.153.20.2 (eth0) | 2001:db8:1:2::1 (eth0) | | |
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| ovn2 | ovn2 | RUNNING | 10.143.162.2 (eth0) | fd42:489c:e694:94ea:216:3eff:fec1:59b3 (eth0) | CONTAINER | 0 |
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
root@fv-az520-983:~# lxc network ls --project ovn1
+------+------+---------+----------------+---------------------------+-------------+---------+---------+
| NAME | TYPE | MANAGED | IPV4 | IPV6 | DESCRIPTION | USED BY | STATE |
+------+------+---------+----------------+---------------------------+-------------+---------+---------+
| ovn1 | ovn | YES | 10.153.20.1/24 | fd42:a663:6118:2961::1/64 | | 1 | CREATED |
+------+------+---------+----------------+---------------------------+-------------+---------+---------+
root@fv-az520-983:~# lxc network ls --project ovn2
+------+------+---------+-----------------+---------------------------+-------------+---------+---------+
| NAME | TYPE | MANAGED | IPV4 | IPV6 | DESCRIPTION | USED BY | STATE |
+------+------+---------+-----------------+---------------------------+-------------+---------+---------+
| ovn2 | ovn | YES | 10.143.162.1/24 | fd42:489c:e694:94ea::1/64 | | 0 | CREATED |
+------+------+---------+-----------------+---------------------------+-------------+---------+---------+
root@fv-az520-983:~# lxc network peer list ovn1 --project ovn1
+------+-------------+------+-------+
| NAME | DESCRIPTION | PEER | STATE |
+------+-------------+------+-------+
root@fv-az520-983:~# lxc network peer list ovn2 --project ovn2
+---------+-------------+---------+---------+
| NAME | DESCRIPTION | PEER | STATE |
+---------+-------------+---------+---------+
| ovn2foo | | Unknown | ERRORED |
+---------+-------------+---------+---------+
The firewall rules look OK, except for that unusual `security` table:
+ nft list ruleset
table inet lxd {
chain pstrt.lxdbr0 {
type nat hook postrouting priority srcnat; policy accept;
ip saddr 10.10.10.0/24 ip daddr != 10.10.10.0/24 masquerade
ip6 saddr fd42:4242:4242:1010::/64 ip6 daddr != fd42:4242:4242:1010::/64 masquerade
}
chain fwd.lxdbr0 {
type filter hook forward priority filter; policy accept;
ip version 4 oifname "lxdbr0" accept
ip version 4 iifname "lxdbr0" accept
ip6 version 6 oifname "lxdbr0" accept
ip6 version 6 iifname "lxdbr0" accept
}
chain in.lxdbr0 {
type filter hook input priority filter; policy accept;
iifname "lxdbr0" tcp dport 53 accept
iifname "lxdbr0" udp dport 53 accept
iifname "lxdbr0" icmp type { destination-unreachable, time-exceeded, parameter-problem } accept
iifname "lxdbr0" udp dport 67 accept
iifname "lxdbr0" icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem, nd-router-solicit, nd-neighbor-solicit, nd-neighbor-advert, mld2-listener-report } accept
iifname "lxdbr0" udp dport 547 accept
}
chain out.lxdbr0 {
type filter hook output priority filter; policy accept;
oifname "lxdbr0" tcp sport 53 accept
oifname "lxdbr0" udp sport 53 accept
oifname "lxdbr0" icmp type { destination-unreachable, time-exceeded, parameter-problem } accept
oifname "lxdbr0" udp sport 67 accept
oifname "lxdbr0" icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem, echo-request, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert, mld2-listener-report } accept
oifname "lxdbr0" udp sport 547 accept
}
}
table ip security {
chain OUTPUT {
type filter hook output priority 150; policy accept;
meta l4proto tcp ip daddr 168.63.129.16 tcp dport 53 counter packets 0 bytes 0 accept
meta l4proto tcp ip daddr 168.63.129.16 skuid 0 counter packets 1446 bytes 405667 accept
meta l4proto tcp ip daddr 168.63.129.16 ct state invalid,new counter packets 0 bytes 0 drop
}
}
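For reference, that table (seemingly Azure-provisioned, given the 168.63.129.16 wireserver address) can be dropped wholesale with plain nft syntax, nothing LXD-specific:
nft delete table ip security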
However, deleting it didn't help; the pings still don't go through.
Switching from ping to `nc -zv ... 22` doesn't help either; same timeout.
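For example, against the same external address the ping step targets (assuming something listens on port 22 there):
lxc exec ovn2 --project=ovn2 -- nc -zv -w5 198.51.100.2 22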
The gzip-compressed capture azure.pcap.gz shows that at some point the echo reply simply vanishes. This contrasts with a capture from a local VM (with the Azure kernel): local.pcap.gz
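For anyone wanting to compare, a capture of that sort can be taken with something like the following (interface name and filter are assumptions, not the exact commands used):
tcpdump -nni lxdbr0 -w azure.pcap 'icmp or icmp6'
gzip azure.pcap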
@tomponline any idea as to what's going on here? Or what I could try next? If not, my next move will be to try on Canonical-hosted runners.
@simondeziel how do I get the test scripts to stop destroying the env when I run them manually?
I need the env left in the same state it was in when the failure occurred.
@tomponline the cleanup handling is reworked in https://github.com/canonical/lxd-ci/pull/94
Here's something funny: running `sudo tcpdump -nn -i lxdbr0` (which switches the bridge into promiscuous mode) makes it work, and exiting tcpdump breaks it again :)
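If promiscuous mode really is the variable, it should be toggleable without tcpdump; a quick sanity check (assuming lxdbr0 is the affected bridge):
ip link set lxdbr0 promisc on    # pings should start succeeding
ip link set lxdbr0 promisc off   # and fail again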
`lxc network set lxdbr0 ipv4.nat=false` fixes it.
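Since the IPv6 ping fails too, the IPv6 counterpart presumably matters as well; both are standard network config keys and easy to revert:
lxc network set lxdbr0 ipv6.nat=false
lxc network unset lxdbr0 ipv4.nat    # restore the default (NAT enabled)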
Considering whether we should alter the SNAT rule so that it only applies to traffic leaving the bridge via a non-bridge interface, e.g.:
nft add rule inet lxd pstrt.lxdbr0 ip saddr 10.10.10.0/24 ip daddr != 10.10.10.0/24 oif != lxdbr0 masquerade
This also fixes it, by no longer SNATing intra-network traffic whose source address doesn't match that of the main network.
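The IPv6 masquerade rule would presumably need the same guard (subnet taken from the ruleset above), and the chain can be inspected afterwards:
nft add rule inet lxd pstrt.lxdbr0 ip6 saddr fd42:4242:4242:1010::/64 ip6 daddr != fd42:4242:4242:1010::/64 oif != lxdbr0 masquerade
nft list chain inet lxd pstrt.lxdbr0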
https://github.com/canonical/lxd-ci/actions/runs/11300430475/job/31433290050?pr=311#step:10:3879 includes the `rmmod br_netfilter` workaround but still failed:
==> Ping internal and external NIC route addresses over peer connection
+ lxc exec ovn2 --project=ovn2 -- ping -nc1 -4 -w5 198.51.100.1
PING 198.51.100.1 (198.51.100.1) 56(84) bytes of data.
--- 198.51.100.1 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4078ms
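To confirm the workaround actually took effect on the runner, it may be worth checking module and sysctl state; the net.bridge.* sysctls only exist while br_netfilter is loaded:
lsmod | grep br_netfilter || echo 'br_netfilter not loaded'
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables 2>/dev/null || true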
That said, tests/network-ovn "feels" more reliable... for what it's worth.
Not sure why it closed the issue; I've just rebased my fork.
I don't know why other people's forks keep closing this.