
lxd-ci: Debug why `tests/network-ovn` peering test fails on GHA runners but succeeds locally

Open • simondeziel opened this issue 1 year ago • 9 comments

`PURGE_LXD=1 ./bin/local-run tests/network-ovn latest/edge peering` works locally, even when installing the 6.5 Azure kernel in a LXD VM. On GHA runners, however, it consistently fails at this point:

    echo "==> Test that pinging external addresses between networks does worth without peering (goes via uplink)"
    lxc exec ovn2 --project=ovn2 -- ping -nc1 -4 -w5 198.51.100.2
    lxc exec ovn2 --project=ovn2 -- ping -nc1 -6 -w5 2001:db8:1:2::2

I captured some information from a tmate debug session:

root@fv-az520-983:~# lxc ls --all-projects
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| PROJECT | NAME |  STATE  |        IPV4         |                     IPV6                      |   TYPE    | SNAPSHOTS |
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| ovn1    | ovn1 | RUNNING | 198.51.100.2 (eth0) | fd42:a663:6118:2961:216:3eff:fed0:54b9 (eth0) | CONTAINER | 0         |
|         |      |         | 198.51.100.1 (eth0) | 2001:db8:1:2::2 (eth0)                        |           |           |
|         |      |         | 10.153.20.2 (eth0)  | 2001:db8:1:2::1 (eth0)                        |           |           |
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| ovn2    | ovn2 | RUNNING | 10.143.162.2 (eth0) | fd42:489c:e694:94ea:216:3eff:fec1:59b3 (eth0) | CONTAINER | 0         |
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+

root@fv-az520-983:~# lxc network ls --project ovn1
+------+------+---------+----------------+---------------------------+-------------+---------+---------+
| NAME | TYPE | MANAGED |      IPV4      |           IPV6            | DESCRIPTION | USED BY |  STATE  |
+------+------+---------+----------------+---------------------------+-------------+---------+---------+
| ovn1 | ovn  | YES     | 10.153.20.1/24 | fd42:a663:6118:2961::1/64 |             | 1       | CREATED |
+------+------+---------+----------------+---------------------------+-------------+---------+---------+
root@fv-az520-983:~# lxc network ls --project ovn2
+------+------+---------+-----------------+---------------------------+-------------+---------+---------+
| NAME | TYPE | MANAGED |      IPV4       |           IPV6            | DESCRIPTION | USED BY |  STATE  |
+------+------+---------+-----------------+---------------------------+-------------+---------+---------+
| ovn2 | ovn  | YES     | 10.143.162.1/24 | fd42:489c:e694:94ea::1/64 |             | 0       | CREATED |
+------+------+---------+-----------------+---------------------------+-------------+---------+---------+

root@fv-az520-983:~# lxc network peer list ovn1 --project ovn1
+------+-------------+------+-------+
| NAME | DESCRIPTION | PEER | STATE |
+------+-------------+------+-------+
root@fv-az520-983:~# lxc network peer list ovn2 --project ovn2
+---------+-------------+---------+---------+
|  NAME   | DESCRIPTION |  PEER   |  STATE  |
+---------+-------------+---------+---------+
| ovn2foo |             | Unknown | ERRORED |
+---------+-------------+---------+---------+

The firewall rules look OK, except for that unusual `security` table:

 + nft list ruleset
table inet lxd {
	chain pstrt.lxdbr0 {
		type nat hook postrouting priority srcnat; policy accept;
		ip saddr 10.10.10.0/24 ip daddr != 10.10.10.0/24 masquerade
		ip6 saddr fd42:4242:4242:1010::/64 ip6 daddr != fd42:4242:4242:1010::/64 masquerade
	}

	chain fwd.lxdbr0 {
		type filter hook forward priority filter; policy accept;
		ip version 4 oifname "lxdbr0" accept
		ip version 4 iifname "lxdbr0" accept
		ip6 version 6 oifname "lxdbr0" accept
		ip6 version 6 iifname "lxdbr0" accept
	}

	chain in.lxdbr0 {
		type filter hook input priority filter; policy accept;
		iifname "lxdbr0" tcp dport 53 accept
		iifname "lxdbr0" udp dport 53 accept
		iifname "lxdbr0" icmp type { destination-unreachable, time-exceeded, parameter-problem } accept
		iifname "lxdbr0" udp dport 67 accept
		iifname "lxdbr0" icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem, nd-router-solicit, nd-neighbor-solicit, nd-neighbor-advert, mld2-listener-report } accept
		iifname "lxdbr0" udp dport 547 accept
	}

	chain out.lxdbr0 {
		type filter hook output priority filter; policy accept;
		oifname "lxdbr0" tcp sport 53 accept
		oifname "lxdbr0" udp sport 53 accept
		oifname "lxdbr0" icmp type { destination-unreachable, time-exceeded, parameter-problem } accept
		oifname "lxdbr0" udp sport 67 accept
		oifname "lxdbr0" icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem, echo-request, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert, mld2-listener-report } accept
		oifname "lxdbr0" udp sport 547 accept
	}
}
table ip security {
	chain OUTPUT {
		type filter hook output priority 150; policy accept;
		meta l4proto tcp ip daddr 168.63.129.16 tcp dport 53 counter packets 0 bytes 0 accept
		meta l4proto tcp ip daddr 168.63.129.16 skuid 0 counter packets 1446 bytes 405667 accept
		meta l4proto tcp ip daddr 168.63.129.16 ct state invalid,new counter packets 0 bytes 0 drop
	}
}

However, deleting it didn't help; the pings still don't go through.
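(For anyone reproducing this from a tmate session, deleting that table would look something like the following; these are standard nft commands, not ones captured from the failing run. The table targets 168.63.129.16, Azure's wireserver address, so it presumably ships with the Azure image.)

    # Hypothetical reproduction step, not taken from the captured session:
    # drop the whole `security` table, then confirm it is gone before re-trying the pings.
    sudo nft delete table ip security
    sudo nft list tables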

Switching from `ping` to `nc -zv ... 22` doesn't help either; same timeout.

The gzip-compressed capture azure.pcap.gz shows that at some point the echo reply just vanishes. This contrasts with a capture taken from a local VM (with the Azure kernel): local.pcap.gz

simondeziel • Mar 01 '24 20:03

@tomponline any idea what's going on here, or what I could try next? Failing that, my next move will be to try on Canonical-hosted runners.

simondeziel • Mar 01 '24 20:03

@simondeziel how do I get the test scripts to stop destroying the env when I run them manually?

I need the env left in the same state it was in when the failure occurred.

tomponline • Mar 05 '24 12:03

@tomponline the cleanup handling is reworked in https://github.com/canonical/lxd-ci/pull/94

simondeziel • Mar 05 '24 17:03

Here's something funny: running `sudo tcpdump -nn -i lxdbr0` (which switches the bridge into promiscuous mode) makes it work, and exiting tcpdump breaks it again :)
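If it is promiscuous mode itself that matters, rather than anything else tcpdump does, toggling it directly should show the same effect. A minimal check, assuming standard iproute2 on the runner (not something that was run in the captured session):

    # Hypothetical check: put the bridge into promiscuous mode without tcpdump
    sudo ip link set lxdbr0 promisc on     # pings should succeed if promisc is the trigger
    sudo ip link set lxdbr0 promisc off    # and time out again once it is turned off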

tomponline • Mar 06 '24 14:03

`lxc network set lxdbr0 ipv4.nat=false` fixes it.

tomponline • Mar 06 '24 15:03

I'm considering whether we should alter the SNAT rule so that it only applies to traffic leaving the bridge via a non-bridge interface, e.g.:

    nft add rule inet lxd pstrt.lxdbr0 ip saddr 10.10.10.0/24 ip daddr != 10.10.10.0/24 oif != lxdbr0 masquerade

This also fixes it, by not performing SNAT on intra-network traffic when the source address doesn't match that of the main network.
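For illustration, the amended chain would look roughly like this; a sketch assuming the same guard is added to both address families and using `oifname` for a name-based match, not output captured from the runner:

    table inet lxd {
    	chain pstrt.lxdbr0 {
    		type nat hook postrouting priority srcnat; policy accept;
    		# only masquerade traffic that actually leaves via another interface
    		ip saddr 10.10.10.0/24 ip daddr != 10.10.10.0/24 oifname != "lxdbr0" masquerade
    		ip6 saddr fd42:4242:4242:1010::/64 ip6 daddr != fd42:4242:4242:1010::/64 oifname != "lxdbr0" masquerade
    	}
    }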

tomponline • Mar 06 '24 16:03

https://github.com/canonical/lxd-ci/actions/runs/11300430475/job/31433290050?pr=311#step:10:3879 has the `rmmod br_netfilter` workaround (sketched below the output) but failed nevertheless:

==> Ping internal and external NIC route addresses over peer connection
+ lxc exec ovn2 --project=ovn2 -- ping -nc1 -4 -w5 198.51.100.1
PING 198.51.100.1 (198.51.100.1) 56(84) bytes of data.

--- 198.51.100.1 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4078ms

That said, `tests/network-ovn` "feels" more reliable... for what it's worth.
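For context, the `rmmod br_netfilter` workaround boils down to keeping bridged traffic out of the iptables/nftables hooks; roughly either of the following, neither taken from the CI logs:

    # Option 1 (hypothetical): unload the module entirely, if nothing still holds a reference to it
    sudo modprobe -r br_netfilter
    # Option 2 (hypothetical): keep the module loaded but stop it handing bridged traffic to the IP hooks
    sudo sysctl -w net.bridge.bridge-nf-call-iptables=0
    sudo sysctl -w net.bridge.bridge-nf-call-ip6tables=0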

simondeziel • Oct 11 '24 23:10

Not sure why it closed the issue; I've just rebased my fork...

MusicDin • Oct 14 '24 07:10

I don't know why other people's forks keep closing this.

tomponline • Oct 15 '24 11:10