Debug Help - Container loses connection (healthcheck fails) but no errors seen
I am using TrueNAS SCALE 24.10.1, kernel 6.6.44.
I find that after some amount of time (I'm currently testing whether it's consistent; it's minutes/hours, not days) the VPN container's health check fails and access via the tunnel is lost, but there is nothing in the container output to suggest anything is wrong.
I have tried using KEEPALIVE and it does not appear to help. Here is my docker-compose:
services:
  vpn:
    cap_add:
      - NET_ADMIN
    environment:
      - LOC=uk_manchester
      - USER=<removed>
      - PASS=<removed>
      - LOCAL_NETWORK=192.168.1.20/32,192.168.1.2/32
      - KEEPALIVE=5
      - DEBUG=1
    healthcheck:
      interval: 60s
      retries: 3
      start_interval: 3s
      start_period: 30s
      test: ping -c 1 www.google.com || exit 1
      timeout: 10s
    image: thrnz/docker-wireguard-pia
    pull_policy: always
    restart: always
    sysctls:
      - net.ipv4.conf.all.src_valid_mark=1
      - net.ipv6.conf.default.disable_ipv6=1
      - net.ipv6.conf.all.disable_ipv6=1
      - net.ipv6.conf.lo.disable_ipv6=1
I am currently running tcpdump on the container and things look good to me: I can see the ping every 60 seconds and what I assume is a keepalive every 5 seconds.
16:52:23.816627 wg0 Out IP 10.20.185.147.33549 > 10.0.0.243.53: 26087+ A? www.google.com. (32)
16:52:23.816639 wg0 Out IP 10.20.185.147.33549 > 10.0.0.242.53: 26087+ A? www.google.com. (32)
16:52:23.816644 wg0 Out IP 10.20.185.147.33549 > 10.0.0.243.53: 26310+ AAAA? www.google.com. (32)
16:52:23.816650 wg0 Out IP 10.20.185.147.33549 > 10.0.0.242.53: 26310+ AAAA? www.google.com. (32)
16:52:23.816653 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 96
16:52:23.816685 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 96
16:52:23.816692 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 96
16:52:23.816697 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 96
16:52:23.835154 eth0 In IP 45.133.172.245.1337 > 9b6e5a5981f5.51177: UDP, length 128
16:52:23.835156 eth0 In IP 45.133.172.245.1337 > 9b6e5a5981f5.51177: UDP, length 112
16:52:23.835182 wg0 In IP 10.0.0.242.53 > 10.20.185.147.33549: 26310 1/0/0 AAAA 2a00:1450:4001:80f::2004 (60)
16:52:23.835185 eth0 In IP 45.133.172.245.1337 > 9b6e5a5981f5.51177: UDP, length 128
16:52:23.835187 eth0 In IP 45.133.172.245.1337 > 9b6e5a5981f5.51177: UDP, length 112
16:52:23.835192 wg0 In IP 10.0.0.242.53 > 10.20.185.147.33549: 26087 1/0/0 A 142.250.185.100 (48)
16:52:23.835196 wg0 In IP 10.0.0.243.53 > 10.20.185.147.33549: 26310 1/0/0 AAAA 2a00:1450:4001:80f::2004 (60)
16:52:23.835199 wg0 In IP 10.0.0.243.53 > 10.20.185.147.33549: 26087 1/0/0 A 142.250.185.100 (48)
16:52:23.835319 wg0 Out IP 10.20.185.147 > fra16s49-in-f4.1e100.net: ICMP echo request, id 253, seq 0, length 64
16:52:23.835356 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 128
16:52:23.872240 eth0 In IP 45.133.172.245.1337 > 9b6e5a5981f5.51177: UDP, length 128
16:52:23.872312 wg0 In IP fra16s49-in-f4.1e100.net > 10.20.185.147: ICMP echo reply, id 253, seq 0, length 64
16:52:28.855379 eth0 In ARP, Request who-has 9b6e5a5981f5 tell 172.16.1.1, length 28
16:52:28.855461 eth0 Out ARP, Reply 9b6e5a5981f5 is-at 02:42:ac:10:01:02 (oui Unknown), length 28
16:52:29.111491 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 32
16:52:34.235163 eth0 Out ARP, Request who-has 172.16.1.1 tell 9b6e5a5981f5, length 28
16:52:34.235240 eth0 In ARP, Reply 172.16.1.1 is-at 02:42:f3:06:fd:2f (oui Unknown), length 28
16:52:34.235299 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 32
16:52:39.355230 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 32
16:52:44.471291 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 32
16:52:49.591168 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 32
16:52:54.711521 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 32
16:52:59.838008 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 32
16:53:04.951264 eth0 Out IP 9b6e5a5981f5.51177 > 45.133.172.245.1337: UDP, length 32
A) Is there anything glaringly obvious wrong here?
B) I presume the keepalives happening over eth0 are correct? (I expected wg0, but I don't know a huge amount about the WireGuard protocol.)
C) Can anyone suggest anything additional I can do to understand what is going wrong?
I am certain I have used this container in the past and it worked absolutely fine. I stopped using it a while ago and am now setting things back up again. The only things that have changed in that time are the OS (moved to a newer kernel + OS with TrueNAS) and my internet provider (now BRSK in the UK; they use CGNAT, but I also pay for the static IPv4 option).
Everything works perfectly when the container is first started, all services attached to the container function as expected.
If it's important: my router runs OpenWrt, and it's not impossible I've got something set up wrong on there. I have a few VLANs and such, but other than that it's fairly standard.
I will post the output from the container and the tcpdump when it eventually stops working (I have just set up this test for the first time, removing any containers attached to the VPN container to rule out any issues there).
Nothing there stands out as being obviously wrong. Once the VPN connection is up the container scripts don't do an awful lot (apart from the port forwarding script) and just sleep until the container shuts down. A 'disconnect' wouldn't show in the logs as such, even with DEBUG=1 set, as that just prints out bash commands as they're run.
I'm not sure how the keepalive packets work technically, but I think they're more a way of keeping things alive NAT-wise for incoming packets - see the PersistentKeepalive section of the WireGuard documentation. They shouldn't be needed if there's regular tunnel traffic; in this case the regular healthcheck pings would probably prevent any issues. Maybe they're not considered internal tunnel traffic as such, which would explain why only the encrypted packets show up on their way out via eth0 in the traffic dump.
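For what it's worth, the KEEPALIVE value ends up as a per-peer setting in the generated WireGuard config, something along these lines (the key is a placeholder; the endpoint matches the one in your dump). The 32-byte UDP packets every ~5 seconds on eth0 in your capture look consistent with this:

```ini
[Peer]
# PersistentKeepalive sends a small encrypted packet to the endpoint
# every N seconds; its main purpose is keeping NAT/firewall mappings
# open so incoming packets can still reach us
PublicKey = <server-public-key>
Endpoint = 45.133.172.245:1337
PersistentKeepalive = 5
```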
I don't think WireGuard "goes down" in the same way that OpenVPN might, so while the remote VPN endpoint might stop responding, I'm not sure anything changes inside the container. Routing-wise everything should still go down the tunnel, just without receiving any response from the other end.
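One concrete thing worth checking when it next fails is the handshake age: `wg show wg0 latest-handshakes` prints a unix timestamp of the last completed handshake, and WireGuard renegotiates roughly every couple of minutes while traffic is flowing, so an old timestamp while the healthcheck is still sending pings would confirm the remote end has gone quiet. A rough sketch (the 180-second threshold is just a guess):

```shell
#!/bin/sh
# Seconds elapsed since a given handshake timestamp.
handshake_age() {
    # $1 = unix timestamp, i.e. the second field of
    # 'wg show wg0 latest-handshakes' output
    echo $(( $(date +%s) - $1 ))
}

# Inside the container you'd feed it the real value, e.g.:
#   last=$(wg show wg0 latest-handshakes | awk '{print $2}')
#   age=$(handshake_age "$last")
#   [ "$age" -gt 180 ] && echo "stale: last handshake ${age}s ago"
```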
I think PIA's servers stop responding after some time idle, but I'm not sure how long that is. The generated WireGuard keys don't last forever either. The connection would also go down if/when PIA do stuff at their end (restarts/changes etc.), but that probably isn't very often. I'm not sure whether using KEEPALIVE alone is enough to prevent an idle timeout - since it's at the protocol level, it might not count as traffic passing through the VPN at their end (if that makes sense) - but I've seen connections last for weeks with only the healthcheck ping traffic going on.
I'll let you know if anything comes to mind, but setup-wise it looks fine as-is.
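One small tweak that might help narrow it down: in your dump the DNS lookups also go through the tunnel, so a DNS failure and a tunnel failure look identical to the current ping test. Pinging an IP literal alongside (or instead of) the hostname takes DNS out of the picture - something like this, where the target IP is just an example:

```yaml
    healthcheck:
      # an IP literal skips the in-tunnel DNS lookup, so a failure here
      # points at the tunnel itself rather than name resolution
      # (8.8.8.8 is just an example target)
      test: ping -c 1 8.8.8.8 || exit 1
      interval: 60s
      retries: 3
      timeout: 10s
```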
I've been having the same issue for the past few months. Simply restarting the container does not resolve it; I have to fully recreate the container each time. Nothing shows in the logs, but the health check fails, and running curl ifconfig.me hangs until I recreate the container.