Bug: Connectivity is lost once gluetun container is restarted

Open rakbladsvalsen opened this issue 3 years ago • 64 comments

Is this urgent?: No (kinda it is, since this causes complete connection loss if this "bug" happens)

Host OS: Tested on both Fedora 34 and (up-to-date) Arch Linux ARM (32bit/RPi 4B)

CPU arch or device name: amd64 & armv7

What VPN provider are you using: NordVPN

What are you using to run your container?: Docker Compose

What is the version of the program:

x64 & armv7: Running version latest built on 2021-09-23T17:23:28Z (commit 985cf7b)

Steps to reproduce issue:

  1. Using the recommended docker-compose.yml, configure gluetun and another container (in my case, xyz, though it can be something like qbittorrent or whatever you want) to use gluetun's network stack. Publish xyz's ports through gluetun's network stack.
  2. Either: a) restart gluetun using good ol' docker restart gluetun, or b) manually cause a temporary network problem in such way that gluetun container dies/exits. Then restart gluetun.
  3. Now try to use xyz through its published ports: you'll receive a connection refused error, unless you restart the xyz service again. You can also docker exec -it into the container and run curl/wget/ping/etc. to see that it has no connectivity.

Expected behavior: xyz should have internet connectivity through gluetun's network stack and be accessible through gluetun's published/exposed ports, even if gluetun is restarted. This is, unfortunately, not the case: xyz's network stack just dies, no data in, no data out.

Additional notes:

  1. I did use FIREWALL_OUTBOUND_SUBNETS - didn't make a difference.
  2. I noticed quite interesting stuff once gluetun is restarted: a) Routing entries from containers using network_mode: service:gluetun completely disappear. b) Restarting gluetun doesn't bring back original routing tables. c) NetworkMode seems to be okay.

Terminal example

# At this point, gluetun has been manually restarted. Then I exec -it'd into an affected container that was using gluetun's network stack:
/app # ip ro sh 
/app # 
[root@fedora pepe]# docker restart xyz
[root@fedora pepe]# docker exec -it xyz /bin/sh 
/app # ip ro sh
0.0.0.0/1 via 10.8.1.1 dev tun0 
default via 172.17.0.1 dev eth0 
10.8.1.0/24 dev tun0 scope link  src 10.8.1.4 
37.120.209.219 via 172.17.0.1 dev eth0 
128.0.0.0/1 via 10.8.1.1 dev tun0 
172.17.0.0/16 dev eth0 scope link  src 172.17.0.2 
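
For anyone who wants to detect this state from a script, here is a minimal Python sketch that classifies the output of `ip ro sh` captured from an attached container. The function names are made up for illustration, and it only assumes what the transcript above shows: an affected container prints an empty routing table, while a healthy one lists a `default` route.

```python
def looks_detached(ip_route_output: str) -> bool:
    """True when the routing table is empty, the symptom shown above."""
    return not ip_route_output.strip()


def has_default_route(ip_route_output: str) -> bool:
    """True when any line of `ip route show` output starts with 'default'."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "default":
            return True
    return False
```

How the output is captured (for example through docker exec) is left out on purpose; the functions only look at the text.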

Brief docker inspect output from affected container

# snip
            "NetworkMode": "container:f77af999d9de92af66094dd9db0f854f1a2da9ceabddc47239bc5b89f577247f",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "unless-stopped",
                "MaximumRetryCount": 0
            },

f77[...] is gluetun's container ID.
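
Since the attachment is recorded as `container:<ID>` in `NetworkMode`, the containers that depend on gluetun can be listed from parsed `docker inspect` output. A rough Python sketch follows; `find_dependents` is a hypothetical helper, and it assumes the usual inspect layout where `NetworkMode` sits under `HostConfig` and the full ID under `Id`.

```python
from typing import Any


def find_dependents(inspected: list[dict[str, Any]], parent_id: str) -> list[str]:
    """Return IDs of containers whose network stack is the given container.

    `inspected` is the JSON array printed by `docker inspect`, already
    parsed (for example with json.loads). `parent_id` must be the full
    container ID, since that is what NetworkMode stores.
    """
    wanted = f"container:{parent_id}"
    return [
        c["Id"]
        for c in inspected
        if c.get("HostConfig", {}).get("NetworkMode") == wanted
    ]
```

These are the containers that would need a restart after gluetun itself is restarted.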

Full gluetun logs:

2021/09/24 16:39:47 INFO Alpine version: 3.14.2
2021/09/24 16:39:47 INFO OpenVPN 2.4 version: 2.4.11
2021/09/24 16:39:47 INFO OpenVPN 2.5 version: 2.5.2
2021/09/24 16:39:47 INFO Unbound version: 1.13.2
2021/09/24 16:39:47 INFO IPtables version: v1.8.7
2021/09/24 16:39:47 INFO Settings summary below:
|--VPN:
   |--Type: openvpn
   |--OpenVPN:
      |--Version: 2.5
      |--Verbosity level: 1
      |--Network interface: tun0
      |--Run as root: enabled
   |--Nordvpn settings:
      |--Regions: mexico, sweden
      |--OpenVPN selection:
         |--Protocol: udp
|--DNS:
   |--Plaintext address: 1.1.1.1
   |--DNS over TLS:
      |--Unbound:
          |--DNS over TLS providers:
              |--Cloudflare
          |--Listening port: 53
          |--Access control:
              |--Allowed:
                  |--0.0.0.0/0
                  |--::/0
          |--Caching: enabled
          |--IPv4 resolution: enabled
          |--IPv6 resolution: disabled
          |--Verbosity level: 1/5
          |--Verbosity details level: 0/4
          |--Validation log level: 0/2
          |--Username: 
      |--Blacklist:
         |--Blocked categories: malicious
         |--Additional IP networks blocked: 13
      |--Update: every 24h0m0s
|--Firewall:
   |--Outbound subnets: 192.168.0.0/24
|--Log:
   |--Level: INFO
|--System:
   |--Process user ID: 1000
   |--Process group ID: 1000
   |--Timezone: REDACTED
|--Health:
   |--Server address: 127.0.0.1:9999
   |--Address to ping: github.com
   |--VPN:
      |--Initial duration: 6s
      |--Addition duration: 5s
|--HTTP control server:
   |--Listening port: 8000
   |--Logging: enabled
|--Public IP getter:
   |--Fetch period: 12h0m0s
   |--IP file: /tmp/gluetun/ip
|--Github version information: enabled
2021/09/24 16:39:47 INFO routing: default route found: interface eth0, gateway 172.17.0.1
2021/09/24 16:39:47 INFO routing: local ethernet link found: eth0
2021/09/24 16:39:47 INFO routing: local ipnet found: 172.17.0.0/16
2021/09/24 16:39:47 INFO routing: default route found: interface eth0, gateway 172.17.0.1
2021/09/24 16:39:47 INFO routing: adding route for 0.0.0.0/0
2021/09/24 16:39:47 INFO firewall: firewall disabled, only updating allowed subnets internal list
2021/09/24 16:39:47 INFO routing: default route found: interface eth0, gateway 172.17.0.1
2021/09/24 16:39:47 INFO routing: adding route for 192.168.0.0/24
2021/09/24 16:39:47 INFO TUN device is not available: open /dev/net/tun: no such file or directory; creating it...
2021/09/24 16:39:47 INFO firewall: enabling...
2021/09/24 16:39:47 INFO firewall: enabled successfully
2021/09/24 16:39:47 INFO dns over tls: using plaintext DNS at address 1.1.1.1
2021/09/24 16:39:47 INFO healthcheck: listening on 127.0.0.1:9999
2021/09/24 16:39:47 INFO http server: listening on :8000
2021/09/24 16:39:47 INFO firewall: setting VPN connection through firewall...
2021/09/24 16:39:47 INFO openvpn: OpenVPN 2.5.2 armv7-alpine-linux-musleabihf [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on May  4 2021
2021/09/24 16:39:47 INFO openvpn: library versions: OpenSSL 1.1.1l  24 Aug 2021, LZO 2.10
2021/09/24 16:39:47 INFO openvpn: TCP/UDP: Preserving recently used remote address: [AF_INET]86.106.103.27:1194
2021/09/24 16:39:47 INFO openvpn: UDP link local: (not bound)
2021/09/24 16:39:47 INFO openvpn: UDP link remote: [AF_INET]86.106.103.27:1194
2021/09/24 16:39:48 WARN openvpn: 'link-mtu' is used inconsistently, local='link-mtu 1633', remote='link-mtu 1634'
2021/09/24 16:39:48 WARN openvpn: 'comp-lzo' is present in remote config but missing in local config, remote='comp-lzo'
2021/09/24 16:39:48 INFO openvpn: [se-nl8.nordvpn.com] Peer Connection Initiated with [AF_INET]86.106.103.27:1194
2021/09/24 16:39:49 INFO openvpn: TUN/TAP device tun0 opened
2021/09/24 16:39:49 INFO openvpn: /sbin/ip link set dev tun0 up mtu 1500
2021/09/24 16:39:49 INFO openvpn: /sbin/ip link set dev tun0 up
2021/09/24 16:39:49 INFO openvpn: /sbin/ip addr add dev tun0 10.8.8.14/24
2021/09/24 16:39:49 INFO openvpn: Initialization Sequence Completed
2021/09/24 16:39:49 INFO dns over tls: downloading DNS over TLS cryptographic files
2021/09/24 16:39:50 INFO healthcheck: healthy!
2021/09/24 16:39:53 INFO dns over tls: downloading hostnames and IP block lists
2021/09/24 16:40:11 INFO dns over tls: init module 0: validator
2021/09/24 16:40:11 INFO dns over tls: init module 1: iterator
2021/09/24 16:40:11 INFO dns over tls: start of service (unbound 1.13.2).
2021/09/24 16:40:12 INFO dns over tls: generate keytag query _ta-4a5c-4f66. NULL IN
2021/09/24 16:40:13 INFO dns over tls: generate keytag query _ta-4a5c-4f66. NULL IN
2021/09/24 16:40:16 INFO dns over tls: ready
2021/09/24 16:40:18 INFO vpn: You are running on the bleeding edge of latest!
2021/09/24 16:40:19 INFO ip getter: Public IP address is 213.232.87.176 (Netherlands, North Holland, Amsterdam)

docker-compose.yml:

  gluetun:
    image: qmcgaw/gluetun
    container_name: gluetun
    restart: unless-stopped
    cap_add:
      - NET_ADMIN
    ports:
      - 4533:4533 #navidrome
    environment:
      - OPENVPN_USER=REDACTED
      - OPENVPN_PASSWORD=REDACTED
      - VPNSP=nordvpn
      - VPN_TYPE=openvpn
      - REGION=REDACTED
      - TZ=REDACTED
      - FIREWALL_OUTBOUND_SUBNETS=192.168.0.0/24

# navidrome (can be literally anything else)
  navidrome:
    image: deluan/navidrome:develop
    container_name: navidrome
    restart: unless-stopped
    environment:
      - PGID=1000
      - PUID=1000
    volumes:
      - dockervolume:/music:ro
    network_mode: service:gluetun
    depends_on:
      - gluetun

Nonetheless, I'd like to thank you for creating gluetun. I'd be more than happy to help you fix this issue if this is a gluetun bug. Hopefully it's a misconfiguration on my side.

rakbladsvalsen avatar Sep 24 '21 22:09 rakbladsvalsen

Hey there! Thanks for the detailed issue!

It is a well-known Docker problem I need to work around. Let's keep this open for now, although there is at least one duplicate issue about this problem somewhere in the issues.

Note this only happens if gluetun is updated and uses a different image (afaik).

For now, you might want to have all your gluetun and connected containers in a single docker-compose.yml and docker-compose down && docker-compose up -d them (what I do).

I'm developing https://github.com/qdm12/deunhealth and should add a feature tailored for this problem soon (give it 1-5 days), feel free to subscribe to releases on that side repo. That way it would watch your containers and restart your connected containers if gluetun gets updated & restarted.

qdm12 avatar Sep 24 '21 23:09 qdm12

Thank you for the answer @qdm12.

It does indeed seem to be a Docker problem, just as you said, and unfortunately they seem a bit reluctant to discuss possible solutions for the issue. :(

For the time being, there's a temporary ugly, brutal, but 100% working fix. Maybe it would be worth mentioning it in the wiki/docker-compose.yml example? Although there are some gotchas, since it completely replaces the original healthcheck command, and some images don't include either curl or wget. Currently I'm probing example.com every minute on child containers attached to gluetun's network stack and so far so good.

I just subscribed to deunhealth. It seems promising and probably even better than things like autoheal due to the network fix thing. I'll make sure to check it out in a week (or earlier, as you deem appropriate) and provide feedback/do some testing.

rakbladsvalsen avatar Sep 25 '21 18:09 rakbladsvalsen

Similar conversation in #504 to be concluded.

qdm12 avatar Sep 27 '21 00:09 qdm12

I have the same thing: when I restart Gluetun, it doesn't want to start the containers within the same network_mode. The only difference is that I configured it with: network_mode: 'container:VPN'.

I think when I restart or recreate the Gluetun container it gets a different ID.

What would be the solution to this problem?

ksbijvank avatar Oct 01 '21 14:10 ksbijvank

Stumbled across this issue while researching ways to restart dependent containers once gluetun is recreated with a new image (via Watchtower). https://github.com/qdm12/deunhealth seems like it might work, but I wanted to make sure I understand the use case.

I have a number of services with: network_mode: container:gluetun

However, when the gluetun container restarts, the dependent containers don't actually end up getting marked unhealthy; they just lose connectivity.

I'm wondering if you've updated deunhealth yet to include this function.

oester avatar Oct 04 '21 14:10 oester

No sorry, but I'll get to it soon.

Ideally, there is a way to re-attach the disconnected containers to gluetun without restarting them (I guess with Docker's Go API since I doubt the docker cli supports such thing). That would work by marking each connected container with a label to indicate this network re-attachment.

If there isn't, I'll set up something to cascade the restart from gluetun to connected containers, probably using labels to avoid any surprises (mark gluetun as a parent container with a unique id, and mark all connected containers as child containers with that same id).
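
To make the label idea concrete, here is a rough Python sketch of the cascade. It is not how deunhealth works; the label key `example.vpn-child-of` is invented for illustration, and the only Docker pieces it relies on are the `Config.Labels` field of `docker inspect` output and the `docker restart` command.

```python
import subprocess
from typing import Any, Callable, Sequence

# Hypothetical label key; child containers carry the parent's unique id as value.
CHILD_LABEL = "example.vpn-child-of"


def children_of(inspected: list[dict[str, Any]], parent_uid: str) -> list[str]:
    """Return IDs of containers labelled as children of `parent_uid`.

    `inspected` is parsed `docker inspect` output. Labels may be null
    in that output, hence the `or {}`.
    """
    result = []
    for c in inspected:
        labels = c.get("Config", {}).get("Labels") or {}
        if labels.get(CHILD_LABEL) == parent_uid:
            result.append(c["Id"])
    return result


def restart_containers(ids: Sequence[str],
                       run: Callable[..., Any] = subprocess.run) -> None:
    """Restart the given containers with `docker restart`; no-op if empty."""
    if not ids:
        return
    run(["docker", "restart", *ids], check=True)
```

Restricting the restart to explicitly labelled containers is the point of the design: nothing gets restarted unless it opted in.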

qdm12 avatar Oct 04 '21 14:10 qdm12

For the time being, if anyone wants a dirty, cheap solution, here's my current setup:

  autoheal:
   ... snip ...
  literallyanything:
    image: blahblah
    container_name: blahblah
    network_mode: service:gluetun
    restart: unless-stopped
    healthcheck:
      test: "curl -sf https://example.com  || exit 1"
      interval: 1m
      timeout: 10s
      retries: 1

This will only work with containers where curl is already preinstalled. There are docker images that include wget but not curl, in which case you can replace test command with wget --no-verbose --tries=1 --spider https://example.com/ || exit 1. You can also use qdm12's deunhealth instead of autoheal.
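
If an image ships neither curl nor wget but does include Python 3 (an assumption, check your image), the same probe can be sketched with the standard library only. The names `probe` and `exit_code` are made up for this example.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers HTTP errors and refused connections, OSError covers
        # socket timeouts, ValueError covers malformed URLs.
        return False


def exit_code(url: str, timeout: float = 10.0) -> int:
    """Map the probe result to a healthcheck-style exit code (0 = healthy)."""
    return 0 if probe(url, timeout) else 1
```

A healthcheck script would end with something like `sys.exit(exit_code("https://example.com"))`, mirroring the `|| exit 1` in the curl version.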

rakbladsvalsen avatar Oct 04 '21 21:10 rakbladsvalsen

Any progress or resolution to this, either in gluetun or deunhealth?

oester avatar Nov 10 '21 14:11 oester

I have bits and pieces for it, but I am moving country + visiting family + starting a new job right now, so it might take at least 2 weeks for me to finish it up, sorry about that. But it's at the top of my OSS things-to-do list, so it won't be forgotten :wink:

qdm12 avatar Nov 10 '21 14:11 qdm12

I'd also like to thank you for creating gluetun and to say this is a very good project. Any progress on this?

nfribeiro avatar Jan 12 '22 18:01 nfribeiro

Any update on this by any chance?

pau1h avatar Mar 07 '23 00:03 pau1h

I'm not really sure. I turned off the Watchtower container and since then my setup worked flawlessly. It's a workaround, but it's all I know so far.

iewl avatar Mar 08 '23 10:03 iewl

Any news or progress on this issue?

Manfred73 avatar Mar 09 '23 21:03 Manfred73

following

vdrover avatar May 09 '23 22:05 vdrover

Since I also have this problem, I would like to report it here and find out if and how it continues. Thank you!

knorrre avatar May 26 '23 12:05 knorrre

Having the same behavior. When gluetun is recreated, every other container in the same network_mode needs to also restart

karserasl avatar May 29 '23 09:05 karserasl

Stupid idea to solve this:

Gluetun is given the docker socket, and a list of containers to restart once it comes back up?

I do not think Docker is going to solve this bug. This bug has existed effectively forever.

aaomidi avatar Jul 23 '23 03:07 aaomidi

Welp, had to prove myself wrong: https://github.com/docker/compose/pull/10284

aaomidi avatar Jul 23 '23 04:07 aaomidi

This solution has worked well for me: https://github.com/qdm12/gluetun/issues/641#issuecomment-933856220. I've not had any issues with container connectivity after adding that. Seems like an all right fix to me.

ismay avatar Jul 23 '23 06:07 ismay

Following, I have this issue. Thank you for the awesome service.

Dunny88 avatar Jul 28 '23 22:07 Dunny88

@ismay The workaround wasn't bad, but doesn't work for distroless containers which don't have curl or wget or really anything to do a healthcheck from the inside

melyux avatar Jul 28 '23 23:07 melyux

@ismay The workaround wasn't bad, but doesn't work for distroless containers which don't have curl or wget or really anything to do a healthcheck from the inside

Ah yeah that's true

ismay avatar Jul 29 '23 06:07 ismay

Is there a chance this issue will be ever resolved? Thanks for providing gluetun, happy user for a while.

sjoerdschouten avatar Aug 26 '23 07:08 sjoerdschouten

The "network_mode: service:gluetun" statement is incredibly different from a normal "networks" statement.

People's hopes, and some people's expectations, here are a bit overblown (and the "bug" moniker likely doesn't help) ... if you use a container as a network stack component (router with layer features) ... which you are with that configuration ... and you restart that container, its state, and more importantly its infrastructure, is gone. Your application isn't going to recover.

The author may be able to provide runtime resilience in the container or even some high availability (see other PR where people want to have multiple gluetuns concurrently), but if wireguard or openvpn were to fall down all he can do is try and reestablish. All the states, port forwards, etc are all gone and will need to be reconstructed.

For example, complaining about connection refused while a router is offline (in this case it's ... off) ... yeah, not only is that vpn not running anymore, the whole stack (eg, network service) is not running. Let's be realistic.

With "network_mode" ... your container's networking is using another container for its networking ... that also provides additional network services (eg, vpn, proxies, dns, etc).

This gimmick of using the docker feature for easily aggregating containers into that network stack ... great, but this is the downside ... they are all aggregated into that stack.

If you filed "bugs" with docker such as "network_mode services should retain connections and state while service container offline and ..." it's not feasible ... in a single container. And docker plugins (an entirely different thing) have gone into different roadmaps.

Some events, like a vpn falling down, not causing the container to fully restart (and then all the aggregated containers) ... that is likely a resolvable thing.

gmillerd avatar Aug 26 '23 08:08 gmillerd

This is a bug with docker and how docker handles network_mode:container. Until they fix that bug, we're basically stuck.

aaomidi avatar Aug 26 '23 18:08 aaomidi

If anybody wants to give it a try, I have written cascandaliato/docker-restarter. Right now it covers only one scenario: if A depends on B and B restarts then restart A.

cascandaliato avatar Sep 23 '23 14:09 cascandaliato

Gluetun updated twice in a few minutes, so I in turn had to restart the whole stack depending on it, twice.

@cascandaliato I will try for sure.

karserasl avatar Sep 23 '23 14:09 karserasl

If anybody wants to give it a try, I have written cascandaliato/docker-restarter. Right now it covers only one scenario: if A depends on B and B restarts then restart A.

That works a treat. Thank you.

I did find I had to put it outside of the stack. If it was inside the stack, your container would attempt to restart itself but actually stop, and not restart the other containers. Possibly an issue on my side, though.

Blavkentropy1 avatar Sep 24 '23 08:09 Blavkentropy1

@Blavkentropy1, I've opened cascandaliato/docker-restarter#2 to keep track of that problem but I'll need your help to understand what's happening. Whenever you have time, we can continue the discussion in the other issue because I don't want to add noise to this one.

cascandaliato avatar Sep 24 '23 11:09 cascandaliato

Also having this issue sadly!

Drove myself mad thinking it was a problem with defining FIREWALL_OUTBOUND_SUBNETS.

frank-besson avatar Oct 10 '23 18:10 frank-besson