Adding pointopoint addresses to tun0 interface too quickly after opening intermittently fails quietly (openvpn)

Open collin-bachi-sp opened this issue 6 years ago • 10 comments

Issue Report

Bug

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1688.5.3
VERSION_ID=1688.5.3
BUILD_ID=2018-04-03-0547
PRETTY_NAME="Container Linux by CoreOS 1688.5.3 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"

Environment

What hardware/cloud provider/hypervisor is being used to run Container Linux?

CoreOS is running on an AWS EC2 instance. This bug was observed running inside Ubuntu and Alpine containers with host networking enabled. The failing command was issued by openvpn.

Expected Behavior

After the following commands run:

May 08 18:31:51 ip-10-0-1-98 docker[1293]: Tue May  8 18:31:51 2018 TUN/TAP device tun0 opened
May 08 18:31:51 ip-10-0-1-98 docker[1293]: Tue May  8 18:31:51 2018 TUN/TAP TX queue length set to 100
May 08 18:31:51 ip-10-0-1-98 docker[1293]: Tue May  8 18:31:51 2018 do_ifconfig, tt->did_ifconfig_ipv6_setup=0
May 08 18:31:52 ip-10-0-1-98 docker[1293]: Tue May  8 18:31:52 2018 /sbin/ifconfig tun0 172.16.1.18 pointopoint 172.16.1.17 mtu 1500

The pointopoint addresses should be added to the tun0 interface, and be visible in ifconfig output.

Actual Behavior

When the pointopoint command runs successfully, the interface looks like this:

tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet 172.16.1.18  netmask 255.255.255.255  destination 172.16.1.17
        inet6 fe80::eba2:2516:2600:acb5  prefixlen 64  scopeid 0x20<link>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 100  (UNSPEC)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9  bytes 1438 (1.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

When the pointopoint command fails (quietly), the interface looks like this:

tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet6 fe80::5ff2:ef48:6bbc:e9a8  prefixlen 64  scopeid 0x20<link>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 100  (UNSPEC)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 2276 (2.2 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
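
Because the failure is silent, the only reliable signal in the two outputs above is whether the inet ... destination ... line is present. A minimal sketch of that check, assuming net-tools style ifconfig output like the samples in this report (the function name has_p2p_address is hypothetical, not part of openvpn):

```python
import re

# Sample outputs copied from this report, trimmed to the relevant lines.
GOOD = """\
tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet 172.16.1.18  netmask 255.255.255.255  destination 172.16.1.17
        inet6 fe80::eba2:2516:2600:acb5  prefixlen 64  scopeid 0x20<link>
"""
BAD = """\
tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet6 fe80::5ff2:ef48:6bbc:e9a8  prefixlen 64  scopeid 0x20<link>
"""

# Matches an IPv4 "inet" line that also carries a "destination" (peer) field.
# "inet6" lines do not match because whitespace is required right after "inet".
INET_LINE = re.compile(
    r"^[ \t]*inet[ \t]+(\S+)[ \t].*\bdestination[ \t]+(\S+)", re.MULTILINE
)


def has_p2p_address(ifconfig_output, local, peer):
    """Return True if the output lists `local` with destination `peer`."""
    return any(
        m.group(1) == local and m.group(2) == peer
        for m in INET_LINE.finditer(ifconfig_output)
    )


print(has_p2p_address(GOOD, "172.16.1.18", "172.16.1.17"))
print(has_p2p_address(BAD, "172.16.1.18", "172.16.1.17"))
```

The first call checks the successful sample and the second the failed one, so a wrapper script could use the same test to decide whether the address actually landed on tun0.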

When I modified the openvpn source to add a 1 second sleep before the pointopoint command, the pointopoint command went from intermittently failing to succeeding every time.

Reproduction Steps

  1. (Run openvpn client in an ubuntu container)
  2. Bring up a tun0 interface, then immediately add pointopoint addresses

Other Information

I haven't yet been able to pinpoint which part of the stack is responsible for the bug. It's possibly an openvpn or docker issue, but I'm led to believe that it is a coreos issue, because:

  • the docker container is using host networking and has network admin privileges
  • examining the openvpn source, these commands appear to be run in a sequential, straightforward manner
  • adding a 1 second sleep before the pointopoint command resolved the issue, suggesting some sort of race condition (like the tun0 device hadn't fully come up)
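
If the cause really is a race, a fixed sleep could be replaced by "add, verify, retry". The following is only a sketch under assumptions: it uses the ip addr add form that appears later in this thread, it assumes that ip -4 -o addr show prints the pair as "inet LOCAL peer PEER/PREFIX", and the names add_peer_address, attempts and delay are made up for illustration. It is not how openvpn itself behaves.

```python
import subprocess
import time


def run(cmd):
    """Default runner: execute a command and capture its output as text."""
    return subprocess.run(cmd, capture_output=True, text=True)


def has_peer(show_output, local, peer):
    """Look for 'inet LOCAL peer PEER[/PREFIX]' in `ip -o addr show` output."""
    for line in show_output.splitlines():
        tokens = line.split()
        for i, tok in enumerate(tokens[:-3]):
            if (
                tok == "inet"
                and tokens[i + 1].split("/")[0] == local
                and tokens[i + 2] == "peer"
                and tokens[i + 3].split("/")[0] == peer
            ):
                return True
    return False


def add_peer_address(dev, local, peer, attempts=5, delay=0.2,
                     runner=run, sleep=time.sleep):
    """Add a point-to-point address, re-check it, and retry if it is missing.

    The exit status of `ip addr add` is deliberately ignored: the report says
    the failure is quiet, and a retry after a real success may simply report
    that the address already exists. Only the verification step decides.
    """
    for _ in range(attempts):
        runner(["ip", "addr", "add", "dev", dev, "local", local, "peer", peer])
        shown = runner(["ip", "-4", "-o", "addr", "show", "dev", dev])
        if has_peer(shown.stdout, local, peer):
            return True
        sleep(delay)  # give the device time to settle before the next try
    return False
```

Calling add_peer_address("tun0", "172.16.1.18", "172.16.1.17") needs the same network admin privileges as the original command. The runner and sleep parameters exist only so the loop can be exercised without touching a real interface.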

We have been running this implementation across a number of production servers, and didn't notice any issue until recently (~1mo ago).

collin-bachi-sp avatar May 08 '18 19:05 collin-bachi-sp

Is it possible to do so using the ip command, instead of ifconfig? ip generally issues a single netlink request (e.g. create device, set up, add address), so it's easier to see which step is failing.

squeed avatar May 08 '18 19:05 squeed

May 08 19:54:00 ip-10-0-1-98 docker[4475]: Tue May  8 19:54:00 2018 TUN/TAP device tun0 opened
May 08 19:54:00 ip-10-0-1-98 docker[4475]: Tue May  8 19:54:00 2018 TUN/TAP TX queue length set to 100
May 08 19:54:00 ip-10-0-1-98 docker[4475]: Tue May  8 19:54:00 2018 do_ifconfig, tt->did_ifconfig_ipv6_setup=0
May 08 19:54:00 ip-10-0-1-98 docker[4475]: Tue May  8 19:54:00 2018 /sbin/ip link set dev tun0 up mtu 1500
May 08 19:54:00 ip-10-0-1-98 docker[4475]: Tue May  8 19:54:00 2018 /sbin/ip addr add dev tun0 local 172.16.1.18 peer 172.16.1.17

This output is from openvpn running in an Alpine container. It defaulted to using ip here, whereas it defaulted to ifconfig in the Ubuntu container.

I observed the same intermittent failures and successes with this implementation. The logging here is unreliable (async). There are some errors below:

May 08 19:55:00 ip-10-0-1-98 docker[4475]: RTNETLINK answers: Network unreachable

But I don't know if they correspond with the ip addr add command. I do know that the command was unsuccessful, judging by the state of the tun0 interface in the ifconfig output.
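
One way to sidestep the asynchronous logging would be to run the two ip steps by hand and read each exit status and stderr directly, so an error is tied to the command that produced it. This is a rough sketch, assuming the same two commands that openvpn logged above; first_failing_step is a hypothetical helper name. If the failure is truly silent (exit status 0), this will not catch it, and the interface state still has to be checked afterwards.

```python
import subprocess

# The two commands openvpn logged in the Alpine container, split into argv lists.
STEPS = [
    ["ip", "link", "set", "dev", "tun0", "up", "mtu", "1500"],
    ["ip", "addr", "add", "dev", "tun0", "local", "172.16.1.18",
     "peer", "172.16.1.17"],
]


def first_failing_step(steps, runner=subprocess.run):
    """Run each step in order; return (command, stderr) for the first failure.

    Returns None if every step exits with status 0.
    """
    for cmd in steps:
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return cmd, result.stderr.strip()
    return None
```

Running first_failing_step(STEPS) inside the container (with network admin privileges) returns either None or the failing command together with its own error text.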

collin-bachi-sp avatar May 08 '18 19:05 collin-bachi-sp

@collin-bachi-sp did this use to work with a previous Container Linux release? If so, what was the last working kernel? Can you check whether the same issue happens on the latest alpha release (kernel 4.16)?

lucab avatar May 09 '18 09:05 lucab

I had the same problem, and this workaround works for me: https://github.com/kylemanna/docker-openvpn/issues/370

skopciewski avatar May 11 '18 20:05 skopciewski

@lucab I have the same issue with a different container (zerotier). I tried the latest alpha (1786.2.0), but the problem is still there.

The initial assignment works when joining a network (because the interface comes up and gets the IP "later", when the client is authorized), but after a restart of the service it is no longer able to assign the IP.

Containers I tried: zerotier/zerotier-containerized and zyclonite/zerotier

zyclonite avatar Jun 04 '18 07:06 zyclonite

Are you still seeing this on current releases of Container Linux?

bgilbert avatar Oct 06 '18 03:10 bgilbert

Yes, it still happened. The only difference was that directly after the update "boot" the interface came up, but after restarting only the process it failed (stable branch).

zyclonite avatar Oct 06 '18 08:10 zyclonite

I also still have that problem

ID=coreos
VERSION=1855.4.0

skopciewski avatar Oct 10 '18 09:10 skopciewski

Same here..

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1855.4.0
VERSION_ID=1855.4.0
BUILD_ID=2018-09-11-0003
PRETTY_NAME="Container Linux by CoreOS 1855.4.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

andrejvanderzee avatar Oct 22 '18 11:10 andrejvanderzee

This is still with us.

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2135.4.0
VERSION_ID=2135.4.0
BUILD_ID=2019-06-24-2257
PRETTY_NAME="Container Linux by CoreOS 2135.4.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

akunszt avatar Jul 12 '19 13:07 akunszt