DCO Netlink Message Mix-up on 2.6 (Fixed in 2.7 with OpenVPN#793)
Describe the bug We are seeing one of the behaviors described in OpenVPN#793 as a possible other symptom of the same "netlink message async receive mixup". With OpenVPN 2.6.15 (the dsommers RPM) running DCO (on Alma 9.6) when clients disconnect from the server there is a chance that the stats receive path intercepts messages related to the disconnect and the disconnect is never properly processed by user space. This means disconnect scripts and log messages are never invoked and that the client is still seen as connected from the user space perspective.
We see kernel DCO logs indicating the peer was deleted on the kernel side, but user space keeps the connection around indefinitely. There is a netlink error (-4): Try again that's reported, likely because the sequence is send/recv for stats (gets all netlink messages) but poll already said to read the netlink socket for the disconnect, so that happens too. When that happens, there's no message to read and -NLE_AGAIN is returned. Intermixing OpenVPN logs from verb 4 and kernel logs with tail on both while 5 UDP clients disconnect then the keep alive pings time out:
...
2025-10-30 13:30:09 us=577611 device-116/147.136.246.4:45347 Data Channel: cipher 'AES-256-GCM', peer-id: 0
2025-10-30 13:30:09 us=577684 device-116/147.136.246.4:45347 Timers: ping 2, ping-restart 20
2025-10-30 13:30:09 us=577703 device-116/147.136.246.4:45347 Protocol options: explicit-exit-notify 1, protocol-flags cc-exit tls-ekm dyn-tls-crypt
2025-10-30 13:30:10 us=151461 device-117/147.136.246.4:26209 Data Channel: cipher 'AES-256-GCM', peer-id: 0
2025-10-30 13:30:10 us=151537 device-117/147.136.246.4:26209 Timers: ping 2, ping-restart 20
2025-10-30 13:30:10 us=151552 device-117/147.136.246.4:26209 Protocol options: explicit-exit-notify 1, protocol-flags cc-exit tls-ekm dyn-tls-crypt
2025-10-30 13:30:33 us=953242 dco_do_read: netlink reports error (-4): Try again
Oct 30 13:30:33 tc3-vpn-0003 kernel: tun15: deleting peer with id 3, reason 2
2025-10-30 13:30:34 us=976885 device-115/147.136.246.4:19379 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Oct 30 13:30:34 tc3-vpn-0003 kernel: tun15: deleting peer with id 4, reason 2
2025-10-30 13:30:35 us=489924 device-114/147.136.246.4:54055 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Oct 30 13:30:35 tc3-vpn-0003 kernel: tun15: deleting peer with id 6, reason 2
Oct 30 13:30:35 tc3-vpn-0003 kernel: tun15: deleting peer with id 0, reason 2
2025-10-30 13:30:36 us=932 device-117/147.136.246.4:26209 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Oct 30 13:30:36 tc3-vpn-0003 kernel: tun15: deleting peer with id 2, reason 2
There are only 3 ping expired notices but 5 peers deleted and the status file still shows 2 connected devices that won't ever clean up. Turning up more verbose logging it's clear that dco_do_read is called right before the Try again but fails. Adding loops to repeatedly try or increasing message buffer sizes did not resolve that issue. Adding some extra logging shows unusual error (without log) paths being taken in the stats gathering receive callback.
To Reproduce
- Set up OpenVPN DCO on a Linux server; we have status output every 10s and relatively aggressive keepalives:
local 100.97.0.3
port 1194
proto udp
dev tun15
topology subnet
tun-mtu 1400
dh none
keepalive 2 10
max-clients 511
user openvpn
group openvpn
persist-key
persist-tun
status /var/log/openvpn/status/openvpn-status.log 10
verb 4
explicit-exit-notify 1
server 172.22.0.0 255.255.254.0 nopool
client-config-dir /etc/openvpn/ccd/
push "redirect-gateway def1 bypass-dhcp ipv6"
push "block-ipv6"
push "dhcp-option DNS 8.8.8.8"
push "dhcp-option DNS 1.1.1.1"
push "dhcp-option DNS 8.8.4.4"
push "dhcp-option DNS 1.0.0.1"
ccd-exclusive
management /run/openvpn-server/management.sock unix
<cert>
...
- Loop over connecting multiple clients simultaneously (we used docker containers) and then disconnect them all at about the same time (we just put a "sleep 5" then killed the client process)
- Watch logs and status files
Expected behavior All devices to be disconnected in user space properly.
Version information (please complete the following information):
- OS: Alma 9.6
- OpenVPN version: 2.6.15-1 (dsommers RPM from Copr)
Additional context This was briefly discussed on IRC with @cron2 - he asked for a ticket and @ordex to get involved. I will be trying to cherry pick a couple of commits (a699681 and f353b71) to see if that creates a compiling/working binary since f353b71 resolved this same issue already for the 2.7 code base and will report back if that was successful.
Technically the bug should be the same as the one in 2.7. I don't recall why we decided to not backport it to 2.6. But please, feel free to backport those commits and give it a try. Thanks!
I did get a back-port of two commits and a part of a third and it did resolve one of the issues (wrong callbacks being used). However there's a related issue still open as of today on both 2.6 and 2.7:
From multi.c just looking at the stats path from timers (and I'm pretty sure this same issue exists on many paths): The event loop calls:
multi_process_per_second_timers -> multi_process_per_second_timers_dowork -> multi_print_status
multi_print_status jumps to linux_dco.c to call
dco_get_peer_stats_multi -> dco_get_peer -> ovpn_nl_msg_send -> ovpn_nl_recvmsgs -> nl_recvmsgs
At that point if there are multiple messages not received, the callback is called multiple times -- once per message. The issue is the callback (ovpn_handle_msg) calls e.g. ovpn_handle_peer_del_ntf (or ovpn_handle_peer_float_ntf or ovpn_handle_key_swap_ntf) and those all just set flags on the dco object. Those flags are normally handled by multi_process_incoming_dco (via switch (dco->dco_del_peer_reason)) but if you track the call chain above back through from the callback, multi_process_incoming_dco isn't anywhere in the process.
This can be seen playing out when I add extra log statements to track function enter/exits even at just verb 6:
...
...
2025-10-30 23:26:12 us=685802 device-117/147.136.246.4:50604 peer-id=4 Data Channel: cipher 'AES-256-GCM', peer-id: 0
2025-10-30 23:26:12 us=685846 device-117/147.136.246.4:50604 peer-id=4 Timers: ping 2, ping-restart 20
2025-10-30 23:26:12 us=685857 device-117/147.136.246.4:50604 peer-id=4 Protocol options: explicit-exit-notify 1, protocol-flags cc-exit tls-ekm dyn-tls-crypt
2025-10-30 23:26:23 us=926455 dco_get_peer_stats_multi
2025-10-30 23:26:23 us=926530 dco_get_peer_stats_multi: entering ovpn_nl_msg_send
2025-10-30 23:26:23 us=926594 dco_get_peer_stats_multi: entering ovpn_nl_recvmsgs
2025-10-30 23:26:23 us=926635 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:23 us=926654 ovpn-dco: Received message without ifindex
2025-10-30 23:26:23 us=926665 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:23 us=926678 ovpn-dco: Received message without ifindex
2025-10-30 23:26:23 us=926689 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:23 us=926699 ovpn-dco: Received message without ifindex
2025-10-30 23:26:23 us=926708 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:23 us=926724 ovpn-dco: Received message without ifindex
2025-10-30 23:26:23 us=926738 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:23 us=926747 ovpn-dco: Received message without ifindex
2025-10-30 23:26:23 us=926762 dco_get_peer_stats_multi: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:23 us=926771 dco_get_peer_stats_multi: leaving ovpn_nl_msg_send
2025-10-30 23:26:23 us=926798 leaving dco_get_peer_stats_multi
2025-10-30 23:26:35 us=616884 dco_get_peer_stats_multi
2025-10-30 23:26:35 us=616987 dco_get_peer_stats_multi: entering ovpn_nl_msg_send
2025-10-30 23:26:35 us=617063 dco_get_peer_stats_multi: entering ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=617120 ovpn-dco: received netlink message type=35 cmd=3 flags=0000
2025-10-30 23:26:35 us=617140 ovpn-dco: received CMD_DEL_PEER, ifindex: 3, peer-id 0, reason: 2
2025-10-30 23:26:35 us=617164 dco_get_peer_stats_multi: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=617179 dco_get_peer_stats_multi: entering ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=617200 ovpn-dco: received netlink message type=35 cmd=3 flags=0000
2025-10-30 23:26:35 us=617215 ovpn-dco: received CMD_DEL_PEER, ifindex: 3, peer-id 2, reason: 2
2025-10-30 23:26:35 us=617230 dco_get_peer_stats_multi: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=617244 dco_get_peer_stats_multi: entering ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=617267 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:35 us=617281 ovpn-dco: Received message without ifindex
2025-10-30 23:26:35 us=617297 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:35 us=617310 ovpn-dco: Received message without ifindex
2025-10-30 23:26:35 us=617325 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:35 us=617340 ovpn-dco: Received message without ifindex
2025-10-30 23:26:35 us=617361 dco_get_peer_stats_multi: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=617385 dco_get_peer_stats_multi: leaving ovpn_nl_msg_send
2025-10-30 23:26:35 us=617400 leaving dco_get_peer_stats_multi
2025-10-30 23:26:35 us=617677 dco_do_read
2025-10-30 23:26:35 us=617698 dco_do_read: entering ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=617716 dco_do_read: netlink reports error (-4): Try again
2025-10-30 23:26:35 us=617731 dco_do_read: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=617751 device-115/147.136.246.4:27394 peer-id=2 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
2025-10-30 23:26:35 us=617848 dco_get_peer_stats_multi
2025-10-30 23:26:35 us=617865 dco_get_peer_stats_multi: entering ovpn_nl_msg_send
2025-10-30 23:26:35 us=617906 dco_get_peer_stats_multi: entering ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=617953 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:35 us=617970 ovpn-dco: Received message without ifindex
2025-10-30 23:26:35 us=617980 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:35 us=617997 ovpn-dco: Received message without ifindex
2025-10-30 23:26:35 us=618016 ovpn-dco: received netlink message type=35 cmd=7 flags=0x0002
2025-10-30 23:26:35 us=618034 ovpn-dco: Received message without ifindex
2025-10-30 23:26:35 us=618062 dco_get_peer_stats_multi: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:35 us=618080 dco_get_peer_stats_multi: leaving ovpn_nl_msg_send
2025-10-30 23:26:35 us=618096 leaving dco_get_peer_stats_multi
2025-10-30 23:26:36 us=128915 dco_do_read
2025-10-30 23:26:36 us=129011 dco_do_read: entering ovpn_nl_recvmsgs
2025-10-30 23:26:36 us=129048 ovpn-dco: received netlink message type=35 cmd=3 flags=0000
2025-10-30 23:26:36 us=129067 ovpn-dco: received CMD_DEL_PEER, ifindex: 3, peer-id 1, reason: 2
2025-10-30 23:26:36 us=129083 dco_do_read: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:36 us=129100 device-114/147.136.246.4:23602 peer-id=1 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
2025-10-30 23:26:36 us=129136 dco_get_peer_stats_multi
2025-10-30 23:26:36 us=129163 dco_get_peer_stats_multi: entering ovpn_nl_msg_send
2025-10-30 23:26:36 us=129233 dco_get_peer_stats_multi: entering ovpn_nl_recvmsgs
2025-10-30 23:26:36 us=129256 ovpn-dco: received netlink message type=35 cmd=3 flags=0000
2025-10-30 23:26:36 us=129273 ovpn-dco: received CMD_DEL_PEER, ifindex: 3, peer-id 3, reason: 2
2025-10-30 23:26:36 us=129288 dco_get_peer_stats_multi: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:36 us=129306 dco_get_peer_stats_multi: entering ovpn_nl_recvmsgs
2025-10-30 23:26:36 us=129325 ovpn-dco: received netlink message type=35 cmd=3 flags=0000
2025-10-30 23:26:36 us=129340 ovpn-dco: received CMD_DEL_PEER, ifindex: 3, peer-id 4, reason: 2
2025-10-30 23:26:36 us=129354 dco_get_peer_stats_multi: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:36 us=129364 dco_get_peer_stats_multi: entering ovpn_nl_recvmsgs
2025-10-30 23:26:36 us=129384 dco_get_peer_stats_multi: leaving ovpn_nl_recvmsgs
2025-10-30 23:26:36 us=129401 dco_get_peer_stats_multi: leaving ovpn_nl_msg_send
2025-10-30 23:26:36 us=129419 leaving dco_get_peer_stats_multi
In this case dco_get_peer_stats_multi at 23:26:23 first gets 5 messages (one per client) without ifindex - not sure what's up there. Then dco_get_peer_stats_multi is called again at 23:26:35 and gets a mix of messages including CMD_DEL_PEER. But the return path through the call chain as laid out at the top of this comment doesn't go through anything that handles the fields set by CMD_DEL_PEER, so nothing happens as a result.
I added a new commit to my fork that resolves the multi-callback issue. It does seem to properly disconnect devices fully now; I can no longer reproduce the "some devices stick around forever in user space" issue I was seeing before. That may be considered a "new bug" and warrant a new issue for tracking and looks like it needs resolved for both 2.6 and 2.7; please reach out if I can do anything else to help get these bug fixes resolved but at this point I believe the core OpenVPN team may be better equipped to finalize this work.
PR updated to just address this issue; the other issue I found is now #900