linux dco: restructure peer-counter-at-instance-exit logic
So I'm testing gerrit 1403, and I have this nice script which is based on the "gremlin" config, which fires up 50 client instances and kills them 20 seconds later, simultaneously
AMI=$$
mkdir g$AMI || exit 1
N=50
SLEEPTIMER=25
i=1
PIDLIST=""
PASS=good
while [ $i -le $N ]
do
echo "gremlin$AMI-$i" >g$AMI/aup-$i.txt
echo $PASS >>g$AMI/aup-$i.txt
openvpn --config gremlin.conf \
--auth-user-pass g$AMI/aup-$i.txt \
--log-append g$AMI/openvpn-$i.log \
--connect-retry-max 3 &
PIDLIST="$PIDLIST $!"
i=$(( $i + 1 ))
done
echo "$N instances started, sleeping $SLEEPTIMER..."
sleep $SLEEPTIMER
kill -INT $PIDLIST
echo "killed$PIDLIST, wait.."
wait
(the username dance is only there so I can get away with --duplicate-cn and still see different clients in the server log)
so it works (thanks), but I get a lot of these
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: ovpn_handle_peer: parsing message for peer 28...
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_update_peer_stat / dco_read_bytes(28): 656
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_update_peer_stat / dco_write_bytes(28): 320
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_update_peer_stat / tun_read_bytes(28): 0
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_update_peer_stat / tun_write_bytes(28): 0
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process: netlink reports error (-25): Object busy
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=-25)
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process: netlink reports error (-25): Object busy
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=-25)
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process: netlink reports error (-25): Object busy
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=-25)
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process: netlink reports error (-25): Object busy
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=-25)
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process: netlink reports error (-25): Object busy
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=-25)
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process: netlink reports error (-25): Object busy
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=-25)
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process: netlink reports error (-25): Object busy
Nov 27 18:33:41 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=-25)
(the START/END lines are my own doing, to better see if something was handled "in one call" or "multiple calls"), and also this, after the last client has been killed
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process START
Nov 27 18:33:42 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_read_and_process END (ret=0)
... that is, 10+ events in the same second, with "nothing to do".
Both seem to be harmless (= the server has 0 pending clients, so nothing got lost, and behaves normally) but it looks not like it should behave...
Oh well, spoke too soon... this happened on the next run
Nov 27 18:41:28 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_get_peer: peer-id -1
Nov 27 18:41:28 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: ovpn-dco: received netlink message type=32 cmd=5 flags=0000
Nov 27 18:41:28 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: ovpn-dco: received CMD_PEER_DEL_NTF, ifindex: 8479, peer-id 23, reason: 2
Nov 27 18:41:28 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: gremlin4679-23/udp6:194.97.140.21:47712 peer-id=23 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Nov 27 18:41:28 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_get_peer: peer-id -1
Nov 27 18:41:28 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: dco_get_peer: netlink out of memory error
Nov 27 18:41:28 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: Exiting due to fatal error
Nov 27 18:41:28 ubuntu2004 tun-udp-p2mp-username-cn[1566721]: Closing ovpn-dco interface
Mmmmh.
Maybe this is not really working as perfectly as it should
dco_read_and_process START
ovpn-dco: received netlink message type=32 cmd=5 flags=0000
ovpn-dco: received CMD_PEER_DEL_NTF, ifindex: 8479, peer-id 8, reason: 2
gremlin3512-7/udp6:194.97.140.21:51901 peer-id=8 SIGTERM[soft,ovpn-dco: ping exp
ired] received, client-instance exiting
dco_get_peer: peer-id -1
ovpn-dco: received netlink message type=32 cmd=5 flags=0000
ovpn-dco: received CMD_PEER_DEL_NTF, ifindex: 8479, peer-id 3, reason: 2
gremlin3512-4/udp6:194.97.140.21:27749 peer-id=3 SIGTERM[soft,ovpn-dco: ping exp
ired] received, client-instance exiting
dco_get_peer: peer-id -1
ovpn-dco: received netlink message type=32 cmd=5 flags=0000
ovpn-dco: received CMD_PEER_DEL_NTF, ifindex: 8479, peer-id 1, reason: 2
gremlin3512-3/udp6:[2001:608:0:814::f000:21]:39156 peer-id=1 SIGTERM[soft,ovpn-d
co: ping expired] received, client-instance exiting
dco_get_peer: peer-id -1
ovpn-dco: received netlink message type=32 cmd=5 flags=0000
ovpn-dco: received CMD_PEER_DEL_NTF, ifindex: 8479, peer-id 6, reason: 2
gremlin3512-5/udp6:[2001:608:0:814::f000:21]:22841 peer-id=6 SIGTERM[soft,ovpn-d
co: ping expired] received, client-instance exiting
dco_get_peer: peer-id -1
ovpn-dco: received netlink message type=32 cmd=3 flags=0x0002
ovpn_handle_peer: parsing message for peer 2...
so it seems the "CMD_PEER_DEL_NTF" triggers a "dco_get_peer()" call (get last stats values?), and since we're actually busy reading from netlink, this might explain why it eventually overflows (Object busy) and even OOMs.
#0 dco_get_peer_stats_multi (dco=0x55c463febc78,
raise_sigusr1_on_err=raise_sigusr1_on_err@entry=false) at dco_linux.c:1223
#1 0x000055c4291b8390 in setenv_stats (c=0x55c4640b7b80, m=0x7ffc1fbe0850)
at multi.c:515
#2 multi_client_disconnect_setenv (mi=0x55c4640b79b0, m=0x7ffc1fbe0850)
at multi.c:532
#3 multi_client_disconnect_script (m=m@entry=0x7ffc1fbe0850,
mi=mi@entry=0x55c4640b79b0) at multi.c:541
#4 0x000055c4291baf63 in multi_close_instance (m=m@entry=0x7ffc1fbe0850,
mi=mi@entry=0x55c4640b79b0, shutdown=shutdown@entry=false) at multi.c:646
#5 0x000055c4291bb6fd in multi_close_instance_on_signal (
m=m@entry=0x7ffc1fbe0850, mi=mi@entry=0x55c4640b79b0) at multi.c:3211
#6 0x000055c4291bbb15 in multi_signal_instance (sig=15, mi=0x55c4640b79b0,
m=0x7ffc1fbe0850) at multi.c:3220
#7 process_incoming_del_peer (dco=0x55c463febc78, mi=0x55c4640b79b0,
m=0x7ffc1fbe0850) at multi.c:3263
#8 multi_process_incoming_dco (dco=0x55c463febc78) at multi.c:3289
#9 0x000055c42919262d in ovpn_handle_msg (msg=<optimized out>,
arg=0x55c463febc78) at dco_linux.c:1144
#10 0x00007f1f091fb2a1 in nl_recvmsgs_report ()
from /lib/x86_64-linux-gnu/libnl-3.so.200
#11 0x00007f1f091fb78d in nl_recvmsgs ()
from /lib/x86_64-linux-gnu/libnl-3.so.200
#12 0x000055c4291919f0 in ovpn_nl_recvmsgs (
prefix=prefix@entry=0x55c429272490 <__func__.21701> "dco_read_and_process", dco=<optimized out>, dco=<optimized out>) at dco_linux.c:130
#13 0x000055c429193e2a in dco_read_and_process (dco=0x55c463febc78)
at dco_linux.c:1159
#14 0x000055c4291c0e20 in multi_io_process_io (m=m@entry=0x7ffc1fbe0850)
at multi_io.c:508
#15 0x000055c4291bd938 in tunnel_server_loop (multi=0x7ffc1fbe0850)
at multi.c:4200
this is not healthy...
So, trying with a very simplistic "lock DCO against re-entry"
@@ -1151,12 +1151,17 @@ ovpn_handle_msg(struct nl_msg *msg, void *arg)
return NL_OK;
}
+bool _is_locked = false;
int
dco_read_and_process(dco_context_t *dco)
{
- msg(D_DCO_DEBUG, __func__);
+ msg(D_DCO_DEBUG, "%s START", __func__);
+ _is_locked = true;
- return ovpn_nl_recvmsgs(dco, __func__);
+ int ret = ovpn_nl_recvmsgs(dco, __func__);
+ msg(D_DCO_DEBUG, "%s END (ret=%d)", __func__, ret);
+ _is_locked = false;
+ return ret;
}
static int
dco_get_peer(dco_context_t *dco, int peer_id, const bool raise_sigusr1_on_err)
{
ASSERT(dco);
+ if (_is_locked)
+ {
+ msg(M_WARN, "%s: we're locked", __func__);
+ return 0;
+ }
+
it will now log the expected "we're locked" and no unexpected netlink thingies...
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_read_and_process START
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: ovpn-dco: received netlink message type=32 cmd=5 flags=0000
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: ovpn-dco: received CMD_PEER_DEL_NTF, ifindex: 8517, peer-id 8, reason: 2
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: gremlin54433-9/udp6:194.97.140.21:25421 peer-id=8 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_get_peer_stats_multi
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_get_peer: we're locked
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: register signal: SIGTERM (close_context)
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_read_and_process END (ret=0)
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_read_and_process START
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: ovpn-dco: received netlink message type=32 cmd=5 flags=0000
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: ovpn-dco: received CMD_PEER_DEL_NTF, ifindex: 8517, peer-id 19, reason: 2
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: gremlin54433-20/udp6:[2001:608:0:814::f000:21]:39091 peer-id=19 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_get_peer_stats_multi
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_get_peer: we're locked
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: register signal: SIGTERM (close_context)
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_read_and_process END (ret=0)
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_read_and_process START
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: ovpn-dco: received netlink message type=32 cmd=5 flags=0000
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: ovpn-dco: received CMD_PEER_DEL_NTF, ifindex: 8517, peer-id 36, reason: 2
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: gremlin54433-37/udp6:194.97.140.21:18566 peer-id=36 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_get_peer_stats_multi
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_get_peer: we're locked
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: register signal: SIGTERM (close_context)
Nov 27 19:22:41 ubuntu2004 tun-udp-p2mp-username-cn[1571367]: dco_read_and_process END (ret=0)
... but actually this does log something unexpected - now we're back to "one message per call to dco_read_and_process(). that was a bit unexpected...
So gerrit 1403 v8+ fixes this one, by ensuring that dco_get_peer() is never called while dco_read_and_process() is active (locking dco).
This is not the best solution - much better would be to have unsolicited counter updates on peer exit, so we do not need to query DCO in setenv_stats() at all. To be worked on...