tinc
SegFault when using TunnelServer=yes
I have a network of about 800 nodes. The network is a mix of tinc 1.0 and 1.1 nodes. It has been gradually expanding for several years now.
The problem is that at some point it seems the daemon cannot handle the processing of new connections and edges.
There are 3 major nodes in the system and every other node initially makes a connection to one of them.
Now, after a lot of debugging, I've limited all nodes to connect to only one node, and I use iptables to admit new connections gradually. The last limit was 5 per minute.
I've started monitoring how the edges grow on the main node, and I see that although I've limited the connections on the other 2 major nodes, at some point there are rapid spikes in the number of edges when a new connection is established. So my guess is that the other nodes hold a previous state of the edges, and when they try to push it, the main nodes become overwhelmed.
So I've decided to set TunnelServer=yes on the major nodes so they don't propagate the connections to the other nodes.
However, I get a segfault soon after startup on every node where I enable that option.
I've built from the latest code and here is a trace of such a run (this is not from a "major" node, but the effect is the same):
Got ANS_KEY from Backbone (xxx.xxx.xxx.xxx port 655): 16 Office Lukav_Beast 52201D7CFDC2C7E1FD7871A36E651B7AC24A52B4ED892CD953397F6BA859AB22D5D4CB235B9CF85910B6BDE91A34C85E 427 672 4 0 yyy.yyy.yyy.yyy 13935
Using reflexive UDP address from Office: yyy.yyy.yyy.yyy port 13935
UDP address of Office set to yyy.yyy.yyy.yyy port 13935
Got REQ_KEY from Backbone (xxx.xxx.xxx.xxx port 655): 15 Office Lukav_Beast
Program received signal SIGSEGV, Segmentation fault.
0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
382 return send_request(to->nexthop->connection, "%d %s %s %s %d %d %d %d", ANS_KEY,
(gdb) bt
#0 0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
#1 0x000055555556e169 in req_key_h (c=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol_key.c:304
#2 0x000055555556a083 in receive_request (c=c@entry=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol.c:146
#3 0x000055555555e993 in receive_meta (c=c@entry=0x555555851be0) at meta.c:333
#4 0x00005555555603f9 in handle_meta_connection_data (c=c@entry=0x555555851be0) at net.c:304
#5 0x00005555555678c2 in handle_meta_io (data=0x555555851be0, flags=<optimized out>) at net_socket.c:520
#6 0x000055555555c60a in event_loop () at event.c:359
#7 0x00005555555607f2 in main_loop () at net.c:510
#8 0x0000555555559208 in main (argc=6, argv=<optimized out>) at tincd.c:558
(gdb) bt full
#0 0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
keylen = <optimized out>
key =
"527E64B1DB47F2F527ADF7F609498FFCB4807AEC3CD49697D3D8D870619BC537E1B7C403875D81FC608A8F6E00D06063\000\306\377\377\377\177\000\000\331\334VUUU",
'\000' <repeats 11 times>,
"*ֲ\322\316\000\305\000\000\000\000\000\000\000\000\340\033\205UUU\000\000\001\000\000\000\000\000\000\000P\316\377\377\377\177\000\000\267K\205UUU\000\000`\020\205UUU\000\000@\306\377\377\377\177\000\000i\341VUUU\000\000\000\000\000\000\377\177\000\000\000\000\000\000\000\000\000\000"...
#1 0x000055555556e169 in req_key_h (c=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol_key.c:304
from_name = "Office\000\061\071.130", '\000' <repeats 1003 times>...
to_name = "Lukav_Beast", '\000' <repeats 366 times>...
from = 0x555555851060
to = <optimized out>
reqno = 0
#2 0x000055555556a083 in receive_request (c=c@entry=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol.c:146
reqno = <optimized out>
#3 0x000055555555e993 in receive_meta (c=c@entry=0x555555851be0) at meta.c:333
result = <optimized out>
request = <optimized out>
inlen = 0
inbuf =
"a\354\357\063J\363{\346d\177\271\371;+\212\371zFDt\271\061\370\ao\373\326\035\255=Α\254\257:\245\322ү\vƦ\205\035\336?1\234\372\001\004\063\323\t\004-\b8\367\f\201\342\304g\332\361jL76C\340-\t\006\210\214\314,C\352)ͺa\314\fAe\260\226\313\337\360|\256\236\263\344\205\061\207\303\t<\016\351\360\222\343[\317o\377\065<Ή?b(\267\321\356\360\242p$\314`\325ʆ\001|\036\204'\\\205i\314W\356#N4\000q\320\300\344\071\060\236w\016\306[\323X]\237\321\347\177\313KU\367ޚ\b}\307\374\367\032c\036\332:\307\367\265o\307Ƒ\212J\006NJ3!\305q\367\255\263\246\200i\035\327͌\001"...
bufp = 0x7fffffffd6f0
"a\354\357\063J\363{\346d\177\271\371;+\212\371zFDt\271\061\370\ao\373\326\035\255=Α\254\257:\245\322ү\vƦ\205\035\336?1\234\372\001\004\063\323\t\004-\b8\367\f\201\342\304g\332\361jL76C\340-\t\006\210\214\314,C\352)ͺa\314\fAe\260\226\313\337\360|\256\236\263\344\205\061\207\303\t<\016\351\360\222\343[\317o\377\065<Ή?b(\267\321\356\360\242p$\314`\325ʆ\001|\036\204'\\\205i\314W\356#N4"
endp = <optimized out>
#4 0x00005555555603f9 in handle_meta_connection_data (c=c@entry=0x555555851be0) at net.c:304
No locals.
#5 0x00005555555678c2 in handle_meta_io (data=0x555555851be0, flags=<optimized out>) at net_socket.c:520
c = 0x555555851be0
socket_error = <optimized out>
len = <optimized out>
#6 0x000055555555c60a in event_loop () at event.c:359
node = 0x555555797dd8 <signalio+24>
next = 0x555555797dd8 <signalio+24>
io = 0x555555851d90
tv = <optimized out>
fds = <optimized out>
curgen = 7
diff = {tv_sec = 0, tv_usec = 512516}
n = <optimized out>
readable = {fds_bits = {256, 0 <repeats 15 times>}}
writable = {fds_bits = {0 <repeats 16 times>}}
#7 0x00005555555607f2 in main_loop () at net.c:510
sighup = {signum = 1, cb = 0x555555560480 <sighup_handler>,
data = 0x7fffffffe1a0, node = {next = 0x7fffffffe2a8, prev = 0x0,
parent = 0x7fffffffe2a8, left = 0x0, right = 0x0, data =
0x7fffffffe1a0}}
sigterm = {signum = 15, cb = 0x55555555f900 <sigterm_handler>,
data = 0x7fffffffe1f0, node = {next = 0x0, prev = 0x7fffffffe2f8,
parent = 0x7fffffffe2f8, left = 0x0, right = 0x0, data =
0x7fffffffe1f0}}
sigquit = {signum = 3, cb = 0x55555555f900 <sigterm_handler>,
data = 0x7fffffffe240, node = {next = 0x7fffffffe2f8,
prev = 0x7fffffffe2a8, parent = 0x7fffffffe2f8, left =
0x7fffffffe2a8, right = 0x0, data = 0x7fffffffe240}}
sigint = {signum = 2, cb = 0x55555555f900 <sigterm_handler>,
data = 0x7fffffffe290, node = {next = 0x7fffffffe258,
prev = 0x7fffffffe1b8, parent = 0x7fffffffe258, left =
0x7fffffffe1b8, right = 0x0, data = 0x7fffffffe290}}
sigalrm = {signum = 14, cb = 0x5555555605b0 <sigalrm_handler>,
data = 0x7fffffffe2e0, node = {next = 0x7fffffffe208,
prev = 0x7fffffffe258, parent = 0x0, left = 0x7fffffffe258,
right = 0x7fffffffe208, data = 0x7fffffffe2e0}}
#8 0x0000555555559208 in main (argc=6, argv=<optimized out>) at tincd.c:558
umbstr = <optimized out>
priority = 0x0
Any help is much appreciated, since my network is unusable at the moment.
@lukavia, which nodes are crashing? 1.1 nodes or 1.0 nodes? Do you have the exact version number of the nodes that are crashing?
@gsliepen, are 1.1 nodes meant to be compatible with 1.0 nodes?
Version: 1.1~pre17-1.2~bpo10+1
Some time has passed but I think the problem also exists with only 1.1 nodes.
However I've migrated to another VPN solution and cannot assist with tests anymore. Sorry.
Best of luck
This might be fixed by commit ed070d754d1b5500b0ec3615ae342178cfd42efb.
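For readers following the trace: frame #0 faults while evaluating to->nexthop->connection, and frame #1 shows that the node pointer itself (0x555555851060) is valid, so one plausible reading is that to->nexthop was NULL, for example because no route to Office was known on a node with TunnelServer=yes. This is an inference from the backtrace, not a confirmed diagnosis, and the sketch below is not the actual tinc code or the content of the commit mentioned above. The struct definitions, fake_send_request and send_ans_key_guarded are hypothetical, heavily simplified stand-ins that only illustrate checking the next hop before dereferencing it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for tinc's structures (assumption). */
typedef struct connection_t {
    const char *name;
} connection_t;

typedef struct node_t {
    const char *name;
    struct node_t *nexthop;    /* NULL when no route to this node is known */
    connection_t *connection;  /* meta connection, set only for direct peers */
} node_t;

/* Stand-in for send_request(): only reports what would be sent. */
static bool fake_send_request(connection_t *c, const char *what, const char *to_name) {
    printf("sending %s for %s via %s\n", what, to_name, c->name);
    return true;
}

/* Guarded variant: refuse to send instead of dereferencing a NULL next hop. */
static bool send_ans_key_guarded(node_t *to) {
    if (!to || !to->nexthop || !to->nexthop->connection) {
        fprintf(stderr, "no route to %s, not sending ANS_KEY\n",
                to ? to->name : "(null)");
        return false;
    }

    return fake_send_request(to->nexthop->connection, "ANS_KEY", to->name);
}
```

Calling send_ans_key_guarded() on a node whose nexthop is NULL returns false instead of crashing; once nexthop points at a directly connected node, the request goes out over that node's connection. What the caller should do with a false return (log and ignore, or drop the connection) is a separate design decision that this sketch does not settle.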