kamailio
kamailio copied to clipboard
Unexpected kamailio 5.4 segfault related to dialog variables
Dear all we have seen a core in a running production kamailio instance today, i attach you the bt in dropbox version: kamailio 5.4.4 (x86_64/linux) 0ddb01-dirty centos7 3.10.0-693.17.1.el7.x86_64
https://www.dropbox.com/s/fpunfs0o9aqhwww/kamailio_core_17agosto.txt?dl=0
#0 0x00007f4c131446ad in print_lists (dlg=0x7f4b60abb2e8) at dlg_var.c:277
#1 0x00007f4c131467ff in pv_set_dlg_variable (msg=0x7f4c17592d70, param=0x7f4c17aa88d0, op=254, val=0x7ffc8de8c990) at dlg_var.c:443
#2 0x00000000004f9c82 in lval_pvar_assign (h=0x7ffc8de8ddc0, msg=0x7f4c17592d70, lv=0x7f4c17aa8998, rv=0x7f4c17aa8b18) at core/lvalue.c:352
#3 0x00000000004fa780 in lval_assign (h=0x7ffc8de8ddc0, msg=0x7f4c17592d70, lv=0x7f4c17aa8998, rve=0x7f4c17aa8b10) at core/lvalue.c:400
#4 0x00000000004c3f6f in do_action (h=0x7ffc8de8ddc0, a=0x7f4c17aa92b0, msg=0x7f4c17592d70) at core/action.c:1455
#5 0x00000000004c5cab in run_actions (h=0x7ffc8de8ddc0, a=0x7f4c17aa8668, msg=0x7f4c17592d70) at core/action.c:1581
#6 0x00000000004b8867 in do_action (h=0x7ffc8de8ddc0, a=0x7f4c17aac2b0, msg=0x7f4c17592d70) at core/action.c:1067
#7 0x00000000004c5cab in run_actions (h=0x7ffc8de8ddc0, a=0x7f4c17aa34c0, msg=0x7f4c17592d70) at core/action.c:1581
#8 0x00000000004c63ef in run_top_route (a=0x7f4c17aa34c0, msg=0x7f4c17592d70, c=0x7ffc8de8ddc0) at core/action.c:1666
#9 0x00007f4c16bbc0a2 in reply_received (p_msg=0x7f4c17592d70) at t_reply.c:2543
#10 0x00000000005d0b69 in do_forward_reply (msg=0x7f4c17592d70, mode=0) at core/forward.c:764
#11 0x00000000005d283b in forward_reply (msg=0x7f4c17592d70) at core/forward.c:865
#12 0x0000000000622dd2 in receive_msg (
buf=0xb034a0 <buf.7140> "SIP/2.0 100 Trying\r\nVia: SIP/2.0/UDP 79.170.68.186;branch=z9hG4bK895d.0e2ee1cb7a0cde9ed9e78d7f0ec22c00.0\r\nVia: SIP/2.0/UDP 79.170.71.172:5080;received=79.170.71.172;rport=5080;branch=z9hG4bKtmtBy7jaQ5"..., len=449, rcv_info=0x7ffc8de8e8b0) at core/receive.c:509
#13 0x0000000000521e67 in udp_rcv_loop () at core/udp_server.c:543
#14 0x0000000000429f56 in main_loop () at main.c:1689
#15 0x0000000000434034 in main (argc=13, argv=0x7ffc8de8f148) at main.c:2876
Best regards, Noel
Thanks for the report. From the backtrace it looks related to dialog variables in reply handling. Do you have some special operations there in your cfg?
hello Henning
for a 100 trying which seems the message that caused the error we do tome dlg var assignments
$dlg_var(trying_time) = $TV(sn);
$var(response) = $null;
jansson_set("real", "ts", "$TV(Sn)", "$var(response)");
jansson_set("int", "status", "$T_reply_code", "$var(response)");
jansson_append("obj","","$var(response)","$dlg_var(response_list)");
we use the dlg variables quite a lot in general, but those are the operations to be made with a 100trying in general
Best regards, David
hello all
today we had another core on another kamailio instance, on different function, but also seen something like a loop with the do_action and run_actions functions.
#0 0x00007f75446930c0 in get_profile_size (profile=0x7f747e6d29d8, value=0x7fff6dc6c7c0) at dlg_profile.c:853
#1 0x00007f75446fbe3a in w_get_profile_size_helper (msg=0x7f75499bbb60, profile=0x7f747e6d29d8, value=0x7fff6dc6c7c0, spd=0x7f754a289bd0) at dialog.c:941
#2 0x00007f75446fc62c in w_get_profile_size3 (msg=0x7f75499bbb60, profile=0x7f747e6d29d8 "\b+m~t\177", value=0x7f754a289b48 "0\314/Ju\177", result=0x7f754a289bd0 "\017") at dialog.c:982
#3 0x00000000004660fa in do_action (h=0x7fff6dc6d0a0, a=0x7f754a2fed30, msg=0x7f75499bbb60) at core/action.c:1094
#4 0x00000000004732fe in run_actions (h=0x7fff6dc6d0a0, a=0x7f754a2fed30, msg=0x7f75499bbb60) at core/action.c:1581
#5 0x000000000047397a in run_actions_safe (h=0x7fff6dc6f7d0, a=0x7f754a2fed30, msg=0x7f75499bbb60) at core/action.c:1645
#6 0x0000000000439cbb in rval_get_int (h=0x7fff6dc6f7d0, msg=0x7f75499bbb60, i=0x7fff6dc6d9cc, rv=0x7f754a2ff3c0, cache=0x0) at core/rvalue.c:915
#7 0x000000000043e7ec in rval_expr_eval_int (h=0x7fff6dc6f7d0, msg=0x7f75499bbb60, res=0x7fff6dc6d9cc, rve=0x7f754a2ff3b8) at core/rvalue.c:1913
#8 0x0000000000465a2c in do_action (h=0x7fff6dc6f7d0, a=0x7f754a301c28, msg=0x7f75499bbb60) at core/action.c:1052
#9 0x00000000004732fe in run_actions (h=0x7fff6dc6f7d0, a=0x7f754a301c28, msg=0x7f75499bbb60) at core/action.c:1581
#10 0x000000000046269b in do_action (h=0x7fff6dc6f7d0, a=0x7f754a2de750, msg=0x7f75499bbb60) at core/action.c:700
#11 0x00000000004732fe in run_actions (h=0x7fff6dc6f7d0, a=0x7f754a2de518, msg=0x7f75499bbb60) at core/action.c:1581
#12 0x0000000000465eba in do_action (h=0x7fff6dc6f7d0, a=0x7f754a2e00d8, msg=0x7f75499bbb60) at core/action.c:1067
#13 0x00000000004732fe in run_actions (h=0x7fff6dc6f7d0, a=0x7f754a2d9558, msg=0x7f75499bbb60) at core/action.c:1581
#14 0x0000000000465eba in do_action (h=0x7fff6dc6f7d0, a=0x7f754a2e0228, msg=0x7f75499bbb60) at core/action.c:1067
#15 0x00000000004732fe in run_actions (h=0x7fff6dc6f7d0, a=0x7f754a28bc90, msg=0x7f75499bbb60) at core/action.c:1581
#16 0x0000000000473a42 in run_top_route (a=0x7f754a28bc90, msg=0x7f75499bbb60, c=0x7fff6dc6f7d0) at core/action.c:1666
#17 0x00007f7548bc60a2 in reply_received (p_msg=0x7f75499bbb60) at t_reply.c:2543
#18 0x000000000056b5cd in do_forward_reply (msg=0x7f75499bbb60, mode=0) at core/forward.c:764
#19 0x000000000056d29f in forward_reply (msg=0x7f75499bbb60) at core/forward.c:865
#20 0x00000000005da5d8 in receive_msg (
buf=0xabec00 <buf.7140> "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 79.170.64.152;branch=z9hG4bK7df8.906db024287203657d118c15dcefd4e0.0\r\nVia: SIP/2.0/UDP 81.52.191.129:5060;branch=z9hG4bK0cBe4906599067bc4c6\r\nRecord-Route: <sip:79.170.6"..., len=1101, rcv_info=0x7fff6dc702c0) at core/receive.c:509
#21 0x00000000004cb2a8 in udp_rcv_loop () at core/udp_server.c:543
#22 0x0000000000429f56 in main_loop () at main.c:1689
#23 0x0000000000434034 in main (argc=15, argv=0x7fff6dc70b58) at main.c:2876
this indicates something like the process is busy with some other task? On the 200OK that caused the segfault on its related process, we called rtpengine_manage jsut before execute get_profile_size. It doesnt seem the rtpengine had any issue or was delaying the response to other requets from kamailio. Might this issue be related to the get_profile_size or maybe to the rtpengine_manage functions?
This last core might be related to the core which opened this issue and also the one which opened #3040?
Let me add the full backtrace core_mad_talos_1_btfull_26march2022.log
Please let me know if you might need something else regards david
Having the same issue, Kamailio is crashing randomly when is setting a dlg variable:
(gdb) bt
#0 0x00007f29e36314e5 in print_lists (dlg=0x7f29ee8d8830) at dlg_var.c:277
#1 0x00007f29e3633637 in pv_set_dlg_variable (msg=0x7f2a2a477ad8, param=0x7f2a2a3ef530, op=254, val=0x7ffe0d36a030)
at dlg_var.c:443
#2 0x00000000004b44db in lval_pvar_assign (h=0x7ffe0d36b4f0, msg=0x7f2a2a477ad8, lv=0x7f2a2a3ef5f8, rv=0x7f2a2a3f0028)
at core/lvalue.c:352
#3 0x00000000004b4fd9 in lval_assign (h=0x7ffe0d36b4f0, msg=0x7f2a2a477ad8, lv=0x7f2a2a3ef5f8, rve=0x7f2a2a3f0020)
at core/lvalue.c:400
#4 0x0000000000480500 in do_action (h=0x7ffe0d36b4f0, a=0x7f2a2a3ef940, msg=0x7f2a2a477ad8) at core/action.c:1458
#5 0x000000000048223c in run_actions (h=0x7ffe0d36b4f0, a=0x7f2a2a3ef940, msg=0x7f2a2a477ad8) at core/action.c:1584
#6 0x00000000004715d9 in do_action (h=0x7ffe0d36b4f0, a=0x7f2a2a3824c8, msg=0x7f2a2a477ad8) at core/action.c:703
#7 0x000000000048223c in run_actions (h=0x7ffe0d36b4f0, a=0x7f2a2a3824c8, msg=0x7f2a2a477ad8) at core/action.c:1584
#8 0x0000000000482980 in run_top_route (a=0x7f2a2a3824c8, msg=0x7f2a2a477ad8, c=0x7ffe0d36b4f0) at core/action.c:1669
#9 0x00000000005e68fd in receive_msg (
buf=0xad3c40 <buf.7141> "SIP/2.0 100 Trying\r\nVia: SIP/2.0/UDP 192.168.1.2;branch=z9hG4bKfacc.56b09f180a8f8bc7bfa1a859c6f1661e.0;i=9a3cc1;received=192.168.1.2;rport=5060\r\nVia: SIP/2.0/TCP 192.168.1.1;branch=z9hG4bKfacc.05601fa50c"..., len=1017,
rcv_info=0x7ffe0d36b990) at core/receive.c:493
#10 0x00000000004dc6b5 in udp_rcv_loop () at core/udp_server.c:543
#11 0x0000000000429f6b in main_loop () at main.c:1711
#12 0x000000000043424e in main (argc=13, argv=0x7ffe0d36c228) at main.c:2942
(gdb) info locals
varlist = 0x7574617473706973
__FUNCTION__ = "print_lists"
(gdb) list
272 while (varlist) {
273 LM_DBG("%.*s=%.*s (flags %i)\n",
274 varlist->key.len, varlist->key.s,
275 varlist->value.len, varlist->value.s,
276 varlist->vflags);
277 varlist = varlist->next;
278 }
279 }
280 }
281
(gdb) bt 1
#0 0x00007f29e36314e5 in print_lists (dlg=0x7f29ee8d8830) at dlg_var.c:277
(More stack frames follow...)
(gdb) bt 2
#0 0x00007f29e36314e5 in print_lists (dlg=0x7f29ee8d8830) at dlg_var.c:277
#1 0x00007f29e3633637 in pv_set_dlg_variable (msg=0x7f2a2a477ad8, param=0x7f2a2a3ef530, op=254, val=0x7ffe0d36a030)
at dlg_var.c:443
(More stack frames follow...)
version: kamailio 5.4.8 (x86_64/linux) 053b3d OS: CentOS 7 3.10.0-1160.45.1.el7.x86_64
Still happening to me, sometimes twice per day, and I cannot understand why, @miconda, @henningw - when you'll have some time could you please look into that, more details are below, I hope that this will help:
Core was generated by `/usr/local/sbin/kamailio -P /var/run/kamailio/kamailio.pid -f /usr/local/etc/ka'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fd96e8684e5 in print_lists (dlg=0x7fd97c996850) at dlg_var.c:277
277 varlist = varlist->next;
(gdb) bt
#0 0x00007fd96e8684e5 in print_lists (dlg=0x7fd97c996850) at dlg_var.c:277
#1 0x00007fd96e86a637 in pv_set_dlg_variable (msg=0x7fd9b56ad978, param=0x7fd9b5268db8, op=254, val=0x7ffc5b9dd850) at dlg_var.c:443
#2 0x00000000004b44db in lval_pvar_assign (h=0x7ffc5b9df4d0, msg=0x7fd9b56ad978, lv=0x7fd9b5268e80, rv=0x7fd9b52691b0) at core/lvalue.c:352
#3 0x00000000004b4fd9 in lval_assign (h=0x7ffc5b9df4d0, msg=0x7fd9b56ad978, lv=0x7fd9b5268e80, rve=0x7fd9b52691a8) at core/lvalue.c:400
#4 0x0000000000480500 in do_action (h=0x7ffc5b9df4d0, a=0x7fd9b52698d8, msg=0x7fd9b56ad978) at core/action.c:1458
#5 0x000000000048223c in run_actions (h=0x7ffc5b9df4d0, a=0x7fd9b5268b50, msg=0x7fd9b56ad978) at core/action.c:1584
#6 0x0000000000474df8 in do_action (h=0x7ffc5b9df4d0, a=0x7fd9b526d058, msg=0x7fd9b56ad978) at core/action.c:1070
#7 0x000000000048223c in run_actions (h=0x7ffc5b9df4d0, a=0x7fd9b5267340, msg=0x7fd9b56ad978) at core/action.c:1584
#8 0x00000000004715d9 in do_action (h=0x7ffc5b9df4d0, a=0x7fd9b5468870, msg=0x7fd9b56ad978) at core/action.c:703
#9 0x000000000048223c in run_actions (h=0x7ffc5b9df4d0, a=0x7fd9b5467f98, msg=0x7fd9b56ad978) at core/action.c:1584
#10 0x0000000000482980 in run_top_route (a=0x7fd9b5467f98, msg=0x7fd9b56ad978, c=0x7ffc5b9df4d0) at core/action.c:1669
#11 0x00007fd97068d1e5 in reply_received (p_msg=0x7fd9b56ad978) at t_reply.c:2543
#12 0x0000000000573f3d in do_forward_reply (msg=0x7fd9b56ad978, mode=0) at core/forward.c:764
#13 0x0000000000575c23 in forward_reply (msg=0x7fd9b56ad978) at core/forward.c:865
#14 0x00000000005e6f0b in receive_msg (buf=0xad3c40 <buf.7141> "SIP/2.0 487 Request Terminated\r\nTo: <sip:[email protected]:5060>;tag=cedbb55fde9a2d72i0\r\nFrom: \"022201201\" <sip:[email protected]>;tag=as080fece1\r\nCall-ID: 3c70f1ad5617828309e112"..., len=622,
rcv_info=0x7ffc5b9dffc0) at core/receive.c:509
#15 0x00000000004dc6b5 in udp_rcv_loop () at core/udp_server.c:543
#16 0x0000000000429f6b in main_loop () at main.c:1711
#17 0x000000000043424e in main (argc=13, argv=0x7ffc5b9e0858) at main.c:2942
(gdb) frame 0
#0 0x00007fd96e8684e5 in print_lists (dlg=0x7fd97c996850) at dlg_var.c:277
277 varlist = varlist->next;
(gdb) list
272 while (varlist) {
273 LM_DBG("%.*s=%.*s (flags %i)\n",
274 varlist->key.len, varlist->key.s,
275 varlist->value.len, varlist->value.s,
276 varlist->vflags);
277 varlist = varlist->next;
278 }
279 }
280 }
281
(gdb) info locals
varlist = 0x7574617473706973
__FUNCTION__ = "print_lists"
(gdb) p varlist
$1 = (struct dlg_var *) 0x7574617473706973
(gdb) p *varlist
Cannot access memory at address 0x7574617473706973
(gdb) frame 1
#1 0x00007fd96e86a637 in pv_set_dlg_variable (msg=0x7fd9b56ad978, param=0x7fd9b5268db8, op=254, val=0x7ffc5b9dd850) at dlg_var.c:443
443 print_lists(dlg);
(gdb) list
438 /* dlg_lock() / dlg_unlock() are reentrant */
439 update_dialog_dbinfo(dlg);
440 }
441 dlg_unlock(d_table, &(d_table->entries[dlg->h_entry]));
442 }
443 print_lists(dlg);
444
445 dlg_release(dlg);
446 return 0;
447 error:
(gdb) info locals
dlg = 0x7fd97c996850
ret = 0
__FUNCTION__ = "pv_set_dlg_variable"
(gdb) p dlg
$2 = (dlg_cell_t *) 0x7fd97c996850
(gdb) p *dlg
$3 = {ref = 2, next = 0x0, prev = 0x0, h_id = 7212, h_entry = 1387, state = 2, lifetime = 10800, init_ts = 1655889534, start_ts = 0, end_ts = 0, dflags = 512, iflags = 0, sflags = 0, toroute = 0, toroute_name = {s = 0x0, len = 0}, from_rr_nb = 0, tl = {next = 0x0, prev = 0x0, timeout = 0},
callid = {s = 0x7fd97c9969c0 "[email protected]", len = 55}, from_uri = {s = 0x7fd97c9969f8 "sip:[email protected]", len = 37}, to_uri = {s = 0x7fd97c996a1e "sip:[email protected]:5060", len = 31}, req_uri = {
s = 0x7fd97c996a3e "sip:[email protected]:48452", len = 34}, tag = {{s = 0x7fd9787d7f28 "as080fece1.64", len = 10}, {s = 0x0, len = 0}}, cseq = {{s = 0x7fd97fa7aa10 "102", len = 3}, {s = 0x0, len = 0}}, route_set = {{s = 0x7fd979276e50 "log", len = 0}, {s = 0x0, len = 0}},
contact = {{s = 0x7fd979731df8 "sip:[email protected]:50601\300\300\300\300", len = 31}, {s = 0x0, len = 0}}, bind_addr = {0x7fd9b51d6810, 0x0}, cbs = {first = 0x0, types = 0}, profile_links = 0x0, vars = 0x7fd97fea2830, ka_src_counter = 0, ka_dst_counter = 0}
(gdb)
(gdb) p full_version
$4 = "kamailio 5.4.8 (x86_64/linux) 604dc4"
(gdb)
I haven't implemented the dialog variables, nor use them in my deployments, but given several reports with issues related to them, I looked once quickly over them and I think they are exposed to races. I made a similar remark to a recent PR that was using dialog variables,
Unfortunately I had no time to dig in further, too many constraints with work and family due to pandemic and no deployment that reproduces similar crashes (again, I don't use dialog variables). Near future does not look better in terms of available time. If you or someone else can do it, review the code related to dialog variables. I think getting a variable just references to the share memory location, without any other protection (or cloning).
Alternative, for a while, is to see if you can replace dialog variables with something else, maybe you can use htable, and same same results.
I pushed some changes to dialog module related to how dialog variables are accessed, they are backported to 5.6. If there are still crashes with this new version, open another issue with backtrace corresponding to the updated code.
thanks Daniel
we will try it
Hello Daniel,
we tested it but seems there is an issue still with latest 5.6 to handle stirng to long conversion. So we will wait until it's addressed. By the way, those changes you made on the dialog module, could they be backported to other versions like 5.5 or 5.4?
thanks a lot and regards david