Internal Interface suddenly freezes
From time to time, at irregular intervals, the internal Sofia profile freezes and stops responding to REGISTER / SUBSCRIBE requests. Sometimes it happens once in two days, sometimes 3 or even 4 times a day; there seems to be no time pattern. The interface appears to be completely frozen, and "sofia profile internal restart" also gets stuck once executed via fs_cli.
There is nothing in the logs that would help solve the problem. When a REGISTER request comes in, my register Lua script normally logs something to the console, and that logging simply stops when this happens. After some time, without restarting, it just starts working again, but I've seen that only once, after about two hours.
The only solution is to restart FreeSWITCH. After the restart everything works as expected.
The system is configured like this:
- The two standard interfaces, external and internal. External handles the provider connection, and internal handles the clients, including register / presence and so on.
- Client authentication is done via Lua, which in essence connects to a database, fetches the user data and returns the directory XML (configured in lua.conf.xml; a rough sketch of that wiring is shown below this list).
- The dialplan is also handled via Lua (configured in the public.xml & default.xml dialplans).
- The internal interface uses a Let's Encrypt cert.
- The database connection is done via ODBC (MySQL on Debian 8 & MariaDB on Debian 10).
- The register Lua script, the dialplan and also the FreeSWITCH core use this database.
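For context, the directory hook is wired roughly like this; a minimal sketch, assuming register.lua as the script name (the real script looks up the user in the database via ODBC and returns the directory XML):

```xml
<!-- autoload_configs/lua.conf.xml -->
<configuration name="lua.conf" description="LUA Configuration">
  <settings>
    <!-- hand every directory lookup (registration auth etc.) to the Lua script -->
    <param name="xml-handler-script" value="register.lua"/>
    <param name="xml-handler-bindings" value="directory"/>
  </settings>
</configuration>
```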
What I've done so far:
- Set accept-blind-reg to true, to rule out any problems with the Lua file that gets called upon registration.
- Tested on 3 different physical servers (no virtualisation) with different hardware.
- Tested with these OS/software combinations: Debian 10 + FS 1.10.3, Debian 8 + FS 1.4.18.
- Put the database on the same host, to rule out a network issue.
- Disabled TLS completely, to rule out problems with Let's Encrypt.
- Checked system health parameters while this is happening (CPU / disk / MySQL process list / active registrations & subscriptions etc.) -> nothing special ...
The system so far handles no more than 200 clients. It also happens on a development system, which has < 20 clients and practically 2 concurrent calls max.
The strange thing is, we have production systems with FS 1.10.3 and also 1.4.18 which work like a charm, but do entirely different things (like no TLS, no registrations): one is used as an MS Teams gateway and the other as a media node behind a Kamailio proxy. Both have no issues at all, so my guess is it has something to do with registrations or my dialplan.
Sadly I do not have a backtrace, but I will provide one as soon as I have the chance to. Thanks in advance!
Regards, Stefan
Just FYI, I run with a Lua dialplan and directory and it seems fine, but I use PostgreSQL instead of MySQL.
There's some watchdog config in the Sofia profile conf, did you try that?
Hi,
Yes, I tried that in the beginning.
Somewhere along the way I commented out the watchdog section entirely.
<!--
<param name="watchdog-enabled" value="no"/>
<param name="watchdog-step-timeout" value="30000"/>
<param name="watchdog-event-timeout" value="30000"/>-->
However, I'm wondering now what the default behavior is if I comment out this section. "no" should be the default, right?
No, you should uncomment it and enable it with "yes".
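For reference, enabled it would look like this in the profile settings (the same params as the commented-out section above, just switched on; the 30-second timeouts are the values from that section):

```xml
<!-- sip_profiles/internal.xml, inside <settings> -->
<param name="watchdog-enabled" value="yes"/>
<param name="watchdog-step-timeout" value="30000"/>
<param name="watchdog-event-timeout" value="30000"/>
```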
OK, I've enabled the watchdog on our development server. But I'm curious as to how this could possibly fix it, because the docs say:
Sometimes, in extremely rare edge cases, the Sofia SIP stack may stop responding. These options allow you to enable and control a watchdog on the Sofia SIP stack so that if it stops responding for the specified number of milliseconds, it will cause FreeSWITCH to shut down immediately.
I understand that this is helpful in a cluster environment, to detect a node failure faster, but this is a standalone server, so it would just shut down FreeSWITCH.
Or am I missing something?
Thank you very much!
Hi,
It happened again a few minutes ago. This time I got the chance to get a backtrace. (I hope I did it right; I used this site as reference: https://signalwire.force.com/help/s/article/FreeSWITCH-Crash-Getting-a-Backtrace-From-a-Core-Dump)
Here is the pastebin: https://pastebin.freeswitch.org/view/d385aa0c
- I replaced every domain with sip.example.com
- I replaced every IP with xxx.xxx.xxx.xxx
I hope this helps :)
regards, Stefan
Hi,
Another thing I noticed today when it happened: after about 20 minutes the problem resolved itself. Registrations just began working again as if nothing ever happened.
Very strange ...
regards, Stefan
I am having the same issue, but with registrations via XML-Curl, so this is not only Lua related. However, setting caching for registrations (e.g. cacheable="600000" msecs, roughly as sketched below), so that not every registration attempt hits the XML-Curl API, didn't change anything. In our scenario, I have the feeling that lots of presence messages make the internal profile get stuck.
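(For reference, by caching I mean the cacheable attribute on the domain element in the directory document our XML-Curl service returns; a minimal sketch with placeholder domain, user and TTL:)

```xml
<document type="freeswitch/xml">
  <section name="directory">
    <!-- cacheable = TTL in milliseconds; FreeSWITCH reuses this answer instead of calling the API again -->
    <domain name="sip.example.com" cacheable="600000">
      <user id="1000">
        <params>
          <param name="password" value="secret"/>
        </params>
      </user>
    </domain>
  </section>
</document>
```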
And yes, I have the same behaviour: after some minutes the problem resolved itself.
This happens under high load with a lot of presence involved. Suddenly the FreeSWITCH internal profile stops responding, and after a while (some minutes) it responds again. During that time, no REGISTER requests are answered at all. Only a few XML-Curl requests are sent to the server, but they do not contain the user info (which is needed for registrations) and are related to mod_voicemail.c etc.
My debug log looks like this:
tport.c:3286 tport_tsend() tport_tsend(0x7f9c68005470) tpn = */xx.xxx.xxx.x:39672
tport.c:4075 tport_resolve() tport_resolve addrinfo = xx.xxx.xxx.x:39672
tport.c:4709 tport_by_addrinfo() tport_by_addrinfo(0x7f9c68005470): not found by name */xx.xxx.xxx.x:39672
tport.c:3623 tport_vsend() tport_vsend(0x7f9c68005470): 1065 bytes of 1065 to udp/xx.xxx.xxx.x:39672
tport.c:3521 tport_send_msg() tport_vsend returned 1065
nta.c:8348 outgoing_send() nta: sent NOTIFY (304240670) to */xx.xxx.xxx.x:39672
tport.c:4189 tport_pend() tport_pend(0x7f9c68005470): pending 0x7f9997dcf880 for udp/xxx.xx.x.xxx:5060 (already 124658)
nua_stack.c:569 nua_stack_signal() nua(0x7f9a0bd45ae0): recv signal r_notify
nua_params.c:484 nua_stack_set_params() nua: nua_stack_set_params: entering
soa.c:280 soa_clone() soa_clone(static::0x7f9c68001b90, 0x7f9c68001390, 0x7f9a0bd45ae0) called
soa.c:403 soa_set_params() soa_set_params(static::0x7f9997dd1c90, ...) called
soa.c:403 soa_set_params() soa_set_params(static::0x7f9997dd1c90, ...) called
nta.c:4446 nta_leg_tcreate() nta_leg_tcreate(0x7f9997dcd870)
nta.c:2694 nta_tpn_by_url() nta: selecting scheme sip
tport.c:3286 tport_tsend() tport_tsend(0x7f9c68005470) tpn = udp/xx.xxx.xxx.x:58694
tport.c:4075 tport_resolve() tport_resolve addrinfo = xx.xxx.xxx.x:58694
tport.c:4709 tport_by_addrinfo() tport_by_addrinfo(0x7f9c68005470): not found by name udp/xx.xxx.xxx.x:58694
tport.c:3623 tport_vsend() tport_vsend(0x7f9c68005470): 1114 bytes of 1114 to udp/xx.xxx.xxx.x:58694
tport.c:3521 tport_send_msg() tport_vsend returned 1114
nta.c:8348 outgoing_send() nta: sent NOTIFY (304240671) to udp/xx.xxx.xxx.x:58694
tport.c:4189 tport_pend() tport_pend(0x7f9c68005470): pending 0x7f9997dd2650 for udp/xxx.xx.x.xxx:5060 (already 124659)
Looks like some event queue is stuck?
We have the same issue on a development server that has about 20 clients connected to it, so I don't know if it's load-related.
Are you using ODBC/MariaDB for the core database by any chance?
regards,
Please use the latest FreeSWITCH. For example this was fixed in v1.10.6:
[Core] switch_core_port_allocator: Replace getaddrinfo() (may get stuck) with switch_sockaddr_new() and fix IPv6.
Also try using native mod_mariadb instead of the ODBC layer.
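A minimal sketch of that switch, with hostname, database name and credentials as placeholders (check the mod_mariadb documentation for the exact DSN options):

```xml
<!-- autoload_configs/modules.conf.xml: load the native MariaDB module -->
<load module="mod_mariadb"/>

<!-- autoload_configs/switch.conf.xml: point the core DB at it instead of ODBC -->
<param name="core-db-dsn" value="mariadb://Server=localhost;Database=freeswitch;Uid=freeswitch;Pwd=secret;"/>
```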
Are you using ODBC/MariaDB for the core database by any chance? Yes, we are using ODBC/MariaDB for the core database.
Native mod_mariadb created a lot of other problems, so we did not use it.
Today we had a severe incident again. REGISTERs were not answered anymore. The reason was a lot of presence traffic generated by some phones.
After applying the code below in "do_normal_probe" in "sofia_presence.c" (hint from Brian West), the problem went away.

```c
sql = switch_mprintf("select sip_registrations.sip_user, "
                     "sip_registrations.sub_host, "
                     "sip_registrations.status, "
                     "sip_registrations.rpid, "
                     "'', "
                     "sip_dialogs.uuid, "
                     "sip_dialogs.state, "
                     "sip_dialogs.direction, "
                     "sip_dialogs.sip_to_user, "
                     "sip_dialogs.sip_to_host, "
                     "sip_presence.status,"
                     "sip_presence.rpid,"
                     "sip_dialogs.presence_id, "
                     "sip_presence.open_closed,"
                     "'%q','%q','%q' "
                     "from sip_registrations "
                     "left join sip_dialogs on "
                     "sip_dialogs.hostname = sip_registrations.hostname and sip_dialogs.profile_name = sip_registrations.profile_name and "
                     "sip_dialogs.call_info_state != 'seized' and "
                     "(sip_dialogs.presence_id = sip_registrations.sip_user %q '@' %q sip_registrations.sub_host "
                     "or (sip_dialogs.sip_from_user = sip_registrations.sip_user "
                     "and sip_dialogs.sip_from_host = sip_registrations.sip_host)) "
                     "left join sip_presence on "
                     "sip_presence.hostname=sip_registrations.hostname and "
                     "sip_registrations.sip_user=sip_presence.sip_user and sip_registrations.orig_server_host=sip_presence.sip_host and "
                     "sip_registrations.profile_name=sip_presence.profile_name "
                     "where sip_registrations.hostname='%q' and sip_registrations.profile_name='%q' "
                     "and sip_registrations.sip_user='%q' and "
                     "(sip_registrations.orig_server_host='%q' or sip_registrations.sub_host='%q' "
                     ")",
                     dh.status, dh.rpid, switch_str_nil(sub_call_id),
                     switch_sql_concat(), switch_sql_concat(),
                     mod_sofia_globals.hostname, profile->name,
                     probe_euser, probe_host, probe_euser, probe_host, probe_host);
```
Hi stony,
Thanks very much for sharing the code!
You said you had an incident today, so I take it the code isn't tested long-term.
Making that change in the FreeSWITCH source requires you to recompile FreeSWITCH (or at least the module) and restart it, as far as I know. So are you sure it wasn't the restart that did the trick?
Because when we have that problem, restarting FreeSWITCH always helps ...
regards, Stefan
A restart did not help in our case, so I tried the code that Brian had shared with me some time ago. I compiled it, installed it, and after a restart it has worked so far. It seems to be a database issue: the original query may take too long against those tables and then everything gets stuck. The new query is less complex and faster.
I will keep you updated here.
Ahh okay. That's strange ...
But okay, let's see then. We are currently testing 1.10.6 with MySQL (not MariaDB). I will provide a backtrace when I have the chance.
regards,
I have also had similar issues. In order to figure out what's going on, I added SQL query timing.
See #1228 if you wanna apply that patch.
This week it occurred 12 times, and it's causing us big trouble by now. We are now running version 1.10.3 with the SQL patch stony provided. It also didn't help.
I managed to get a backtrace during the last incident while it was happening, this time with the right gdb options, as described here: https://freeswitch.org/confluence/display/FREESWITCH/Debugging
I read the whole thing, but I'm not that good a backtrace reader. Some things seem odd, although I don't really know what they mean, or whether they could cause this.
Here is the pastebin: https://pastebin.freeswitch.org/view/5eded8f0 (IPs/domains are redacted).
For example, there is something about "mod_sofia_shutdown" starting at line 5384. Another example starts at line 7315, which is about libcrypto/SQL.
Maybe someone can make something of it; I can't.
@skainzwnt issue 1250 is unrelated to your issue. Have you tried mod_mariadb with the latest FreeSWITCH?
Hi,
Sorry for the late answer. Yes, we've been running 1.10.7 for a week now and have not had a problem. I will report back once another week has passed, since it sometimes ran without issues for 2-3 weeks, but it looks good so far. We've also been running 1.10.6 since yesterday, but it's too soon to say anything about that.
On both of them we completely replaced ODBC with mod_mariadb.
Thank you for your help so far!
Hi,
It just happened on FreeSWITCH 1.10.6. Sadly the pastebin does not work (or maybe the backtrace is too long), so I used this: https://privatebin.net/?09fbf826bb924669#HTGoYHgR93v3DENczuMqrgXJ52Uubq5SeaPMqtGyDr6d
regards,
FYI: The backtrace takes a little while to load.
I found MANY occurrences of the following: sw_reg_host = 0x3 <error: Cannot access memory at address 0x3>
Is that something that could cause all this?
regards,
I split the backtrace into 3 parts: Part 1: https://pastebin.freeswitch.org/view/74adebb7 Part 2: https://pastebin.freeswitch.org/view/9eca944a Part 3: https://pastebin.freeswitch.org/view/9a5881d2
The privatebin has probably already expired.
Since Sofia only uses one thread by default, I have conducted a few tests, for example putting a sleep in my register.lua. As soon as one registration is stuck that way, no other registration after it gets a response. That changes when I use the following settings in sofia.conf.xml:
<param name="inbound-reg-in-new-thread" value="true"/>
<param name="max-reg-threads" value="16"/>
I'm using 16 threads now. My hope is that when one gets stuck, FreeSWITCH will use another one until the stuck one becomes responsive again. It seems to work with registrations, but I did not have time to also test subscriptions; I hope this will work too. Also, when this Sofia freeze happens naturally, I cannot restart the internal interface in fs_cli, it also gets stuck. But when I use my sleep method I can restart it ...
Additionally, I found the following discussion: https://stackoverflow.com/questions/53609817/freeswitch-blocked
So I now have a file called post_load_switch.conf.xml in the autoload_configs directory with the following settings:
<param name="events-use-dispatch" value="false"/>
<param name="initial-event-threads" value="6"/>
Let's see if that works ...
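(For reference, the full file is just those two params wrapped in the usual configuration/settings structure; a sketch along the lines of the stackoverflow discussion above, with the values still to be tuned:)

```xml
<!-- autoload_configs/post_load_switch.conf.xml -->
<configuration name="post_load_switch.conf" description="Core configuration overrides loaded after modules">
  <settings>
    <param name="events-use-dispatch" value="false"/>
    <param name="initial-event-threads" value="6"/>
  </settings>
</configuration>
```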
Hello! @skainzwnt did you resolve the freeze problem? I got the same error. Registrations and INVITEs sometimes freeze for 10-15 minutes, then things return to normal. Existing calls still work. I suspect it is related to the WSS transport (without it the system worked without freezing).
Hi,
No, I never resolved the issue. We now use Kamailio in combination with FreeSWITCH: Kamailio handles registrations and presence, FreeSWITCH the rest.
We don't really use WSS, but we do use TLS. And the strange thing is, from time to time Kamailio also dies. But Kamailio at least provides an error message: "generator:rand_pool_add:internal error".
However, this seems to be known: https://github.com/kamailio/kamailio/issues/2394
It's an incompatibility between OpenSSL 1.1 and our version of Kamailio. We plan to update Kamailio; this should solve the issue.
Maybe FreeSWITCH has the same problem with OpenSSL 1.1.
regards,
Hi, did anybody try FreeSWITCH with the latest version of OpenSSL? Does it solve the issue?