mod_event_socket can have dead listeners with overflowing event queues that fill the log with CRITs
Describe the bug
The mod_event_socket.c listen_list retains listeners for sockets that no longer exist, fills their queues with events, and then spams the logs when the queues overflow. Both messages continue to repeat even after flags has LFLAG_RUNNING unset. The only effect seems to be that the machine generates a continuous stream of CRIT events and rapidly fills up our logs. A sample machine logged the pair of CRIT messages below 12604 times in a five-minute period (4.1 MB of logging space). I suspect that if a listener is started for an fs_cli session and the client process is harshly killed off, the socket's death may never be noticed by freeswitch.
2023-11-09 22:02:19.732219 89.73% [CRIT] mod_event_socket.c:397 Event enqueue ERROR [11] | [Resource temporarily unavailable] | Queue size: [100000/100000] Max queue size reached
2023-11-09 22:02:19.732219 89.73% [CRIT] mod_event_socket.c:401 Killing listener because of too many lost events. Lost [5544469] Queue size[100000/100000]
(repeats)
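For context on what these two messages appear to imply: each event is pushed onto a bounded per-listener queue, a failed push bumps a lost-events counter, and past some threshold the module decides to "kill" the listener. Below is a minimal, self-contained sketch of that pattern; the names and constants (MAX_QUEUE_LEN, MAX_LOST_BEFORE_KILL, deliver_event, struct listener) are illustrative stand-ins I picked, not the module's actual code.
#include <stdbool.h>
#include <stdio.h>
/* Illustrative stand-ins; the real module uses its own queue type and limits. */
#define MAX_QUEUE_LEN 100000
#define MAX_LOST_BEFORE_KILL 5000
struct listener {
    unsigned int queue_len;    /* current depth of the event queue          */
    unsigned long lost_events; /* events dropped because the queue was full */
    bool running;              /* analogous to LFLAG_RUNNING                */
};
/* Non-blocking enqueue: fails when the queue is at capacity. */
static bool try_enqueue(struct listener *l)
{
    if (l->queue_len >= MAX_QUEUE_LEN) {
        return false;
    }
    l->queue_len++;
    return true;
}
/* Producer-side handling roughly matching the two CRIT messages above. */
static void deliver_event(struct listener *l)
{
    if (try_enqueue(l)) {
        return;
    }
    l->lost_events++;
    fprintf(stderr, "Event enqueue ERROR | Queue size: [%u/%u] Max queue size reached\n",
            l->queue_len, MAX_QUEUE_LEN);
    if (l->lost_events > MAX_LOST_BEFORE_KILL) {
        fprintf(stderr, "Killing listener because of too many lost events. Lost [%lu]\n",
                l->lost_events);
        l->running = false; /* analogous to clearing LFLAG_RUNNING; nothing here removes the listener */
    }
}
int main(void)
{
    /* A listener whose consumer never drains the queue (e.g. its socket is dead). */
    struct listener l = { .queue_len = MAX_QUEUE_LEN, .lost_events = 0, .running = true };
    for (int i = 0; i < 3; i++) {
        deliver_event(&l); /* every delivery now fails and logs the same pair of messages */
    }
    return 0;
}
If the consumer never drains the queue, every subsequent delivery repeats both messages, which matches the log flood we see.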
The listen_list linked list contains references to sockets that no longer exist. Per lsof, there is one open TCP connection, but per listen_list there are two open sockets:
$ lsof -nPp 13 | grep TCP
freeswitc 12 root 41u IPv4 1433856237 0t0 TCP 127.0.0.1:8021 (LISTEN)
freeswitc 12 root 47u IPv4 1444173034 0t0 TCP 127.0.0.1:8021->127.0.0.1:56810 (ESTABLISHED)
Per the following gdb dump, the listen_list struct has a non-running listener with a continuously overflowing queue of events:
- listen_list->listeners on TCP connection 8021->127.0.0.1:56810 with flags (LFLAG_AUTHED|LFLAG_EVENTS|LFLAG_FULL|LFLAG_ALLOW_LOG|LFLAG_RUNNING)
- listen_list->listeners->next on TCP connection 8021->127.0.0.1:56112 with flags (LFLAG_AUTHED|LFLAG_EVENTS|LFLAG_FULL|LFLAG_ALLOW_LOG)
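For reference, the two flags values in the dump below (65559 and 65557) decode to exactly those two flag sets. Here is a small stand-alone decoder, assuming the usual LFLAG_* bit order from mod_event_socket.c (LFLAG_AUTHED = 1<<0 up through LFLAG_ALLOW_LOG = 1<<16); treat the bit positions as an assumption to re-check against your source tree.
#include <stdio.h>
/* Bit positions assumed from mod_event_socket.c's listener flag enum; verify against your tree. */
enum {
    LFLAG_AUTHED    = (1 << 0),
    LFLAG_RUNNING   = (1 << 1),
    LFLAG_EVENTS    = (1 << 2),
    LFLAG_LOG       = (1 << 3),
    LFLAG_FULL      = (1 << 4),
    LFLAG_ALLOW_LOG = (1 << 16)
};
static void decode(unsigned int flags)
{
    printf("%u =%s%s%s%s%s%s\n", flags,
           (flags & LFLAG_AUTHED)    ? " LFLAG_AUTHED"    : "",
           (flags & LFLAG_RUNNING)   ? " LFLAG_RUNNING"   : "",
           (flags & LFLAG_EVENTS)    ? " LFLAG_EVENTS"    : "",
           (flags & LFLAG_LOG)       ? " LFLAG_LOG"       : "",
           (flags & LFLAG_FULL)      ? " LFLAG_FULL"      : "",
           (flags & LFLAG_ALLOW_LOG) ? " LFLAG_ALLOW_LOG" : "");
}
int main(void)
{
    decode(65559); /* live listener: AUTHED|RUNNING|EVENTS|FULL|ALLOW_LOG */
    decode(65557); /* dead listener: AUTHED|EVENTS|FULL|ALLOW_LOG (no RUNNING) */
    return 0;
}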
GDB dump (massaged for conciseness):
(gdb) p listen_list
$2 = {sock = 0x7f3224000e08, sock_mutex = 0x56136766fd90, listeners = 0x7f32246379a8, ready = 1 '\001'}
(gdb) p *listen_list->sock
$4 = {pool = 0x7f3224000c58, socketdes = 41, type = 1, protocol = 6, local_addr = 0x7f3224000d38, remote_addr = 0x7f3224000f18, timeout = -1, local_port_unknown = 0, local_interface_unknown = 0, remote_addr_unknown = 1, options = 16, inherit = 0, userdata = 0x0}
(gdb) p *listen_list->sock->local_addr
$6 = {pool = 0x7f3224000c58, hostname = 0x7f3224000df8 "127.0.0.1", servname = 0x0, port = 8021, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f3224000d7c, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 21791, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 21791, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\037U\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}
(gdb) p *listen_list->sock->remote_addr
$8 = {pool = 0x7f3224000c58, hostname = 0x0, servname = 0x0, port = 0, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f3224000f5c, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = '\000' <repeats 117 times>, __ss_align = 0}}}
(gdb) p *listen_list->listeners
$11 = {sock = 0x7f32246377b8, event_queue = 0x7f3224637bc0, log_queue = 0x7f3224720f58, pool = 0x7f32246376d8, format = EVENT_FORMAT_JSON, flag_mutex = 0x7f32247e4f68, filter_mutex = 0x7f32247e4fb8, flags = 65559, level = SWITCH_LOG_CONSOLE, ebuf = 0x0, event_list = '\001' <repeats 93 times>, allowed_event_list = '\000' <repeats 92 times>, event_hash = 0x7f3224002f70, allowed_event_hash = 0x0, allowed_api_hash = 0x0, rwlock = 0x7f3224637b60, session = 0x0, lost_events = 0, lost_logs = 0, last_flush = 0, expire_time = 0, timeout = 0, id = 0, sa = 0x7f32246378c8, remote_ip = "127.0.0.1", '\000' <repeats 40 times>, remote_port = 56810, filters = 0x0, linger_timeout = 0, next = 0x7f32244a7948, pollfd = 0x7f32247e50c0, lock_acquired = 0 '\000', finished = 0 '\000'}
(gdb) p *listen_list->listeners->sock
$13 = {pool = 0x7f32246376d8, socketdes = 47, type = 1, protocol = 6, local_addr = 0x7f3224637808, remote_addr = 0x7f32246378c8, timeout = -1, local_port_unknown = 0, local_interface_unknown = 0, remote_addr_unknown = 0, options = 520, inherit = 0, userdata = 0x0}
(gdb) p *listen_list->listeners->sock->local_addr
$15 = {pool = 0x7f32246376d8, hostname = 0x7f3224000df8 "127.0.0.1", servname = 0x0, port = 8021, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f322463784c, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 21791, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 21791, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\037U\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}
(gdb) p *listen_list->listeners->sock->remote_addr
$17 = {pool = 0x7f32246376d8, hostname = 0x0, servname = 0x0, port = 56810, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f322463790c, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 60125, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 60125, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\335\352\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}
(gdb) p *listen_list->listeners->next
$22 = {sock = 0x7f32244a7758, event_queue = 0x7f32244a7b60, log_queue = 0x7f32240c66b8, pool = 0x7f32244a7678, format = EVENT_FORMAT_JSON, flag_mutex = 0x7f322418a6c8, filter_mutex = 0x7f322418a718, flags = 65557, level = SWITCH_LOG_CONSOLE, ebuf = 0x0, event_list = '\001' <repeats 93 times>, allowed_event_list = '\000' <repeats 92 times>, event_hash = 0x7f3224002f30, allowed_event_hash = 0x0, allowed_api_hash = 0x0, rwlock = 0x7f32244a7b00, session = 0x0, lost_events = 5602791, lost_logs = 0, last_flush = 0, expire_time = 0, timeout = 0, id = 0, sa = 0x7f32244a7868, remote_ip = "127.0.0.1", '\000' <repeats 40 times>, remote_port = 56112, filters = 0x0, linger_timeout = 0, next = 0x0, pollfd = 0x7f322418a820, lock_acquired = 0 '\000', finished = 0 '\000'}
(gdb) p *listen_list->listeners->next->sock
$24 = {pool = 0x7f32244a7678, socketdes = -1, type = 1, protocol = 6, local_addr = 0x7f32244a77a8, remote_addr = 0x7f32244a7868, timeout = -1, local_port_unknown = 0, local_interface_unknown = 0, remote_addr_unknown = 0, options = 520, inherit = 0, userdata = 0x0}
(gdb) p *listen_list->listeners->next->sock->local_addr
$26 = {pool = 0x7f32244a7678, hostname = 0x7f3224000df8 "127.0.0.1", servname = 0x0, port = 8021, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f32244a77ec, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 21791, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 21791, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\037U\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}
(gdb) p *listen_list->listeners->next->sock->remote_addr
$28 = {pool = 0x7f32244a7678, hostname = 0x0, servname = 0x0, port = 56112, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f32244a78ac, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 12507, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 12507, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\333\060\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}
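One note when reading the address structs above: the sa.sin.sin_port values are in network byte order, so they only line up with the port fields after a byte swap (on a little-endian host). A quick sanity check in plain C, nothing FreeSWITCH-specific:
#include <arpa/inet.h>
#include <stdio.h>
int main(void)
{
    /* sin_port values from the dump, byte-swapped back to host order. */
    printf("%u\n", (unsigned) ntohs(21791)); /* 8021  - the ESL listen port          */
    printf("%u\n", (unsigned) ntohs(60125)); /* 56810 - the still-open client port   */
    printf("%u\n", (unsigned) ntohs(12507)); /* 56112 - the dead listener's client   */
    return 0;
}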
To Reproduce
Steps to reproduce the behavior:
- I am unsure what causes this, but if I wait long enough it reproduces on random machines in my production fleet.
Expected behavior
- Closing a socket should also remove the associated listener.
- Killing a listener should stop it from continuing to receive events.
Package version or git hash
- Version 1.10.10-release-24-4cb05e7f4a~64bit
from:
deb [signed-by=/usr/share/keyrings/signalwire-freeswitch-repo.gpg] https://freeswitch.signalwire.com/repo/deb/debian-release/ bullseye main
deb-src [signed-by=/usr/share/keyrings/signalwire-freeswitch-repo.gpg] https://freeswitch.signalwire.com/repo/deb/debian-release/ bullseye main
gdb dump repro script
#!/bin/bash
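# Dump mod_event_socket's listen_list (via gdb) from the freeswitch process running
# in the cx-freeswitch container, then list its TCP sockets with lsof.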
tmpdir=$(mktemp -d)
trap 'rm -rf $tmpdir' EXIT
cat <<'GDB_EOF' >$tmpdir/show_listen_list.gdb
define show_listener
p "*$arg0"
p *$arg0
p "*$arg0->sock"
p *$arg0->sock
p "$arg0->sock->local_addr"
p *$arg0->sock->local_addr
p "$arg0->sock->remote_addr"
p *$arg0->sock->remote_addr
end
define show_listeners
p "show_listener $arg0"
show_listener $arg0
p "if $arg0->next != 0"
if $arg0->next != 0
print "show_listeners $arg0->next"
show_listeners $arg0->next
end
end
p "listen_list"
p listen_list
p "listen_list->sock"
p *listen_list->sock
p "listen_list->sock->local_addr"
p *listen_list->sock->local_addr
p "listen_list->sock->remote_addr"
p *listen_list->sock->remote_addr
show_listeners listen_list->listeners
GDB_EOF
cat <<'SH_EOF' >$tmpdir/show_listen_list_gdb.sh
#!/bin/bash
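# Runs inside the container: install debug tooling, find the freeswitch PID that has
# mod_event_socket loaded, dump listen_list via the gdb script, and show its TCP sockets.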
apt-get update
apt-get install -qqy gdb lsof freeswitch-dbg libfreeswitch1-dbg freeswitch-mod-event-socket-dbg
for proc_dir in $(find /proc -maxdepth 1 -type d -name '[1-9]*'); do
if [[ /usr/bin/freeswitch == $(readlink -f $proc_dir/exe) ]] && grep -qF /usr/lib/freeswitch/mod/mod_event_socket.so $proc_dir/maps; then
pid=$(basename $proc_dir)
gdb -p $pid --batch -x /tmp/show_listen_list.gdb
lsof -nPp $pid | grep TCP
fi
done
SH_EOF
chmod +x $tmpdir/show_listen_list_gdb.sh
docker cp $tmpdir/show_listen_list.gdb cx-freeswitch:/tmp/show_listen_list.gdb
docker cp $tmpdir/show_listen_list_gdb.sh cx-freeswitch:/tmp/show_listen_list_gdb.sh
docker exec cx-freeswitch bash /tmp/show_listen_list_gdb.sh
lsof -nPp $(docker inspect -f '{{.State.Pid}}' cx-freeswitch) | grep TCP
~~PS, I do wonder if there's something about running freeswitch and/or fs_cli inside a container without an effective --init to do waitpid() that might cause a client-side failure to not signal to freeswitch that the socket had closed.~~
Even if the above is true, I would still expect the dead listener to eventually be removed.
On looking at the threads, it seems like the listener thread is deadlocked with some other threads on a session (a minimal sketch of this lock-ordering pattern follows the list):
- session->codec_init_mutex in switch_core_media_set_codec at switch_core_media.c:3607 for the listener thread for a uuid_answer on a send/receive thread (Thread 26)
- tech_pvt->sofia_mutex in sofia_receive_message at mod_sofia.c:1526 for &park on a sendonly eavesdrop channel (Thread 27)
- tech_pvt->sofia_mutex in sofia_receive_message at mod_sofia.c:1526 for setting up a media bug (Thread 28)
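As mentioned above, here is a minimal sketch of the lock-ordering pattern I believe we are hitting: two groups of threads taking two mutexes in opposite order. The names (codec_init_mutex, sofia_mutex) and the per-thread ordering are stand-ins for how I read the backtraces, not the actual FreeSWITCH call path.
/* Build: cc -pthread deadlock_sketch.c -o deadlock_sketch
 * This program intentionally hangs: each thread grabs one lock and then blocks
 * waiting for the lock the other thread already holds. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
/* Hypothetical stand-ins for session->codec_init_mutex and tech_pvt->sofia_mutex. */
static pthread_mutex_t codec_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sofia_mutex      = PTHREAD_MUTEX_INITIALIZER;
/* Roughly analogous to the uuid_answer path: codec init first, then sofia. */
static void *answer_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&codec_init_mutex);
    puts("answer: holding codec_init_mutex, waiting for sofia_mutex...");
    sleep(1);                              /* widen the race window */
    pthread_mutex_lock(&sofia_mutex);      /* blocks forever        */
    pthread_mutex_unlock(&sofia_mutex);
    pthread_mutex_unlock(&codec_init_mutex);
    return NULL;
}
/* Roughly analogous to the eavesdrop/media-bug path: sofia first, then codec init. */
static void *eavesdrop_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sofia_mutex);
    puts("eavesdrop: holding sofia_mutex, waiting for codec_init_mutex...");
    sleep(1);
    pthread_mutex_lock(&codec_init_mutex); /* blocks forever */
    pthread_mutex_unlock(&codec_init_mutex);
    pthread_mutex_unlock(&sofia_mutex);
    return NULL;
}
int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, answer_thread, NULL);
    pthread_create(&b, NULL, eavesdrop_thread, NULL);
    pthread_join(a, NULL); /* never returns: the two threads are deadlocked */
    pthread_join(b, NULL);
    return 0;
}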
The codec deadlock also seems to show up at:
- https://github.com/signalwire/freeswitch/issues/2290
- https://github.com/signalwire/freeswitch/issues/2264
The flooded listener's event queue seems to also show up at:
- https://github.com/signalwire/freeswitch/issues/2143
https://github.com/signalwire/freeswitch/pull/2338 - I made a variant of this some years ago; it helped me tune/tweak the external app that reads the ESL events via the event socket. Maybe it will help you. Your problem seems to be that your app does not read the events fast enough, which is a common problem. The queue should never get full.
Thanks @dragos-oancea, I think that's not quite the issue.
In this case, the latency is "good enough" when the threads are not deadlocked, and then it changes to "dead stop." The controlling application is running some api commands like uuid_answer and uuid_preanswer while some backgrounded api calls to originate eavesdrop() are running. This issue might be mitigated by changing the application to avoid api commands that initialize a session's codecs while backgrounded jobs are also initializing the same session's codecs.
We would hope that the concurrent threads would use the codec mutexes correctly: the first one to initialize does the work, and every other thread then just uses the initialized session codecs. That seems to be the crux of the defect.
In this case, the queue becomes full because the speed of processing after the deadlock is zero, and it cannot recover without reloading the entire module (typically by restarting the freeswitch process).
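To spell out the behavior we'd expect from the codec mutexes, it is plain first-wins initialization: the first thread through does the work and everyone else just observes the result. A hypothetical sketch (ensure_codec_ready and codec_ready are names I made up; this shows the intent, not FreeSWITCH's implementation):
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
/* Hypothetical stand-ins for session->codec_init_mutex and the "codecs ready" state. */
static pthread_mutex_t codec_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool codec_ready = false;
/* First caller initializes the codecs; later callers see they are ready and return. */
static void ensure_codec_ready(void)
{
    pthread_mutex_lock(&codec_init_mutex);
    if (!codec_ready) {
        puts("initializing codecs (first caller does the work)");
        codec_ready = true;
    }
    pthread_mutex_unlock(&codec_init_mutex);
    /* From here on, every thread can use the already-initialized codecs. */
}
static void *worker(void *arg)
{
    (void)arg;
    ensure_codec_ready();
    return NULL;
}
int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}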
Hey there, I have a PR up which may be of interest to you. https://github.com/signalwire/freeswitch/pull/2275 I just came across your issue while browsing and it sounded very familiar. In my case it was also triggered by fs_cli sessions which were exited in an unclean fashion.