
mod_event_socket can have dead listeners with overflowing event queues that fill the log with CRITs

Open · mgruberman opened this issue on Nov 09 '23 · 5 comments

Describe the bug

The mod_event_socket.c listen_list retains listeners for sockets that no longer exist, keeps filling their queues with events, and then spams the logs when the queues overflow. Both messages continue to repeat even after LFLAG_RUNNING has been cleared from the listener's flags. The only effect seems to be that the machine generates a continuous stream of CRIT events and rapidly fills up our logs: a sample machine logged the pair of CRIT messages below 12604 times in a five minute period (4.1MB of logging space). I suspect that when a listener is started for an fs_cli session and the client process is killed abruptly, the death of the socket may never be noticed by freeswitch.

2023-11-09 22:02:19.732219 89.73% [CRIT] mod_event_socket.c:397 Event enqueue ERROR [11] | [Resource temporarily unavailable] | Queue size: [100000/100000] Max queue size reached
2023-11-09 22:02:19.732219 89.73% [CRIT] mod_event_socket.c:401 Killing listener because of too many lost events. Lost [5544469] Queue size[100000/100000]
(repeats)
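
For reference, here is a self-contained model of the failure loop as I read it from these messages and the gdb dump below. It is not the actual mod_event_socket.c code; only MAX_QUEUE_LEN, lost_events, flags, and LFLAG_RUNNING mirror names from the module, everything else is a placeholder. Events keep being fanned out to a listener whose queue is already full, the lost-events counter grows, the CRIT pair is logged and the running flag is cleared, but the listener node is never unlinked from listen_list, so the next event repeats the whole cycle.

/* Self-contained model of the observed loop; NOT mod_event_socket.c source. */
#include <stdio.h>
#include <stdint.h>

#define MAX_QUEUE_LEN 100000
#define LFLAG_RUNNING (1 << 1)

typedef struct listener {
    unsigned int flags;
    uint64_t lost_events;
    unsigned int queue_len;   /* events currently queued; nobody drains them */
    struct listener *next;    /* node stays linked into listen_list */
} listener_t;

/* Called for every event fanned out to every listener in listen_list. */
static void deliver_event(listener_t *l)
{
    if (l->queue_len < MAX_QUEUE_LEN) {
        l->queue_len++;       /* normal case: event queued for the reader */
        return;
    }

    /* Queue full: count the loss and emit the CRIT pair seen in the logs. */
    l->lost_events++;
    fprintf(stderr, "[CRIT] Event enqueue ERROR | Queue size: [%u/%u]\n",
            l->queue_len, MAX_QUEUE_LEN);
    fprintf(stderr, "[CRIT] Killing listener, too many lost events. Lost [%llu]\n",
            (unsigned long long)l->lost_events);

    /* "Killing" only clears the running flag; the node is never removed
     * from the list and its queue is never drained, so the next event
     * takes this same branch again. */
    l->flags &= ~LFLAG_RUNNING;
}

int main(void)
{
    listener_t dead = { .flags = LFLAG_RUNNING, .queue_len = MAX_QUEUE_LEN };

    for (int i = 0; i < 3; i++) {
        deliver_event(&dead);  /* every event is "lost"; the CRIT pair repeats */
    }
    printf("flags=%u lost_events=%llu (still in listen_list)\n",
           dead.flags, (unsigned long long)dead.lost_events);
    return 0;
}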

The listen_list linked list contains references to sockets that no longer exist. Per lsof, there is only one established TCP connection (plus the listening socket), but listen_list holds two listener entries:

$ lsof -nPp 13 | grep TCP
freeswitc  12 root   41u     IPv4         1433856237      0t0        TCP 127.0.0.1:8021 (LISTEN)
freeswitc  12 root   47u     IPv4         1444173034      0t0        TCP 127.0.0.1:8021->127.0.0.1:56810 (ESTABLISHED)

Per the following gdb dump, the listen_list struct has a non-running listener with a continuously overflowing queue of events:

  1. listen_list->listeners on TCP connection 8021->127.0.0.1:56810 with flags (LFLAG_AUTHED|LFLAG_EVENTS|LFLAG_FULL|FLAG_ALLOW_LOG|LFLAG_RUNNING)
  2. listen_list->listeners->next on TCP connection 8021->127.0.0.1:56112 with flags (LFLAG_AUTHED|LFLAG_EVENTS|LFLAG_FULL|FLAG_ALLOW_LOG)

GDB dump (massaged for conciseness):

(gdb) p listen_list
$2 = {sock = 0x7f3224000e08, sock_mutex = 0x56136766fd90, listeners = 0x7f32246379a8, ready = 1 '\001'}

(gdb) p *listen_list->sock
$4 = {pool = 0x7f3224000c58, socketdes = 41, type = 1, protocol = 6, local_addr = 0x7f3224000d38, remote_addr = 0x7f3224000f18, timeout = -1, local_port_unknown = 0, local_interface_unknown = 0, remote_addr_unknown = 1, options = 16, inherit = 0, userdata = 0x0}

(gdb) p *listen_list->sock->local_addr
$6 = {pool = 0x7f3224000c58, hostname = 0x7f3224000df8 "127.0.0.1", servname = 0x0, port = 8021, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f3224000d7c, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 21791, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 21791, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\037U\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}

(gdb) p *listen_list->sock->remote_addr
$8 = {pool = 0x7f3224000c58, hostname = 0x0, servname = 0x0, port = 0, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f3224000f5c, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = '\000' <repeats 117 times>, __ss_align = 0}}}

(gdb) p *listen_list->listeners
$11 = {sock = 0x7f32246377b8, event_queue = 0x7f3224637bc0, log_queue = 0x7f3224720f58, pool = 0x7f32246376d8, format = EVENT_FORMAT_JSON, flag_mutex = 0x7f32247e4f68, filter_mutex = 0x7f32247e4fb8, flags = 65559, level = SWITCH_LOG_CONSOLE, ebuf = 0x0, event_list = '\001' <repeats 93 times>, allowed_event_list = '\000' <repeats 92 times>, event_hash = 0x7f3224002f70, allowed_event_hash = 0x0, allowed_api_hash = 0x0, rwlock = 0x7f3224637b60, session = 0x0, lost_events = 0, lost_logs = 0, last_flush = 0, expire_time = 0, timeout = 0, id = 0, sa = 0x7f32246378c8, remote_ip = "127.0.0.1", '\000' <repeats 40 times>, remote_port = 56810, filters = 0x0, linger_timeout = 0, next = 0x7f32244a7948, pollfd = 0x7f32247e50c0, lock_acquired = 0 '\000', finished = 0 '\000'}

(gdb) p *listen_list->listeners->sock
$13 = {pool = 0x7f32246376d8, socketdes = 47, type = 1, protocol = 6, local_addr = 0x7f3224637808, remote_addr = 0x7f32246378c8, timeout = -1, local_port_unknown = 0, local_interface_unknown = 0, remote_addr_unknown = 0, options = 520, inherit = 0, userdata = 0x0}

(gdb) p *listen_list->listeners->sock->local_addr
$15 = {pool = 0x7f32246376d8, hostname = 0x7f3224000df8 "127.0.0.1", servname = 0x0, port = 8021, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f322463784c, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 21791, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 21791, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\037U\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}

(gdb) p *listen_list->listeners->sock->remote_addr
$17 = {pool = 0x7f32246376d8, hostname = 0x0, servname = 0x0, port = 56810, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f322463790c, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 60125, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 60125, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\335\352\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}

(gdb) p *listen_list->listeners->next
$22 = {sock = 0x7f32244a7758, event_queue = 0x7f32244a7b60, log_queue = 0x7f32240c66b8, pool = 0x7f32244a7678, format = EVENT_FORMAT_JSON, flag_mutex = 0x7f322418a6c8, filter_mutex = 0x7f322418a718, flags = 65557, level = SWITCH_LOG_CONSOLE, ebuf = 0x0, event_list = '\001' <repeats 93 times>, allowed_event_list = '\000' <repeats 92 times>, event_hash = 0x7f3224002f30, allowed_event_hash = 0x0, allowed_api_hash = 0x0, rwlock = 0x7f32244a7b00, session = 0x0, lost_events = 5602791, lost_logs = 0, last_flush = 0, expire_time = 0, timeout = 0, id = 0, sa = 0x7f32244a7868, remote_ip = "127.0.0.1", '\000' <repeats 40 times>, remote_port = 56112, filters = 0x0, linger_timeout = 0, next = 0x0, pollfd = 0x7f322418a820, lock_acquired = 0 '\000', finished = 0 '\000'}

(gdb) p *listen_list->listeners->next->sock
$24 = {pool = 0x7f32244a7678, socketdes = -1, type = 1, protocol = 6, local_addr = 0x7f32244a77a8, remote_addr = 0x7f32244a7868, timeout = -1, local_port_unknown = 0, local_interface_unknown = 0, remote_addr_unknown = 0, options = 520, inherit = 0, userdata = 0x0}

(gdb) p *listen_list->listeners->next->sock->local_addr
$26 = {pool = 0x7f32244a7678, hostname = 0x7f3224000df8 "127.0.0.1", servname = 0x0, port = 8021, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f32244a77ec, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 21791, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 21791, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\037U\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}

(gdb) p *listen_list->listeners->next->sock->remote_addr
$28 = {pool = 0x7f32244a7678, hostname = 0x0, servname = 0x0, port = 56112, family = 2, salen = 16, ipaddr_len = 4, addr_str_len = 16, ipaddr_ptr = 0x7f32244a78ac, next = 0x0, sa = {sin = {sin_family = 2, sin_port = 12507, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 12507, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, __ss_padding = "\333\060\177\000\000\001", '\000' <repeats 111 times>, __ss_align = 0}}}

To Reproduce

Steps to reproduce the behavior:

  1. I am unsure what triggers this, but if I wait long enough it reproduces on random machines in my production environment.

Expected behavior

  1. Closing a socket should result in the associated listener being removed from listen_list
  2. Killing a listener should stop it from continuing to receive (and queue) events; see the sketch after this list for the kind of cleanup pass this implies
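
To make expectation 2 concrete, here is a minimal sketch of the kind of cleanup pass I would expect, written against a generic singly linked list rather than the module's real structures (the real listen_list is mutex-protected, and listeners own pools, queues, and rwlocks that would need releasing). A listener counts as dead once its socket descriptor is gone (socketdes == -1 in the gdb dump above) or its running flag has been cleared.

/* Illustration only; not a patch for mod_event_socket.c. */
#include <stdio.h>
#include <stdlib.h>

#define LFLAG_RUNNING (1 << 1)

typedef struct listener {
    int socketdes;            /* -1 once the underlying socket is gone */
    unsigned int flags;
    struct listener *next;
} listener_t;

static int listener_is_dead(const listener_t *l)
{
    return l->socketdes == -1 || !(l->flags & LFLAG_RUNNING);
}

/* Walk the list, unlink every dead listener, and return the new head. */
static listener_t *reap_dead_listeners(listener_t *head)
{
    listener_t **pp = &head;

    while (*pp) {
        if (listener_is_dead(*pp)) {
            listener_t *dead = *pp;
            *pp = dead->next; /* unlinked: no more events get queued to it */
            free(dead);       /* real code: release queue, pool, rwlock too */
        } else {
            pp = &(*pp)->next;
        }
    }
    return head;
}

int main(void)
{
    listener_t *live = calloc(1, sizeof(*live));
    listener_t *dead = calloc(1, sizeof(*dead));

    live->socketdes = 47; live->flags = LFLAG_RUNNING; live->next = dead;
    dead->socketdes = -1; dead->flags = 0;   /* like the second gdb listener */

    listener_t *head = reap_dead_listeners(live);
    printf("remaining listener socketdes=%d\n", head->socketdes);
    free(head);
    return 0;
}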

Package version or git hash

  • Version 1.10.10-release-24-4cb05e7f4a~64bit

from:

  deb [signed-by=/usr/share/keyrings/signalwire-freeswitch-repo.gpg] https://freeswitch.signalwire.com/repo/deb/debian-release/ bullseye main
  deb-src [signed-by=/usr/share/keyrings/signalwire-freeswitch-repo.gpg] https://freeswitch.signalwire.com/repo/deb/debian-release/ bullseye main

gdb dump repro script

#!/bin/bash
tmpdir=$(mktemp -d)
trap 'rm -rf $tmpdir' EXIT

cat <<'GDB_EOF' >$tmpdir/show_listen_list.gdb
define show_listener
  p "*$arg0"
  p *$arg0

  p "*$arg0->sock"
  p *$arg0->sock

  p "$arg0->sock->local_addr"
  p *$arg0->sock->local_addr

  p "$arg0->sock->remote_addr"
  p *$arg0->sock->remote_addr
end

define show_listeners
  p "show_listener $arg0"
  show_listener $arg0

  p "if $arg0->next != 0"
  if $arg0->next != 0
    print "show_listeners $arg0->next"
    show_listeners $arg0->next
  end
end

p "listen_list"
p listen_list

p "listen_list->sock"
p *listen_list->sock

p "listen_list->sock->local_addr"
p *listen_list->sock->local_addr

p "listen_list->sock->remote_addr"
p *listen_list->sock->remote_addr

show_listeners listen_list->listeners
GDB_EOF

cat <<'SH_EOF' >$tmpdir/show_listen_list_gdb.sh
#!/bin/bash
apt-get update
apt-get install -qqy gdb lsof freeswitch-dbg libfreeswitch1-dbg freeswitch-mod-event-socket-dbg

for proc_dir in $(find /proc -maxdepth 1 -type d -name '[1-9]*'); do
    if [[ /usr/bin/freeswitch == $(readlink -f $proc_dir/exe) ]] && grep -qF /usr/lib/freeswitch/mod/mod_event_socket.so $proc_dir/maps; then
        pid=$(basename $proc_dir)
        gdb -p $pid --batch -x /tmp/show_listen_list.gdb
        lsof -nPp $pid | grep TCP
    fi
done
SH_EOF
chmod +x $tmpdir/show_listen_list_gdb.sh

docker cp $tmpdir/show_listen_list.gdb    cx-freeswitch:/tmp/show_listen_list.gdb
docker cp $tmpdir/show_listen_list_gdb.sh cx-freeswitch:/tmp/show_listen_list_gdb.sh
docker exec cx-freeswitch bash  /tmp/show_listen_list_gdb.sh

lsof -nPp $(docker inspect -f '{{.State.Pid}}' cx-freeswitch) | grep TCP

mgruberman · Nov 09 '23

~~PS, I do wonder whether running freeswitch and/or fs_cli inside a container without an effective --init to do waitpid() could keep a client-side failure from ever signalling to freeswitch that the socket had closed.~~

Even if the above is true, I would still expect the dead listener to eventually be removed.

mgruberman · Nov 10 '23

Looking at the threads, it seems the listener thread is deadlocked with some other threads on a session:

  1. session->codec_init_mutex in switch_core_media_set_codec at switch_core_media.c:3607 for the listener thread for a uuid_answer on a send/receive thread (Thread 26)
  2. tech_pvt->sofia_mutex in sofia_receive_message at mod_sofia.c:1526 for &park on a sendonly eavesdrop channel (Thread 27)
  3. tech_pvt->sofia_mutex in sofia_receive_message at mod_sofia.c:1526 for setting up a media bug (Thread 28)

mod_event_socket-threads.txt

The codec deadlock also seems to show up in:

  • https://github.com/signalwire/freeswitch/issues/2290
  • https://github.com/signalwire/freeswitch/issues/2264

The flooded listener event queue also seems to show up in:

  • https://github.com/signalwire/freeswitch/issues/2143

mgruberman · Nov 15 '23

https://github.com/signalwire/freeswitch/pull/2338 - I made a variant of this some years ago; it helped me tune/tweak the external app that reads the ESL events via the event socket. Maybe it will help you. Your problem seems to be that your app does not read the events fast enough, which is a common problem; the queue should never get full.

dragos-oancea · Dec 19 '23

Thanks @dragos-oancea, I think that's not quite the issue.

In this case, the latency is "good enough" when the threads are not deadlocked, and then drops to a dead stop once they are. The controlling application runs api commands like uuid_answer and uuid_preanswer while some backgrounded api calls to originate eavesdrop() are running. This issue might be mitigated by changing the application to avoid api commands that initialize the session's codecs while other backgrounded jobs are also initializing them.

We would hope that the concurrent threads use the codec mutexes correctly: the first one to initialize does the work, and every other thread then just uses the initialized session codecs. That this does not happen seems to be the crux of the defect; a generic sketch of the pattern we would expect follows below.
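
For what it's worth, the pattern we would hope for is plain mutex-guarded one-time initialization, roughly as in the generic pthreads sketch below. This is not FreeSWITCH's actual codec code; codec_init_mutex, codecs_initialized, and init_codecs_once are stand-ins, and the comments only restate what the thread dump suggests is going wrong.

/* Generic sketch of mutex-guarded one-time initialization; not the actual
 * switch_core_media codec path. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t codec_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static int codecs_initialized = 0;   /* stands in for the session's state */

static void init_codecs_once(void)
{
    pthread_mutex_lock(&codec_init_mutex);
    if (!codecs_initialized) {
        /* expensive codec negotiation/setup would happen exactly once here */
        codecs_initialized = 1;
    }
    pthread_mutex_unlock(&codec_init_mutex);
    /* Every later caller just uses the initialized codecs. The thread dump
     * instead shows threads blocked on codec_init_mutex and on
     * tech_pvt->sofia_mutex waiting on each other, so this hand-off
     * never completes. */
}

static void *worker(void *arg)
{
    (void)arg;
    init_codecs_once();
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    for (int i = 0; i < 3; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    printf("codecs_initialized=%d\n", codecs_initialized);
    return 0;
}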

In this case the queue becomes full because the speed of processing after the deadlock is zero, and it cannot recover without reloading the entire module (typically by restarting the freeswitch process).

mgruberman · Jan 09 '24

Hey there, I have a PR up which may be of interest to you: https://github.com/signalwire/freeswitch/pull/2275. I came across your issue while browsing and it sounded very familiar. In my case it was also triggered by fs_cli sessions that were exited in an unclean fashion.

olofssonanton · Jan 29 '24