MDSplus dispatcher opens lots of files
Affiliation GA/DIII-D
Version(s) Affected Our production server, atlas, currently suffers this bug. @sflanagan will provide the version at some point.
Platform RHEL8
Describe the bug The MDSplus dispatcher leaves files open, eventually hitting the open-file limit. Somehow the limit on the number of open files is not honored by the OS or dispatcher, even after attempting to increase the limit in various ways in various configuration files.
To Reproduce Our production server, atlas, currently suffers this bug. Details to be provided at some point by @sflanagan .
Expected behavior I expect that the dispatcher will close its own files as it goes, without accumulating open files.
Additional context Any fix provided needs to be on a branch starting from the version that is currently on atlas.
I think you need to add dispatch/close
to the script that is doing the dispatch/build.
TCL> help dispatch/close
Command: DISPATCH/CLOSE
Purpose: Instruct MDSplus actions server(s) to close all open trees.
Format: DISPATCH/CLOSE[/SERVER=(server1,server2...)]
Description:
The DISPATCH/CLOSE can be used to instruct an MDSplus action server to close all MDSplus
trees the server currently has open. If the /SERVER qualifier is not included to specify
any action servers then the current dispatch table (See: DISPATCH/BUILD and
DISPATCH/PHASE commands) is inspected to construct a list of servers that had actions
dispatched to them to send close instructions to. The DISPATCH/CLOSE command is typically
used at the end of a data acquisition cycle to instruct the action server to close
any open trees.
TCL>
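For example, the end of the dispatch script might look something like this (tree, shot, and phase names here are just placeholders):
TCL> set tree mytree /shot=12345
TCL> dispatch/build
TCL> dispatch/phase init
TCL> dispatch/close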
I am happy to meet this afternoon to discuss...
According to Mitchell's post on another issue, Atlas has been running alpha-7.139.59 since November 2023, and Atlas is on RHEL8.
Issue 2704, post of 13-Feb-2024
There's likely more to this than open files, as enabling core dumps against the systemd services running the "dispatcher" (i.e. mdsip -s on atlas:8002) and the "mdsloader" (i.e. mdsip -s on atlas:8003) comes up with some potential issues in mdstcpip/mdsipshr/ConnectToMds.c.
I haven't looked at this too closely, but for the sake of notes I'll just quickly post the coredumpctl gdb <pid>; bt
of the last core dump produced (this one is from the port 8002 side):
[root@atlas ~]# coredumpctl gdb 865034
PID: 865034 (mdsip)
UID: 117 (mdsadmin)
GID: 511 (mdsdev)
Signal: 11 (SEGV)
Timestamp: Mon 2024-03-25 21:44:26 PDT (12h ago)
Command Line: /usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts
Executable: /usr/local/mdsplus/bin64/mdsip
Control Group: /
Slice: -.slice
Boot ID: 7392c4d0130e43a9a07fbde2f35cd9c7
Machine ID: 1a91ee619a7f42d0b73da694e83b0be3
Hostname: atlas.gat.com
Storage: /var/lib/systemd/coredump/core.mdsip.117.7392c4d0130e43a9a07fbde2f35cd9c7.865034.1711428266000000.lz4
Message: Process 865034 (mdsip) of user 117 dumped core.
Stack trace of thread 866623:
#0 0x00001468186e2c61 do_login (libMdsIpShr.so)
#1 0x000014681530a01e server_connect (libMdsServerShr.so)
#2 0x000014681530ba91 ServerConnect (libMdsServerShr.so)
#3 0x000014681530bc57 ServerSendMessage (libMdsServerShr.so)
#4 0x000014681530cdd7 ServerDispatchAction (libMdsServerShr.so)
#5 0x0000146815307ec6 dispatch (libMdsServerShr.so)
#6 0x000014681530834c ServerDispatchPhase (libMdsServerShr.so)
#7 0x0000146814868521 TclDispatch_phase (libtcl_commands.so)
#8 0x00001468150f1a9a dispatchToHandler (libMdsdcl.so)
#9 0x00001468150f30b3 cmdExecute (libMdsdcl.so)
#10 0x00001468150f6f04 mdsdcl_do_command_extra_args (libMdsdcl.so)
#11 0x00001468150f2abb mdsdcl_do_command (libMdsdcl.so)
#12 0x0000146815304449 DoSrvCommand (libMdsServerShr.so)
#13 0x0000146815304b29 WorkerThread (libMdsServerShr.so)
#14 0x000014681808b1ca start_thread (libpthread.so.0)
#15 0x0000146817cf7e73 __clone (libc.so.6)
Stack trace of thread 865034:
#0 0x0000146817de4a3f __select (libc.so.6)
#1 0x0000146816d8e877 listen_loop (libMdsIpTCP.so)
#2 0x0000146816d8ecc6 io_listen (libMdsIpTCP.so)
#3 0x00000000004008e8 main (mdsip)
#4 0x0000146817cf8d85 __libc_start_main (libc.so.6)
#5 0x000000000040097e _start (mdsip)
Stack trace of thread 866830:
#0 0x000014681809145c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000146815306f07 send_monitor_pop (libMdsServerShr.so)
#2 0x00001468153076ab send_monitor_thread (libMdsServerShr.so)
#3 0x000014681808b1ca start_thread (libpthread.so.0)
#4 0x0000146817cf7e73 __clone (libc.so.6)
Stack trace of thread 1046533:
#0 0x0000146817de4a3f __select (libc.so.6)
#1 0x0000146816d8e154 io_recv_to (libMdsIpTCP.so)
#2 0x00001468186d9304 get_bytes_to (libMdsIpShr.so)
#3 0x00001468186d93e2 get_bytes_to (libMdsIpShr.so)
#4 0x00001468186e2eba ConnectionDoMessage (libMdsIpShr.so)
#5 0x0000146816d8d492 client_thread (libMdsIpTCP.so)
#6 0x000014681808b1ca start_thread (libpthread.so.0)
#7 0x0000146817cf7e73 __clone (libc.so.6)
Stack trace of thread 866831:
#0 0x0000146817de4a3f __select (libc.so.6)
#1 0x000014681530b42e receiver_thread (libMdsServerShr.so)
#2 0x000014681808b1ca start_thread (libpthread.so.0)
#3 0x0000146817cf7e73 __clone (libc.so.6)
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-19.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/mdsplus/bin64/mdsip...done.
[New LWP 866623]
[New LWP 865034]
[New LWP 866830]
[New LWP 1046533]
[New LWP 866831]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 do_login (c=0x1) at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:78
78 /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c: No such file or directory.
[Current thread is 1 (Thread 0x146814c71700 (LWP 866623))]
Missing separate debuginfos, use: yum debuginfo-install mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
(gdb) bt
#0 do_login (c=0x1) at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:78
#1 ConnectToMds (hostin=hostin@entry=0x1468080fdea0 "atlas.gat.com:8003") at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:161
#2 0x000014681530a01e in server_connect (server=server@entry=0x1468080fdea0 "atlas.gat.com:8003", addr=1755710912, port=<optimized out>)
at /git/mdsplus/servershr/ServerSendMessage.c:529
#3 0x000014681530ba91 in ServerConnect (server_in=server_in@entry=0x146814c70530 "mdsloader") at /git/mdsplus/servershr/ServerSendMessage.c:557
#4 0x000014681530bc57 in ServerSendMessage (msgid=msgid@entry=0x0, server=server@entry=0x146814c70530 "mdsloader", op=op@entry=2,
retstatus=retstatus@entry=0x146808999338, lock=lock@entry=0x146808999358, conid_out=conid_out@entry=0x14680899933c,
callback_done=0x146815307b60 <action_done>, callback_param=0xa7, callback_before=0x146815306d50 <before>, numargs_in=3)
at /git/mdsplus/servershr/ServerSendMessage.c:339
#5 0x000014681530cdd7 in ServerDispatchAction (id=id@entry=0x0, server=0x146814c70530 "mdsloader", tree=0x146808992fb4 "D3D", shot=<optimized out>,
nid=<optimized out>, ast=ast@entry=0x146815307b60 <action_done>, astprm=0xa7, retstatus=0x146808999338, lock=0x146808999358,
socket=0x14680899933c, before_ast=0x146815306d50 <before>) at /git/mdsplus/servershr/ServerDispatchAction.c:67
#6 0x0000146815307ec6 in dispatch (i=i@entry=167) at /git/mdsplus/servershr/ServerDispatchPhase.c:201
#7 0x000014681530834c in ServerDispatchPhase (id=id@entry=0x0, vtable=<optimized out>, phasenam=phasenam@entry=0x1468080b36b0 "TEST",
noact_in=noact_in@entry=0 '\000', sync=sync@entry=1, output_rtn=output_rtn@entry=0x146814867740 <printIt>, monitor=0x0)
at /git/mdsplus/servershr/ServerDispatchPhase.c:536
#8 0x0000146814868521 in TclDispatch_phase (ctx=<optimized out>, error=0x146814c70c28, output=<optimized out>) at /git/mdsplus/tcl/tcl_dispatch.c:411
#9 0x00001468150f1a9a in dispatchToHandler (getlineinfo=0x0, getline=0x0, output=0x146814c70c20, error=0x146814c70c28, prompt=<optimized out>,
cmdDef=<optimized out>, cmd=0x1468080fdf60, image=<optimized out>) at /git/mdsplus/mdsdcl/cmdExecute.c:935
#10 processCommand (docList=docList@entry=0x146808011e50, verbNode_in=<optimized out>, cmd=cmd@entry=0x1468080065c0,
cmdDef=cmdDef@entry=0x146808006610, prompt=prompt@entry=0x146814c70c10, error=error@entry=0x146814c70c28, output=0x146814c70c20, getline=0x0,
getlineinfo=0x0) at /git/mdsplus/mdsdcl/cmdExecute.c:1165
#11 0x00001468150f30b3 in cmdExecute (cmd=0x1468080065c0, prompt_out=prompt_out@entry=0x0, error_out=error_out@entry=0x146814c70d70,
output_out=output_out@entry=0x146814c70d78, getline=getline@entry=0x0, getlineinfo=getlineinfo@entry=0x0) at /git/mdsplus/mdsdcl/cmdExecute.c:1453
#12 0x00001468150f6f04 in mdsdcl_do_command_extra_args (command=0x146810008db0 "DISPATCH/PHASE TEST", prompt=prompt@entry=0x0,
error=error@entry=0x146814c70d70, output=output@entry=0x146814c70d78, getline=getline@entry=0x0, getlineInfo=getlineInfo@entry=0x0)
at mdsdcl/yylex/cmdParse.y:213
#13 0x00001468150f2abb in mdsdcl_do_command (command=<optimized out>) at /git/mdsplus/mdsdcl/cmdExecute.c:1350
#14 0x0000146815304449 in DoSrvCommand (job_in=job_in@entry=0x1468100430b0) at /git/mdsplus/servershr/ServerQAction.c:642
#15 0x0000146815304b29 in WorkerThread (arg=<optimized out>) at /git/mdsplus/servershr/ServerQAction.c:786
#16 0x000014681808b1ca in start_thread () from /usr/lib64/libpthread.so.0
#17 0x0000146817cf7e73 in clone () from /usr/lib64/libc.so.6
(gdb)
EDIT:
Note that as of the time the following lsof was pulled, those PIDs are for:
- dispatcher: systemd service mds-8002 on PID 3730021
- mdsloader: systemd service mds-8003 on PID 3730036
[root@atlas system]# ps -fp 3730021
UID PID PPID C STIME TTY TIME CMD
mdsadmin 3730021 1 0 10:19 ? 00:00:01 /usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts -c 0
[root@atlas system]# ps -fp 3730036
UID PID PPID C STIME TTY TIME CMD
mdsadmin 3730036 1 0 10:19 ? 00:00:01 /usr/local/mdsplus/bin/mdsip -s -p 8003 -h /etc/mdsip.hosts -c 0
A current lsof -u mdsadmin | grep TCP:
[root@atlas ~]# lsof -u mdsadmin | grep TCP
mdsip 3730021 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3730021 mdsadmin 3u IPv4 12579304 0t0 TCP *:8002 (LISTEN)
mdsip 3730021 mdsadmin 22u IPv4 12298036 0t0 TCP *:8801 (LISTEN)
mdsip 3730021 mdsadmin 28u IPv4 12428281 0t0 TCP atlas.gat.com:60656->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 102u IPv4 12298128 0t0 TCP atlas.gat.com:60622->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 103u IPv4 12298039 0t0 TCP atlas.gat.com:40034->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 104u IPv4 12410125 0t0 TCP atlas.gat.com:8801->atlas.gat.com:41562 (ESTABLISHED)
mdsip 3730021 mdsadmin 105u IPv4 12298131 0t0 TCP atlas.gat.com:60636->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 106u IPv4 12648855 0t0 TCP atlas.gat.com:51018->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 107u IPv4 12648858 0t0 TCP atlas.gat.com:51026->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 108u IPv4 12298136 0t0 TCP atlas.gat.com:60650->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 109u IPv4 12298160 0t0 TCP atlas.gat.com:52952->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 110u IPv4 12298181 0t0 TCP atlas.gat.com:49554->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 111u IPv4 12420686 0t0 TCP atlas.gat.com:54658->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 112u IPv4 12420672 0t0 TCP atlas.gat.com:49558->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 115u IPv4 12420676 0t0 TCP atlas.gat.com:57096->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 116u IPv4 12298188 0t0 TCP atlas.gat.com:57098->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 117u IPv4 12420679 0t0 TCP atlas.gat.com:57114->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 118u IPv4 12420682 0t0 TCP atlas.gat.com:57122->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 119u IPv4 12298192 0t0 TCP atlas.gat.com:54668->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 120u IPv4 12420690 0t0 TCP atlas.gat.com:38610->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 121u IPv4 12298196 0t0 TCP atlas.gat.com:38614->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 122u IPv4 12427862 0t0 TCP atlas.gat.com:43624->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 123u IPv4 12427868 0t0 TCP atlas.gat.com:43626->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 124u IPv4 12427871 0t0 TCP atlas.gat.com:43630->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 125u IPv4 12427877 0t0 TCP atlas.gat.com:43644->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 126u IPv4 12427884 0t0 TCP atlas.gat.com:43650->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 128u IPv4 12427961 0t0 TCP atlas.gat.com:47100->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 129u IPv4 12298215 0t0 TCP atlas.gat.com:47072->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 130u IPv4 12427957 0t0 TCP atlas.gat.com:47092->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 131u IPv4 12427954 0t0 TCP atlas.gat.com:47086->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 132u IPv4 12428079 0t0 TCP atlas.gat.com:47680->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 133u IPv4 12428055 0t0 TCP atlas.gat.com:33592->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 134u sock 0,9 0t0 12428052 protocol: TCP
mdsip 3730021 mdsadmin 135u IPv4 12428059 0t0 TCP atlas.gat.com:33608->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 136u IPv4 12428094 0t0 TCP atlas.gat.com:46216->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 137u IPv4 12428232 0t0 TCP atlas.gat.com:46218->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 138u IPv4 12428235 0t0 TCP atlas.gat.com:46228->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 139u IPv4 12428240 0t0 TCP atlas.gat.com:46234->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 140u IPv4 12648863 0t0 TCP atlas.gat.com:51030->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 142u IPv4 12428244 0t0 TCP atlas.gat.com:46250->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 143u IPv4 12428251 0t0 TCP atlas.gat.com:46266->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 144u IPv4 12428270 0t0 TCP atlas.gat.com:60642->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 145u IPv4 12728337 0t0 TCP atlas.gat.com:46278->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 146u IPv4 12728340 0t0 TCP atlas.gat.com:46282->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 147u IPv4 12728343 0t0 TCP atlas.gat.com:46292->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 148u IPv4 12428267 0t0 TCP atlas.gat.com:46306->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 149u IPv4 12428273 0t0 TCP atlas.gat.com:60654->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 153u IPv4 12740616 0t0 TCP atlas.gat.com:60674->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 154u IPv4 12428284 0t0 TCP atlas.gat.com:60666->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 155u IPv4 12428287 0t0 TCP atlas.gat.com:60670->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 157u IPv4 12728345 0t0 TCP atlas.gat.com:60684->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 158u IPv4 12715420 0t0 TCP atlas.gat.com:60694->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 159u IPv4 12715423 0t0 TCP atlas.gat.com:60710->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 160u IPv4 12715428 0t0 TCP atlas.gat.com:60720->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 161u IPv4 12715431 0t0 TCP atlas.gat.com:44726->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 162u IPv4 12715435 0t0 TCP atlas.gat.com:44736->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 163u IPv4 12401444 0t0 TCP atlas.gat.com:44744->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 164u IPv4 12401452 0t0 TCP atlas.gat.com:44762->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 165u IPv4 12401459 0t0 TCP atlas.gat.com:44772->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 166u IPv4 12401449 0t0 TCP atlas.gat.com:44760->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 167u IPv4 12401464 0t0 TCP atlas.gat.com:44788->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 168u IPv4 12401467 0t0 TCP atlas.gat.com:44804->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 169u IPv4 12715452 0t0 TCP atlas.gat.com:44806->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 170u IPv4 12715460 0t0 TCP atlas.gat.com:EtherNet/IP-2->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 171u IPv4 12715463 0t0 TCP atlas.gat.com:44832->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 172u IPv4 12401471 0t0 TCP atlas.gat.com:57696->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 173u IPv4 12401475 0t0 TCP atlas.gat.com:57708->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 174u IPv4 12401480 0t0 TCP atlas.gat.com:57712->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 175u IPv4 12401483 0t0 TCP atlas.gat.com:57720->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 177u IPv4 12401488 0t0 TCP atlas.gat.com:57734->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 178u IPv4 12401493 0t0 TCP atlas.gat.com:57750->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 179u IPv4 12401498 0t0 TCP atlas.gat.com:57752->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 180u IPv4 12401501 0t0 TCP atlas.gat.com:57760->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 181u IPv4 12401505 0t0 TCP atlas.gat.com:57768->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 182u IPv4 12401508 0t0 TCP atlas.gat.com:57782->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 183u IPv4 12715507 0t0 TCP atlas.gat.com:57784->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 184u IPv4 12715550 0t0 TCP atlas.gat.com:60928->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 185u IPv4 12715558 0t0 TCP atlas.gat.com:60946->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 186u IPv4 12715553 0t0 TCP atlas.gat.com:60932->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 188u IPv4 12715562 0t0 TCP atlas.gat.com:60948->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 189u IPv4 12715568 0t0 TCP atlas.gat.com:60964->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 190u IPv4 12715573 0t0 TCP atlas.gat.com:60966->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 191u IPv4 12401511 0t0 TCP atlas.gat.com:60980->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 192u IPv4 12401515 0t0 TCP atlas.gat.com:60994->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 193u IPv4 12401608 0t0 TCP atlas.gat.com:59376->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 194u IPv4 12401605 0t0 TCP atlas.gat.com:59374->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730026 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3730026 mdsadmin 3u IPv4 12568210 0t0 TCP *:8012 (LISTEN)
mdsip 3730036 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3730036 mdsadmin 3u IPv4 12568211 0t0 TCP *:8003 (LISTEN)
mdsip 3730036 mdsadmin 4u IPv4 12609768 0t0 TCP atlas.gat.com:8003->atlas.gat.com:40034 (ESTABLISHED)
mdsip 3730036 mdsadmin 61u IPv4 12628995 0t0 TCP atlas.gat.com:53526->atlas.gat.com:sunwebadmin (CLOSE_WAIT)
mdsip 3730036 mdsadmin 62u IPv4 12628998 0t0 TCP atlas.gat.com:41562->atlas.gat.com:8801 (ESTABLISHED)
mdsip 3730036 mdsadmin 63u IPv4 12609794 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60622 (ESTABLISHED)
mdsip 3730036 mdsadmin 64u IPv4 12609800 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60650 (ESTABLISHED)
mdsip 3730036 mdsadmin 65u IPv4 12230182 0t0 TCP atlas.gat.com:8003->atlas.gat.com:52952 (ESTABLISHED)
mdsip 3730036 mdsadmin 66u IPv4 12230189 0t0 TCP atlas.gat.com:8003->atlas.gat.com:49554 (ESTABLISHED)
mdsip 3730036 mdsadmin 67u IPv4 12230192 0t0 TCP atlas.gat.com:8003->atlas.gat.com:49558 (ESTABLISHED)
mdsip 3730036 mdsadmin 68u IPv4 12230203 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57096 (ESTABLISHED)
mdsip 3730036 mdsadmin 69u IPv4 12230209 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57114 (ESTABLISHED)
mdsip 3730036 mdsadmin 70u IPv4 12230212 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57122 (ESTABLISHED)
mdsip 3730036 mdsadmin 71u IPv4 12230250 0t0 TCP atlas.gat.com:8003->atlas.gat.com:54658 (ESTABLISHED)
mdsip 3730036 mdsadmin 72u IPv4 12230260 0t0 TCP atlas.gat.com:8003->atlas.gat.com:54668 (ESTABLISHED)
mdsip 3730036 mdsadmin 73u IPv4 12230263 0t0 TCP atlas.gat.com:8003->atlas.gat.com:38610 (ESTABLISHED)
mdsip 3730036 mdsadmin 74u IPv4 12230266 0t0 TCP atlas.gat.com:8003->atlas.gat.com:38614 (ESTABLISHED)
mdsip 3730036 mdsadmin 75u IPv4 12715064 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43624 (ESTABLISHED)
mdsip 3730036 mdsadmin 76u IPv4 12715069 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43626 (ESTABLISHED)
mdsip 3730036 mdsadmin 77u IPv4 12715076 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43644 (ESTABLISHED)
mdsip 3730036 mdsadmin 78u IPv4 12715079 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43650 (ESTABLISHED)
mdsip 3730036 mdsadmin 79u IPv4 12230269 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47072 (ESTABLISHED)
mdsip 3730036 mdsadmin 80u IPv4 12715164 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47086 (ESTABLISHED)
mdsip 3730036 mdsadmin 81u IPv4 12715167 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47092 (ESTABLISHED)
mdsip 3730036 mdsadmin 82u IPv4 12715170 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47100 (ESTABLISHED)
mdsip 3730036 mdsadmin 83u IPv4 12582098 0t0 TCP atlas.gat.com:8003->atlas.gat.com:33592 (ESTABLISHED)
mdsip 3730036 mdsadmin 84u IPv4 12582101 0t0 TCP atlas.gat.com:8003->atlas.gat.com:33608 (ESTABLISHED)
mdsip 3730036 mdsadmin 85u IPv4 12582112 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47680 (ESTABLISHED)
mdsip 3730036 mdsadmin 86u IPv4 12582115 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46216 (ESTABLISHED)
mdsip 3730036 mdsadmin 87u IPv4 12582118 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46218 (ESTABLISHED)
mdsip 3730036 mdsadmin 88u IPv4 12582124 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46234 (ESTABLISHED)
mdsip 3730036 mdsadmin 89u IPv4 12582127 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46250 (ESTABLISHED)
mdsip 3730036 mdsadmin 90u IPv4 12582130 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46266 (ESTABLISHED)
mdsip 3730036 mdsadmin 91u IPv4 12625488 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46278 (ESTABLISHED)
mdsip 3730036 mdsadmin 92u IPv4 12582136 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60642 (ESTABLISHED)
mdsip 3730036 mdsadmin 93u IPv4 12582139 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60654 (ESTABLISHED)
mdsip 3730036 mdsadmin 94u IPv4 12582142 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60656 (ESTABLISHED)
mdsip 3730036 mdsadmin 95u IPv4 12582151 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60674 (ESTABLISHED)
mdsip 3730036 mdsadmin 96u IPv4 12582154 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60694 (ESTABLISHED)
mdsip 3730036 mdsadmin 97u IPv4 12582159 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60710 (ESTABLISHED)
mdsip 3730036 mdsadmin 98u IPv4 12582162 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60720 (ESTABLISHED)
mdsip 3730036 mdsadmin 99u IPv4 12582165 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44726 (ESTABLISHED)
mdsip 3730036 mdsadmin 100u IPv4 12582168 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44736 (ESTABLISHED)
mdsip 3730036 mdsadmin 101u IPv4 12625500 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44744 (ESTABLISHED)
mdsip 3730036 mdsadmin 102u IPv4 12625504 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44760 (ESTABLISHED)
mdsip 3730036 mdsadmin 103u IPv4 12625510 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44772 (ESTABLISHED)
mdsip 3730036 mdsadmin 104u IPv4 12625513 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44788 (ESTABLISHED)
mdsip 3730036 mdsadmin 105u IPv4 12625516 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44804 (ESTABLISHED)
mdsip 3730036 mdsadmin 106u IPv4 12582173 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44806 (ESTABLISHED)
mdsip 3730036 mdsadmin 107u IPv4 12582177 0t0 TCP atlas.gat.com:8003->atlas.gat.com:EtherNet/IP-2 (ESTABLISHED)
mdsip 3730036 mdsadmin 108u IPv4 12625519 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57696 (ESTABLISHED)
mdsip 3730036 mdsadmin 109u IPv4 12625522 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57708 (ESTABLISHED)
mdsip 3730036 mdsadmin 110u IPv4 12625525 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57712 (ESTABLISHED)
mdsip 3730036 mdsadmin 111u IPv4 12625528 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57720 (ESTABLISHED)
mdsip 3730036 mdsadmin 112u IPv4 12625531 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57734 (ESTABLISHED)
mdsip 3730036 mdsadmin 113u IPv4 12625534 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57750 (ESTABLISHED)
mdsip 3730036 mdsadmin 114u IPv4 12625537 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57752 (ESTABLISHED)
mdsip 3730036 mdsadmin 115u IPv4 12625543 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57768 (ESTABLISHED)
mdsip 3730036 mdsadmin 116u IPv4 12625546 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57782 (ESTABLISHED)
mdsip 3730036 mdsadmin 117u IPv4 12514018 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57784 (ESTABLISHED)
mdsip 3730036 mdsadmin 118u IPv4 12514021 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60928 (ESTABLISHED)
mdsip 3730036 mdsadmin 119u IPv4 12514024 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60932 (ESTABLISHED)
mdsip 3730036 mdsadmin 120u IPv4 12514027 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60946 (ESTABLISHED)
mdsip 3730036 mdsadmin 122u IPv4 12514031 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60948 (ESTABLISHED)
mdsip 3730036 mdsadmin 123u IPv4 12609797 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60636 (ESTABLISHED)
mdsip 3730036 mdsadmin 124u IPv4 12514034 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60964 (ESTABLISHED)
mdsip 3730036 mdsadmin 125u IPv4 12514037 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60966 (ESTABLISHED)
mdsip 3730036 mdsadmin 126u IPv4 12625549 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60980 (ESTABLISHED)
mdsip 3730036 mdsadmin 127u IPv4 12625552 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60994 (ESTABLISHED)
mdsip 3730036 mdsadmin 128u IPv4 12625561 0t0 TCP atlas.gat.com:8003->atlas.gat.com:59374 (ESTABLISHED)
mdsip 3730036 mdsadmin 129u IPv4 12230206 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57098 (ESTABLISHED)
mdsip 3730036 mdsadmin 130u IPv4 12625564 0t0 TCP atlas.gat.com:8003->atlas.gat.com:59376 (ESTABLISHED)
mdsip 3730036 mdsadmin 131u IPv4 12401620 0t0 TCP atlas.gat.com:8003->atlas.gat.com:51018 (ESTABLISHED)
mdsip 3730036 mdsadmin 132u IPv4 12401626 0t0 TCP atlas.gat.com:8003->atlas.gat.com:51030 (ESTABLISHED)
mdsip 3730036 mdsadmin 138u IPv4 12715072 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43630 (ESTABLISHED)
mdsip 3730036 mdsadmin 150u IPv4 12582121 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46228 (ESTABLISHED)
mdsip 3730036 mdsadmin 153u IPv4 12582145 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60666 (ESTABLISHED)
mdsip 3730036 mdsadmin 155u IPv4 12625491 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46282 (ESTABLISHED)
mdsip 3730036 mdsadmin 156u IPv4 12625494 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46292 (ESTABLISHED)
mdsip 3730036 mdsadmin 157u IPv4 12582133 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46306 (ESTABLISHED)
mdsip 3730036 mdsadmin 160u IPv4 12582148 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60670 (ESTABLISHED)
mdsip 3730036 mdsadmin 164u IPv4 12625497 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60684 (ESTABLISHED)
mdsip 3730036 mdsadmin 172u IPv4 12625507 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44762 (ESTABLISHED)
mdsip 3730036 mdsadmin 178u IPv4 12582180 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44832 (ESTABLISHED)
mdsip 3730036 mdsadmin 186u IPv4 12625540 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57760 (ESTABLISHED)
mdsip 3730036 mdsadmin 201u IPv4 12401623 0t0 TCP atlas.gat.com:8003->atlas.gat.com:51026 (ESTABLISHED)
mdsip 3767135 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3767135 mdsadmin 0u IPv4 12230301 0t0 TCP atlas.gat.com:mdsip->iris29.gat.com:59836 (ESTABLISHED)
mdsip 3771645 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3771645 mdsadmin 0u IPv4 12689154 0t0 TCP atlas.gat.com:mdsip->iris29.gat.com:60068 (ESTABLISHED)
Systemd contents:
mds-8002:
[Unit]
Description=MDSplus Dispatcher Service
After=network.target
[Service]
User=mdsadmin
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8002 -block
[Install]
WantedBy=multi-user.target
mds-8003:
[Unit]
Description=MDSplus Loader Service
After=network.target
[Service]
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8003 -block
[Install]
WantedBy=multi-user.target
OS (atlas.gat.com):
[root@atlas system]# cat /etc/*-release
NAME="Red Hat Enterprise Linux"
VERSION="8.8 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
Red Hat Enterprise Linux release 8.8 (Ootpa)
Red Hat Enterprise Linux release 8.8 (Ootpa)
MDSplus Version:
[root@atlas system]# rpm -qa | grep mds
mdsplus-alpha-mssql-7.139-59.el8.x86_64
mdsplus-alpha-idl_bin-7.139-59.el8.x86_64
mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
mdsplus-alpha-d3d-7.139-59.el8.noarch
mdsplus-alpha-kernel-7.139-59.el8.noarch
mdsplus-alpha-motif-7.139-59.el8.noarch
mdsplus-alpha-repo-7.139-40.el8.noarch
mdsplus-alpha-motif_bin-7.139-59.el8.x86_64
mdsplus-alpha-python-7.139-59.el8.noarch
mdsplus-alpha-idl-7.139-59.el8.noarch
To explain mdsipd-dispatching:
#!/bin/sh -f
#
# MDSplus server daemon procedure
#
# Parameters:
#
# $1 = service or portnumber
if [ $# -lt 1 ]
then
echo ""
echo "Correct usage is $0 portnumber [-block]"
echo example: $0 8001
echo ""
exit
fi
MDSPLUS_DIR=/usr/local/mdsplus
# Note that we source the setup-server.sh file because this is to be a server
# process, not a regular data server process.
. $MDSPLUS_DIR/setup-server.sh
logfile=/var/log/mdsplus/mdsip-server/$1.log
echo "" >> $logfile 2>> $logfile
echo "Telling any existing server on this port to shut down..." >> $logfile 2>> $logfile
/usr/local/mdsplus/bin/mdstcl stop server localhost:$1 >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "############# END OF OLD LOGFILE #############" >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "############ START OF NEW LOGFILE ############" >> $logfile 2>> $logfile
echo "Starting mdsip -s on port "$1" at "`date` >> $logfile 2>> $logfile
echo "--------------------" >> $logfile 2>> $logfile
echo "Showing ulimit -a" >> $logfile 2>> $logfile
ulimit -a >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "--------------------" >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
# if -block is specified, then don't run in the background
if [ $# -gt 1 ]
then
if [ $2 = "-block" ]
then
exec $MDSPLUS_DIR/bin/mdsip -s -p $1 -h /etc/mdsip.hosts -c 0 >> $logfile 2>> $logfile
exit 0
fi
fi
exec $MDSPLUS_DIR/bin/mdsip -s -p $1 -h /etc/mdsip.hosts -c 0 >> $logfile 2>> $logfile &
Please note that the error mentioned occurs after ~1,000 action nodes are triggered.
This happens whether it is 1k actions in a single shot or across multiple shots.
Thanks for all the info.
A quick workaround:
- add Restart=always to your service files
- after the last dispatch/phase, add exit
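In other words (a sketch based on the mds-8002 unit shown earlier; the unit name and phase name are only placeholders), the [Service] section gains:
Restart=always
and the TCL script that does the dispatching ends with:
TCL> dispatch/build
TCL> dispatch/phase init
TCL> exit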
Hi @sflanagan -- Josh has succeeded in reproducing the issue with the current alpha on Ubuntu. We are now debugging the problem. And will post additional updates here later this afternoon.
Took some time to configure my dev environment, but I am now also able to reproduce the bug. So will attach the debugger and figure out how the code is leaving dangling sockets.
Progress has been made on debugging the issue, but still more to do to locate the root cause of the dangling sockets. Have provided my initial findings to Josh and Stephen for review and suggestions.
@mwinkel-dev Glad you were able to reproduce. Did the workaround proposed by @joshStillerman work?
Hi @smithsp and @sflanagan -- I will test Josh's workaround right now. I'm sure it will work.
I have been manually killing the mdsip
process between experiments and that of course cleans up the dangling sockets. Nonetheless, you have a very good point that I should also automate that by using Josh's workaround (which kills the mdsip
at the end of the shot and relies on systemd
to restart the process).
Current investigation shows that the dangling sockets gradually accumulate until the process that launches the actions (such as TCL) exits. When the process exits, the sockets are cleaned up. After I test Josh's workaround, I will resume looking for a fix that closes the sockets when each action thread finishes, instead of waiting until the entire process exits.
Hi @smithsp and @sflanagan -- I have tested Josh's recommended workaround and it is A-OK.
Here are the details of the experiment . . .
- There are two action nodes in a dinky tree named zzz. Each action node just sleeps for 10 seconds.
- The mdsip access to the zzz_<shot> trees is on the usual port 8000.
- The mdsip process for actions (aka action server) is on port 8100 and configured as a systemd service.
- The systemd service file is similar to GA's, but also has Restart=always in the [Service] section.
- Then these TCL commands are executed (as per Josh's reproduce case).
TCL> set tree zzz
TCL> create pulse 1
TCL> set tree zzz /shot=1
TCL> dispatch /build
TCL> dispatch/phase/log init
- After ~20 seconds, control returns to the TCL> prompt.
- In another terminal window, run lsof and grep for the action server on port 8100. Note the dangling TCP sockets. (They do get cleaned up if you exit TCL. However, for this example we wish to remain in TCL.)
mdsip 3899986 mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3899986 mdsplus 3u IPv4 30142804 0t0 TCP *:8100/tcp (LISTEN)
mdsip 3899986 mdsplus 6u IPv4 30155996 0t0 TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip 3899986 mdsplus 7u IPv4 30144205 0t0 TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip 3899986 mdsplus 9u IPv4 30156000 0t0 TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip 3899986 3900283 mdsip mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3899986 3900283 mdsip mdsplus 3u IPv4 30142804 0t0 TCP *:8100/tcp (LISTEN)
mdsip 3899986 3900283 mdsip mdsplus 6u IPv4 30155996 0t0 TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip 3899986 3900283 mdsip mdsplus 7u IPv4 30144205 0t0 TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip 3899986 3900283 mdsip mdsplus 9u IPv4 30156000 0t0 TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip 3899986 3900284 mdsip mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3899986 3900284 mdsip mdsplus 3u IPv4 30142804 0t0 TCP *:8100/tcp (LISTEN)
mdsip 3899986 3900284 mdsip mdsplus 6u IPv4 30155996 0t0 TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip 3899986 3900284 mdsip mdsplus 7u IPv4 30144205 0t0 TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip 3899986 3900284 mdsip mdsplus 9u IPv4 30156000 0t0 TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip 3899986 3900285 mdsip mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3899986 3900285 mdsip mdsplus 3u IPv4 30142804 0t0 TCP *:8100/tcp (LISTEN)
mdsip 3899986 3900285 mdsip mdsplus 6u IPv4 30155996 0t0 TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip 3899986 3900285 mdsip mdsplus 7u IPv4 30144205 0t0 TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip 3899986 3900285 mdsip mdsplus 9u IPv4 30156000 0t0 TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
- Then at the TCL> prompt, type this: stop server localhost:8100
- That kills the current action server.
- Whereupon systemd automatically restarts the service. It remains on port 8100, but of course has a new process ID (i.e., compare the following process ID with the one above).
root@mfews-mwinkel:/home/mwinkel/x_markw/config# ps -ef | fgrep mdsip
mdsplus 3900338 1 0 18:57 ? 00:00:00 /usr/local/xmw_mds/mdsplus/bin/mdsip -s -p 8100 -h /tmp/xmw_ga_2731/mdsip.hosts -c 9
- And as expected, lsof shows that the new process has just a single TCP socket listening on port 8100.
root@mfews-mwinkel:/home/mwinkel/x_markw/config# lsof | fgrep 3900338 | fgrep TCP
mdsip 3900338 mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3900338 mdsplus 3u IPv4 30160993 0t0 TCP *:8100/tcp (LISTEN)
- While still at the TCL> prompt, run a new shot (i.e., repeat all the steps listed above) and observe the same behavior.
In short, Josh's workaround works fine on my Ubuntu 20 development system.
Let us know if you have any problems with this workaround on GA's RHEL8 server.
Hi @smithsp and @sflanagan -- Root cause of the dangling sockets in mdstcl
has been found. When "actions" complete, there is a coordination problem between two threads. Conceptually, one thread attempts to delete the associated socket, but can't because the other thread still has a reference to it. (That is a generalization; the details are a bit different.)
It will take us a few days to fix the issue (i.e., iterate on the design, manual testing, and peer review).
While investigating this issue, I have also been using the systemd workaround (so I have a clean slate for each experiment). The workaround is working fine on my development system.
Let me know if you have any problems with the systemd workaround on GA's server.
Addendum: This post (and many of the following posts) contains initial conjectures and is thus wrong. For the final summary of the investigation, refer to the post at this link: https://github.com/MDSplus/mdsplus/issues/2731#issuecomment-2067188950
Hi @sflanagan -- How many "actions" are issued when running a single DIII-D shot? And how many "action servers" are involved? And how many shots per day?
Would like that information so that I can design a test case that is many times more load on the "action server" than the DIII-D load.
Assuming, as an upper bound, that everything is enabled and ops is having a long and/or productive run day...
How many "actions" are issued when running a single DIII-D shot?
~300
And how many "action servers" are involved?
1
And how many shots per day?
~50
Let me know if you have any problems with the systemd workaround on GA's server.
I redesigned my mds-8002
systemd file to add:
User=mdsadmin
Restart=always
ExecStartPost=/usr/local/mdsplus/local/mds-8002-post.sh
Where the mds-8002-post.sh
just automatically sets up the dispatch table, shot number(s), opens the relevant shot tree(s), etc... and then triggers a "test" action node.
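(Roughly, a sketch of the idea -- the actual script, commands, and values are site-specific and differ from this:)
#!/bin/sh
# Sketch only: send setup commands to the restarted dispatcher on 8002
# and then trigger a "test" action. Tree name and shot number are placeholders.
MDSTCL=/usr/local/mdsplus/bin/mdstcl
$MDSTCL dispatch/command/server=localhost:8002 set tree d3d /shot=123456
$MDSTCL dispatch/command/server=localhost:8002 dispatch/build
$MDSTCL dispatch/command/server=localhost:8002 dispatch/phase test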
Seems "fine" as a temporary workaround, but it does leave me open to skipping/missing a shot if the dispatcher service is restarting while the DAQ system is trying to start a new shot cycle. Trying to reduce that risk is next up on my agenda (if I can).
Hi @sflanagan -- Thanks for the information about the number of actions in a DIII-D shot. And for the news regarding the systemd
workaround.
Now have a prototype fix. Next step is to do some manual stress testing, followed by peer review. Will take a few more days before there is a final fix that will be merged to alpha and cherry-picked to the GA branch.
Hi @smithsp and @sflanagan,
Summary This post is a summary of recent emails with @smithsp, plus provides additional details.
This problem was introduced with PR #2288 on 19-May-2021. The bug appears to only affect the dispatch
command of TCL
. Thus, it is likely that the associated fix is low risk. (Testing is presently underway to confirm that is indeed true.)
Details
When mdstcl
(which is a client) dispatches an action, it also creates a "receiver" thread to receive the eventual status reply regarding that action. (The action is sent to an "action server" that then processes the action and sends a reply back to the mdstcl
client.) For each action that is dispatched, mdstcl
thus opens up two sockets: one is the "to" socket used to send the action to the action server, and the other is the "from" socket used to receive the reply from the action server.
The issue is that the "connection ID table" in the mdstcl
client should be accessible by all threads, but unfortunately isn't. Thus, when the receiver thread receives the SrvJobFINISHED
reply, it can close its "from" socket, but is unable to access the main thread's "connection ID table" in order to close the "to" socket. Thus, for each action dispatched, the current code leaves a dangling "to" socket.
The two workarounds clean up the dangling "to" sockets. Adding an "exit" statement at the end of a TCL
script that dispatches actions terminates the TCL
process thus all of its sockets are closed. Alternatively, using the stop server
command of TCL will cause the action server to reboot (if systemd or xinetd is configured correctly) which also closes all dangling sockets.
The proposed fix does not alter any logic regarding sockets and connections. All that it does is allow the "receiver" thread to access the "connection ID table" so that the "to" socket can be closed. Preliminary testing confirms that the fix works and that mdstcl
is no longer leaking a "to" socket for each dispatched action. Thus, the workarounds are no longer needed.
Low Risk? Whenever fiddling with multi-threaded code, there is always a risk of race conditions being created. Hence the focus on extensive testing of the fix prior to release to GA. The goal of the testing is to confirm that all critical sections are protected with locks. Initial testing is promising; no threading issues have been encountered.
And because the problem appears to be limited to the dispatch
command of mdstcl
, that also reduces the risk associated with the fix. It is unlikely to affect other aspects of the MDSplus suite of tools.
Note also that this bug has been present in the alpha
branch for ~3 years, and that no other customers have reported problems. That tends to confirm that the problem doesn't affect the entire socket communication foundation of MDSplus. And likely also means that the fix will have lower risks than expected. (Regardless, extensive testing is needed to confirm that is so.)
Clarification
This issue affects all commands sent to an "action server". That includes the following activities (i.e., the list gives descriptions; the actual TCL
statements are likely different).
- abort server
- close trees
- create pulse
- dispatch action
- dispatch command
- monitor check-in
- monitor
- logging
- stop server
Hi @sflanagan -- Questions about your site's single "action server" . . .
- What version of MDSplus does it run?
- What Operating System and version?
- Does the "action server" MDSplus/OS version match that of the "dispatch server" (apparently Atlas)?
- Is it a dedicated "action server" that only processes actions? (Am curious if it is used for other MDSplus activities such as data analysis and/or to run non-MDSplus applications.)
Answers to the above questions will enable us to set up a cross-version testing environment, if needed.
A stress test of 1,200 actions (400 actions per shot, 3 shots) reveals that there is an unprotected critical section that can cause a segfault. The segfault also occurs in the alpha
branch thus was not introduced by the proposed fix.
Adding a delay of 0.01 seconds to each action eliminates the segfault, but causes the proposed fix to leave ~10 dangling sockets (i.e., it closes the other 1,190 "to" sockets). Additional investigation is required to determine if the dangling sockets are caused by a second unprotected critical section. And if so, whether it was introduced with the proposed fix.
Increasing the delay to 0.1 seconds allows the proposed fix to work fine -- no dangling sockets.
The proposed fix successfully handled 20,000 actions without leaking sockets. (However, each action had a 0.1 second delay.) That was 50 back-to-back shots with 400 actions per shot. The test was designed to generate a spike load on the "action server".
An even more extreme stress test does create a segfault in both the current alpha
and the proposed fix. Preliminary investigation indicates that the segfault is probably a pre-existing issue and was not introduced by the proposed fix. The segfault investigation continues (and perhaps will be tracked in a new GitHub issue).
Summary is that the proposed fix is now working well enough that it is moving on to the peer review stage. However, it will still be a few days before the fix is ready for release to GA.
Thanks for the update @mwinkel-dev .
Took all of Saturday to figure out that there is a "race" condition associated with this fix.
When mdstcl
dispatches an action to the "action server", the server sends back two messages: "handshake" and "done". And they are received by two different threads.
- The "handshake" message is when the server acknowledges that it has received an action.
- The "done" message is sent when the server has processed the action.
Note: The names, "handshake" and "done", are just for clarity in this post. Those names don't appear in the source code.
In practice, there is usually a fair amount of time between the two messages (e.g., it takes time for an action to configure a digitizer). And that time interval enables the code to process the "handshake" before the "done".
However, the spike load test has an action that can be done instantly (i.e., just evaluating the number 1
). And thus on a heavily loaded action server, the "done" message can be received and processed by mdstcl
before the "handshake" is fully processed. Because processing the "done" message deletes the associated connection, the "handshake" code gets mighty confused when its connection id gets clobbered. Hence the errors and occasional segfaults.
Conceptually, the ideal fix to avoid the "race" condition is to use a lock on every single connection object. However, it will likely be a challenge to retrofit that into the existing code. (There is a risk of subtle bugs creeping in that could result in deadlock.)
The practical fix is probably just to add a 100 ms delay before processing the "done" message and closing the connection. Experiments show that this simple approach almost always avoids the race condition. (It leaked 2 or 3 sockets out of 400 actions processed.)
However, it is important that the dispatch cycle roughly match the added delay, so that connections are created and destroyed at approximately the same rate. If the delay is too long, and there are hundreds of actions dispatched at a high rate, then all sockets (allowed by the operating system) can be consumed thereby causing mdstcl
and/or the action server to crash.
An alternative concept for matching connection creation / destruction rates is for mdstcl
to keep a running tally of open connections. And if the count hits a user specified threshold (e.g., 500 active connections max), then it pauses dispatching actions. As actions are drained from the server's input queue, mdstcl
will close connections accordingly. And thus as each action finishes, mdstcl
will be able to dispatch another action.
Combining both approaches would allow mdstcl
to avoid both problems: race condition and no more sockets.
There is a fatal flaw in the proposed fix (call it "Option A") described in the preceding posts: deadlock.
The existing architecture of the mdstcl dispatch
feature consists of the following:
- a "connection id" table that contains information about each connection (sockets used, connection state, and so forth),
- a main thread that also dispatches actions, and
- many "receiver" threads, one per each action sent to the action server.
Option A treats the "connection id" table as a resource shared by all threads, and thus locks are used to protect access when adding or deleting connections. However, in order for that to work well, the main thread needs to be throttled so that the receiver threads have an opportunity to access the table too. Note though that if a receiver thread obtains the lock and then dies before releasing it, the main thread is permanently blocked from dispatching more actions; thus mdstcl deadlocks. (There are likely other edge cases that would also create deadlock.)
Although the principles and architectures for high-performance, robust, multi-threaded web servers and clients are now well known, the challenge is in retrofitting the existing action server architecture to accommodate those features.
Option B might be a more practical and less risky option.
It has these features . . .
- the main dispatch thread passes a connection's socket to the associated receiver thread
- thus the receiver threads can close sockets without having to access the "connection id" table
- that eliminates the need for locking on the table (because it is only accessed by the main thread)
- it only allows ~N active connections at a time, where N is chosen to be well below the number of sockets allowed by the operating system,
- N is user configurable (likely via an environment variable)
Advantages:
- doesn't immediately consume all available sockets when dispatching hundreds of actions
- doesn't leak sockets
- avoids deadlock
Disadvantages:
- when the action server is finished processing an action, its entry in the "connection id" table becomes obsolete
- thus the "connection id" table constantly grows
Analysis:
- suppose that 20,000 actions are dispatched in a day
- the memory required to store a "connection id" table of 20,000 entries is fairly small -- probably just a few MB
- the growth of the table is highly unlikely to force a reboot of the server (as is the case if run out of sockets)
- a cron job can be run each evening (when shots aren't being run) to restart the action server (see the sketch after this list), thus each morning the action server starts with an empty "connection id" table
- the current shot's connections are added to the head of the table, thus table growth during the day won't have much impact on the speed of table lookups or adding new entries
- this option involves little change to the existing code, plus avoids deadlock, thus is less risky than Option A
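A minimal sketch of such a cron entry, assuming the action server runs as the systemd service mds-8003 (as on atlas) and that 02:00 is outside of operations:
# Hypothetical root crontab line: restart the action server nightly so each
# day starts with an empty "connection id" table.
0 2 * * * /usr/bin/systemctl restart mds-8003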
Summary:
- The long-term solution is to write a new dispatcher using a more robust architecture.
- However, to meet the schedule of DIII-D's upcoming campaign, the fastest and safest approach is to make minimal changes to the existing mdstcl dispatch feature.
- Option B thus shows more promise than Option A.
Now have a rough prototype of Option B. It dispatched 3,600 actions at full speed and only leaked one socket. The Linux top
command shows a negligible memory footprint (i.e., as expected the growth of the "connection id" table is insignificant).
On Monday, will clean up the code and do more extensive testing. And if the results look good, will move Option B to the peer review stage of the development process.
Option B prototype dispatched 21,600 actions with just 35 sockets leaked (a leak rate of ~0.2%). Actions were dispatched at the rate of 10 actions per second. Rough estimate is that the "connection ID" table grew to ~187 MB.
However, if the dispatch throttle of 10 actions/second is removed, then mdstcl
can deadlock when dispatching actions. That is a latent bug that has surely existed in the code for years; it is unrelated to the fix for this socket issue. (Nonetheless, it will be investigated before this socket fix is released to GA.)
When mdstcl
deadlocked during testing, I usually did not also examine the "action server". However, the one instance when I did check revealed that the "action server" remained active (i.e., opened a new terminal window, started another mdstcl
session and dispatched more actions). When additional mdstcl
deadlocks are observed, more tests will be done to determine if the "action server" is always unaffected.
There is likely another mode that causes mdstcl dispatch
to leak sockets. It is associated with the queue length on the "action server".
Here are the details . . .
Because mdstcl
dispatches actions and processes the associated "done" messages, it can estimate the number of actions that are queued up on the "action server". (For each queued action, two sockets are used.)
Configured mdstcl
so it would maintain ~20 actions in the server's queue. And then dumped 400 actions into the dispatch hopper. It quickly ramped up to ~20 dispatched actions, and thereafter maintained steady state. When an action finished, it would dispatch another action and maintain the tally at ~20 dispatched-and-queued actions. The logs for mdstcl
and the "action server" both indicated all 400 actions were processed. So was surprised to see that lsof
reported ~20 leaked sockets.
Repeated the experiment with the limit set at ~100 actions. And observed that ~100 sockets leaked.
On Tuesday, will dig into this mystery. (There is a chance that there is a flaw in the design / implementation of the experiment.)
This problem was introduced with PR #2288 on 19-May-2021. The bug appears to only affect the
dispatch
command ofTCL
. Thus, it is likely that the associated fix is low risk. (Testing is presently underway to confirm that is indeed true.)
I slowly remember the struggle we had sorting out the existing problem at the time while keeping backward compatibility. The issue was that the old protocol required a reconnection on an offered port to send the result of the action.
The original goals that I remember were to support:
- cancelation of dispatched actions by event (e.g. user intervention, Ctrl+C)
- timeout
- parallel execution of actions
Hi @zack-vii -- Thanks for the history of the change. Much appreciated!
I will expand my experiments to see if my proposed fix for the leaking sockets breaks any of the features you listed above.