
MDSplus dispatcher opens lots of files

smithsp opened this issue 11 months ago • 31 comments

Affiliation GA/DIII-D

Version(s) Affected Our production server, atlas, currently suffers this bug. @sflanagan will provide the version at some point.

Platform RHEL8

Describe the bug The MDSplus dispatcher leaves files open, accumulating until it hits the open-file limit. Somehow the increased limit is not honored by the OS or the dispatcher, even after attempting to raise it in various ways in various configuration files.

To Reproduce Our production server, atlas, currently suffers this bug. Details to be provided at some point by @sflanagan.

Expected behavior I expect that the dispatcher will close its own files as it goes, without accumulating open files.

Additional context Any fix provided needs to be on a branch starting from the version that is currently on atlas.

smithsp avatar Mar 26 '24 13:03 smithsp

I think you need to add dispatch/close to the script that is doing the dispatch/build.

TCL> help dispatch/close

  Command: DISPATCH/CLOSE
  Purpose: Instruct MDSplus actions server(s) to close all open trees.
  Format: DISPATCH/CLOSE[/SERVER=(server1,server2...)]

  Description:

  The DISPATCH/CLOSE can be used to instruct an MDSplus action server to close all MDSplus
  trees the server currently has open. If the /SERVER qualifier is not included to specify
  any action servers then the current dispatch table (See: DISPATCH/BUILD and
  DISPATCH/PHASE commands) is inspected to construct a list of servers that had actions
  dispatched to them to send close instructions to. The DISPATCH/CLOSE command is typically
  used at the end of a data acquisition cycle to instruct the action server to close
  any open trees.

TCL> 
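For example, a dispatch script would then look something like this (a minimal sketch; the tree name, shot number, and phase are placeholders, not GA's actual setup):

TCL> set tree mytree /shot=12345
TCL> dispatch/build
TCL> dispatch/phase INIT
TCL> dispatch/close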

I am happy to meet this afternoon to discuss...

joshStillerman avatar Mar 26 '24 14:03 joshStillerman

According to Mitchell's post on another issue (Issue 2704, post of 13-Feb-2024), Atlas has been running alpha-7.139.59 since November 2023, and Atlas is on RHEL8.

mwinkel-dev avatar Mar 26 '24 14:03 mwinkel-dev

There's likely more to this than open files, as enabling core dumps against the systemd services running the "dispatcher" (i.e. mdsip -s on atlas:8002) and the "mdsloader" (i.e. mdsip -s on atlas:8003) turns up some potential issues in mdstcpip/mdsipshr/ConnectToMds.c:?

I haven't looked at this too closely, but for the sake of notes I'll just quickly post the coredumpctl gdb <pid>; bt of the last core dump produced (this one is from the port 8002 side):

[root@atlas ~]# coredumpctl gdb 865034
           PID: 865034 (mdsip)
           UID: 117 (mdsadmin)
           GID: 511 (mdsdev)
        Signal: 11 (SEGV)
     Timestamp: Mon 2024-03-25 21:44:26 PDT (12h ago)
  Command Line: /usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts
    Executable: /usr/local/mdsplus/bin64/mdsip
 Control Group: /
         Slice: -.slice
       Boot ID: 7392c4d0130e43a9a07fbde2f35cd9c7
    Machine ID: 1a91ee619a7f42d0b73da694e83b0be3
      Hostname: atlas.gat.com
       Storage: /var/lib/systemd/coredump/core.mdsip.117.7392c4d0130e43a9a07fbde2f35cd9c7.865034.1711428266000000.lz4
       Message: Process 865034 (mdsip) of user 117 dumped core.

                Stack trace of thread 866623:
                #0  0x00001468186e2c61 do_login (libMdsIpShr.so)
                #1  0x000014681530a01e server_connect (libMdsServerShr.so)
                #2  0x000014681530ba91 ServerConnect (libMdsServerShr.so)
                #3  0x000014681530bc57 ServerSendMessage (libMdsServerShr.so)
                #4  0x000014681530cdd7 ServerDispatchAction (libMdsServerShr.so)
                #5  0x0000146815307ec6 dispatch (libMdsServerShr.so)
                #6  0x000014681530834c ServerDispatchPhase (libMdsServerShr.so)
                #7  0x0000146814868521 TclDispatch_phase (libtcl_commands.so)
                #8  0x00001468150f1a9a dispatchToHandler (libMdsdcl.so)
                #9  0x00001468150f30b3 cmdExecute (libMdsdcl.so)
                #10 0x00001468150f6f04 mdsdcl_do_command_extra_args (libMdsdcl.so)
                #11 0x00001468150f2abb mdsdcl_do_command (libMdsdcl.so)
                #12 0x0000146815304449 DoSrvCommand (libMdsServerShr.so)
                #13 0x0000146815304b29 WorkerThread (libMdsServerShr.so)
                #14 0x000014681808b1ca start_thread (libpthread.so.0)
                #15 0x0000146817cf7e73 __clone (libc.so.6)

                Stack trace of thread 865034:
                #0  0x0000146817de4a3f __select (libc.so.6)
                #1  0x0000146816d8e877 listen_loop (libMdsIpTCP.so)
                #2  0x0000146816d8ecc6 io_listen (libMdsIpTCP.so)
                #3  0x00000000004008e8 main (mdsip)
                #4  0x0000146817cf8d85 __libc_start_main (libc.so.6)
                #5  0x000000000040097e _start (mdsip)

                Stack trace of thread 866830:
                #0  0x000014681809145c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x0000146815306f07 send_monitor_pop (libMdsServerShr.so)
                #2  0x00001468153076ab send_monitor_thread (libMdsServerShr.so)
                #3  0x000014681808b1ca start_thread (libpthread.so.0)
                #4  0x0000146817cf7e73 __clone (libc.so.6)

                Stack trace of thread 1046533:
                #0  0x0000146817de4a3f __select (libc.so.6)
                #1  0x0000146816d8e154 io_recv_to (libMdsIpTCP.so)
                #2  0x00001468186d9304 get_bytes_to (libMdsIpShr.so)
                #3  0x00001468186d93e2 get_bytes_to (libMdsIpShr.so)
                #4  0x00001468186e2eba ConnectionDoMessage (libMdsIpShr.so)
                #5  0x0000146816d8d492 client_thread (libMdsIpTCP.so)
                #6  0x000014681808b1ca start_thread (libpthread.so.0)
                #7  0x0000146817cf7e73 __clone (libc.so.6)

                Stack trace of thread 866831:
                #0  0x0000146817de4a3f __select (libc.so.6)
                #1  0x000014681530b42e receiver_thread (libMdsServerShr.so)
                #2  0x000014681808b1ca start_thread (libpthread.so.0)
                #3  0x0000146817cf7e73 __clone (libc.so.6)

GNU gdb (GDB) Red Hat Enterprise Linux 8.2-19.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/mdsplus/bin64/mdsip...done.
[New LWP 866623]
[New LWP 865034]
[New LWP 866830]
[New LWP 1046533]
[New LWP 866831]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  do_login (c=0x1) at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:78
78      /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c: No such file or directory.
[Current thread is 1 (Thread 0x146814c71700 (LWP 866623))]
Missing separate debuginfos, use: yum debuginfo-install mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
(gdb) bt
#0  do_login (c=0x1) at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:78
#1  ConnectToMds (hostin=hostin@entry=0x1468080fdea0 "atlas.gat.com:8003") at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:161
#2  0x000014681530a01e in server_connect (server=server@entry=0x1468080fdea0 "atlas.gat.com:8003", addr=1755710912, port=<optimized out>)
    at /git/mdsplus/servershr/ServerSendMessage.c:529
#3  0x000014681530ba91 in ServerConnect (server_in=server_in@entry=0x146814c70530 "mdsloader") at /git/mdsplus/servershr/ServerSendMessage.c:557
#4  0x000014681530bc57 in ServerSendMessage (msgid=msgid@entry=0x0, server=server@entry=0x146814c70530 "mdsloader", op=op@entry=2,
    retstatus=retstatus@entry=0x146808999338, lock=lock@entry=0x146808999358, conid_out=conid_out@entry=0x14680899933c,
    callback_done=0x146815307b60 <action_done>, callback_param=0xa7, callback_before=0x146815306d50 <before>, numargs_in=3)
    at /git/mdsplus/servershr/ServerSendMessage.c:339
#5  0x000014681530cdd7 in ServerDispatchAction (id=id@entry=0x0, server=0x146814c70530 "mdsloader", tree=0x146808992fb4 "D3D", shot=<optimized out>,
    nid=<optimized out>, ast=ast@entry=0x146815307b60 <action_done>, astprm=0xa7, retstatus=0x146808999338, lock=0x146808999358,
    socket=0x14680899933c, before_ast=0x146815306d50 <before>) at /git/mdsplus/servershr/ServerDispatchAction.c:67
#6  0x0000146815307ec6 in dispatch (i=i@entry=167) at /git/mdsplus/servershr/ServerDispatchPhase.c:201
#7  0x000014681530834c in ServerDispatchPhase (id=id@entry=0x0, vtable=<optimized out>, phasenam=phasenam@entry=0x1468080b36b0 "TEST",
    noact_in=noact_in@entry=0 '\000', sync=sync@entry=1, output_rtn=output_rtn@entry=0x146814867740 <printIt>, monitor=0x0)
    at /git/mdsplus/servershr/ServerDispatchPhase.c:536
#8  0x0000146814868521 in TclDispatch_phase (ctx=<optimized out>, error=0x146814c70c28, output=<optimized out>) at /git/mdsplus/tcl/tcl_dispatch.c:411
#9  0x00001468150f1a9a in dispatchToHandler (getlineinfo=0x0, getline=0x0, output=0x146814c70c20, error=0x146814c70c28, prompt=<optimized out>,
    cmdDef=<optimized out>, cmd=0x1468080fdf60, image=<optimized out>) at /git/mdsplus/mdsdcl/cmdExecute.c:935
#10 processCommand (docList=docList@entry=0x146808011e50, verbNode_in=<optimized out>, cmd=cmd@entry=0x1468080065c0,
    cmdDef=cmdDef@entry=0x146808006610, prompt=prompt@entry=0x146814c70c10, error=error@entry=0x146814c70c28, output=0x146814c70c20, getline=0x0,
    getlineinfo=0x0) at /git/mdsplus/mdsdcl/cmdExecute.c:1165
#11 0x00001468150f30b3 in cmdExecute (cmd=0x1468080065c0, prompt_out=prompt_out@entry=0x0, error_out=error_out@entry=0x146814c70d70,
    output_out=output_out@entry=0x146814c70d78, getline=getline@entry=0x0, getlineinfo=getlineinfo@entry=0x0) at /git/mdsplus/mdsdcl/cmdExecute.c:1453
#12 0x00001468150f6f04 in mdsdcl_do_command_extra_args (command=0x146810008db0 "DISPATCH/PHASE TEST", prompt=prompt@entry=0x0,
    error=error@entry=0x146814c70d70, output=output@entry=0x146814c70d78, getline=getline@entry=0x0, getlineInfo=getlineInfo@entry=0x0)
    at mdsdcl/yylex/cmdParse.y:213
#13 0x00001468150f2abb in mdsdcl_do_command (command=<optimized out>) at /git/mdsplus/mdsdcl/cmdExecute.c:1350
#14 0x0000146815304449 in DoSrvCommand (job_in=job_in@entry=0x1468100430b0) at /git/mdsplus/servershr/ServerQAction.c:642
#15 0x0000146815304b29 in WorkerThread (arg=<optimized out>) at /git/mdsplus/servershr/ServerQAction.c:786
#16 0x000014681808b1ca in start_thread () from /usr/lib64/libpthread.so.0
#17 0x0000146817cf7e73 in clone () from /usr/lib64/libc.so.6
(gdb)

sflanagan avatar Mar 26 '24 17:03 sflanagan

EDIT:

Note that as of the time the following lsof was pulled, those PIDs are for:

  • dispatcher: systemd service mds-8002 on pid 3730021
  • mdsloader: systemd service mds-8003 on pid 3730036
[root@atlas system]# ps -fp 3730021
UID          PID    PPID  C STIME TTY          TIME CMD
mdsadmin 3730021       1  0 10:19 ?        00:00:01 /usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts -c 0

[root@atlas system]# ps -fp 3730036
UID          PID    PPID  C STIME TTY          TIME CMD
mdsadmin 3730036       1  0 10:19 ?        00:00:01 /usr/local/mdsplus/bin/mdsip -s -p 8003 -h /etc/mdsip.hosts -c 0


A current lsof -u mdsadmin | grep TCP:

[root@atlas ~]# lsof -u mdsadmin | grep TCP
mdsip   3730021 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3730021 mdsadmin    3u  IPv4 12579304       0t0       TCP *:8002 (LISTEN)
mdsip   3730021 mdsadmin   22u  IPv4 12298036       0t0       TCP *:8801 (LISTEN)
mdsip   3730021 mdsadmin   28u  IPv4 12428281       0t0       TCP atlas.gat.com:60656->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  102u  IPv4 12298128       0t0       TCP atlas.gat.com:60622->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  103u  IPv4 12298039       0t0       TCP atlas.gat.com:40034->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  104u  IPv4 12410125       0t0       TCP atlas.gat.com:8801->atlas.gat.com:41562 (ESTABLISHED)
mdsip   3730021 mdsadmin  105u  IPv4 12298131       0t0       TCP atlas.gat.com:60636->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  106u  IPv4 12648855       0t0       TCP atlas.gat.com:51018->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  107u  IPv4 12648858       0t0       TCP atlas.gat.com:51026->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  108u  IPv4 12298136       0t0       TCP atlas.gat.com:60650->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  109u  IPv4 12298160       0t0       TCP atlas.gat.com:52952->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  110u  IPv4 12298181       0t0       TCP atlas.gat.com:49554->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  111u  IPv4 12420686       0t0       TCP atlas.gat.com:54658->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  112u  IPv4 12420672       0t0       TCP atlas.gat.com:49558->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  115u  IPv4 12420676       0t0       TCP atlas.gat.com:57096->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  116u  IPv4 12298188       0t0       TCP atlas.gat.com:57098->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  117u  IPv4 12420679       0t0       TCP atlas.gat.com:57114->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  118u  IPv4 12420682       0t0       TCP atlas.gat.com:57122->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  119u  IPv4 12298192       0t0       TCP atlas.gat.com:54668->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  120u  IPv4 12420690       0t0       TCP atlas.gat.com:38610->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  121u  IPv4 12298196       0t0       TCP atlas.gat.com:38614->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  122u  IPv4 12427862       0t0       TCP atlas.gat.com:43624->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  123u  IPv4 12427868       0t0       TCP atlas.gat.com:43626->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  124u  IPv4 12427871       0t0       TCP atlas.gat.com:43630->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  125u  IPv4 12427877       0t0       TCP atlas.gat.com:43644->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  126u  IPv4 12427884       0t0       TCP atlas.gat.com:43650->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  128u  IPv4 12427961       0t0       TCP atlas.gat.com:47100->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  129u  IPv4 12298215       0t0       TCP atlas.gat.com:47072->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  130u  IPv4 12427957       0t0       TCP atlas.gat.com:47092->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  131u  IPv4 12427954       0t0       TCP atlas.gat.com:47086->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  132u  IPv4 12428079       0t0       TCP atlas.gat.com:47680->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  133u  IPv4 12428055       0t0       TCP atlas.gat.com:33592->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  134u  sock      0,9       0t0  12428052 protocol: TCP
mdsip   3730021 mdsadmin  135u  IPv4 12428059       0t0       TCP atlas.gat.com:33608->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  136u  IPv4 12428094       0t0       TCP atlas.gat.com:46216->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  137u  IPv4 12428232       0t0       TCP atlas.gat.com:46218->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  138u  IPv4 12428235       0t0       TCP atlas.gat.com:46228->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  139u  IPv4 12428240       0t0       TCP atlas.gat.com:46234->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  140u  IPv4 12648863       0t0       TCP atlas.gat.com:51030->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  142u  IPv4 12428244       0t0       TCP atlas.gat.com:46250->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  143u  IPv4 12428251       0t0       TCP atlas.gat.com:46266->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  144u  IPv4 12428270       0t0       TCP atlas.gat.com:60642->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  145u  IPv4 12728337       0t0       TCP atlas.gat.com:46278->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  146u  IPv4 12728340       0t0       TCP atlas.gat.com:46282->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  147u  IPv4 12728343       0t0       TCP atlas.gat.com:46292->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  148u  IPv4 12428267       0t0       TCP atlas.gat.com:46306->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  149u  IPv4 12428273       0t0       TCP atlas.gat.com:60654->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  153u  IPv4 12740616       0t0       TCP atlas.gat.com:60674->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  154u  IPv4 12428284       0t0       TCP atlas.gat.com:60666->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  155u  IPv4 12428287       0t0       TCP atlas.gat.com:60670->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  157u  IPv4 12728345       0t0       TCP atlas.gat.com:60684->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  158u  IPv4 12715420       0t0       TCP atlas.gat.com:60694->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  159u  IPv4 12715423       0t0       TCP atlas.gat.com:60710->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  160u  IPv4 12715428       0t0       TCP atlas.gat.com:60720->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  161u  IPv4 12715431       0t0       TCP atlas.gat.com:44726->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  162u  IPv4 12715435       0t0       TCP atlas.gat.com:44736->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  163u  IPv4 12401444       0t0       TCP atlas.gat.com:44744->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  164u  IPv4 12401452       0t0       TCP atlas.gat.com:44762->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  165u  IPv4 12401459       0t0       TCP atlas.gat.com:44772->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  166u  IPv4 12401449       0t0       TCP atlas.gat.com:44760->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  167u  IPv4 12401464       0t0       TCP atlas.gat.com:44788->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  168u  IPv4 12401467       0t0       TCP atlas.gat.com:44804->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  169u  IPv4 12715452       0t0       TCP atlas.gat.com:44806->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  170u  IPv4 12715460       0t0       TCP atlas.gat.com:EtherNet/IP-2->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  171u  IPv4 12715463       0t0       TCP atlas.gat.com:44832->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  172u  IPv4 12401471       0t0       TCP atlas.gat.com:57696->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  173u  IPv4 12401475       0t0       TCP atlas.gat.com:57708->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  174u  IPv4 12401480       0t0       TCP atlas.gat.com:57712->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  175u  IPv4 12401483       0t0       TCP atlas.gat.com:57720->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  177u  IPv4 12401488       0t0       TCP atlas.gat.com:57734->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  178u  IPv4 12401493       0t0       TCP atlas.gat.com:57750->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  179u  IPv4 12401498       0t0       TCP atlas.gat.com:57752->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  180u  IPv4 12401501       0t0       TCP atlas.gat.com:57760->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  181u  IPv4 12401505       0t0       TCP atlas.gat.com:57768->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  182u  IPv4 12401508       0t0       TCP atlas.gat.com:57782->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  183u  IPv4 12715507       0t0       TCP atlas.gat.com:57784->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  184u  IPv4 12715550       0t0       TCP atlas.gat.com:60928->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  185u  IPv4 12715558       0t0       TCP atlas.gat.com:60946->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  186u  IPv4 12715553       0t0       TCP atlas.gat.com:60932->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  188u  IPv4 12715562       0t0       TCP atlas.gat.com:60948->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  189u  IPv4 12715568       0t0       TCP atlas.gat.com:60964->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  190u  IPv4 12715573       0t0       TCP atlas.gat.com:60966->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  191u  IPv4 12401511       0t0       TCP atlas.gat.com:60980->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  192u  IPv4 12401515       0t0       TCP atlas.gat.com:60994->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  193u  IPv4 12401608       0t0       TCP atlas.gat.com:59376->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  194u  IPv4 12401605       0t0       TCP atlas.gat.com:59374->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730026 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3730026 mdsadmin    3u  IPv4 12568210       0t0       TCP *:8012 (LISTEN)
mdsip   3730036 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3730036 mdsadmin    3u  IPv4 12568211       0t0       TCP *:8003 (LISTEN)
mdsip   3730036 mdsadmin    4u  IPv4 12609768       0t0       TCP atlas.gat.com:8003->atlas.gat.com:40034 (ESTABLISHED)
mdsip   3730036 mdsadmin   61u  IPv4 12628995       0t0       TCP atlas.gat.com:53526->atlas.gat.com:sunwebadmin (CLOSE_WAIT)
mdsip   3730036 mdsadmin   62u  IPv4 12628998       0t0       TCP atlas.gat.com:41562->atlas.gat.com:8801 (ESTABLISHED)
mdsip   3730036 mdsadmin   63u  IPv4 12609794       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60622 (ESTABLISHED)
mdsip   3730036 mdsadmin   64u  IPv4 12609800       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60650 (ESTABLISHED)
mdsip   3730036 mdsadmin   65u  IPv4 12230182       0t0       TCP atlas.gat.com:8003->atlas.gat.com:52952 (ESTABLISHED)
mdsip   3730036 mdsadmin   66u  IPv4 12230189       0t0       TCP atlas.gat.com:8003->atlas.gat.com:49554 (ESTABLISHED)
mdsip   3730036 mdsadmin   67u  IPv4 12230192       0t0       TCP atlas.gat.com:8003->atlas.gat.com:49558 (ESTABLISHED)
mdsip   3730036 mdsadmin   68u  IPv4 12230203       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57096 (ESTABLISHED)
mdsip   3730036 mdsadmin   69u  IPv4 12230209       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57114 (ESTABLISHED)
mdsip   3730036 mdsadmin   70u  IPv4 12230212       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57122 (ESTABLISHED)
mdsip   3730036 mdsadmin   71u  IPv4 12230250       0t0       TCP atlas.gat.com:8003->atlas.gat.com:54658 (ESTABLISHED)
mdsip   3730036 mdsadmin   72u  IPv4 12230260       0t0       TCP atlas.gat.com:8003->atlas.gat.com:54668 (ESTABLISHED)
mdsip   3730036 mdsadmin   73u  IPv4 12230263       0t0       TCP atlas.gat.com:8003->atlas.gat.com:38610 (ESTABLISHED)
mdsip   3730036 mdsadmin   74u  IPv4 12230266       0t0       TCP atlas.gat.com:8003->atlas.gat.com:38614 (ESTABLISHED)
mdsip   3730036 mdsadmin   75u  IPv4 12715064       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43624 (ESTABLISHED)
mdsip   3730036 mdsadmin   76u  IPv4 12715069       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43626 (ESTABLISHED)
mdsip   3730036 mdsadmin   77u  IPv4 12715076       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43644 (ESTABLISHED)
mdsip   3730036 mdsadmin   78u  IPv4 12715079       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43650 (ESTABLISHED)
mdsip   3730036 mdsadmin   79u  IPv4 12230269       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47072 (ESTABLISHED)
mdsip   3730036 mdsadmin   80u  IPv4 12715164       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47086 (ESTABLISHED)
mdsip   3730036 mdsadmin   81u  IPv4 12715167       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47092 (ESTABLISHED)
mdsip   3730036 mdsadmin   82u  IPv4 12715170       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47100 (ESTABLISHED)
mdsip   3730036 mdsadmin   83u  IPv4 12582098       0t0       TCP atlas.gat.com:8003->atlas.gat.com:33592 (ESTABLISHED)
mdsip   3730036 mdsadmin   84u  IPv4 12582101       0t0       TCP atlas.gat.com:8003->atlas.gat.com:33608 (ESTABLISHED)
mdsip   3730036 mdsadmin   85u  IPv4 12582112       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47680 (ESTABLISHED)
mdsip   3730036 mdsadmin   86u  IPv4 12582115       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46216 (ESTABLISHED)
mdsip   3730036 mdsadmin   87u  IPv4 12582118       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46218 (ESTABLISHED)
mdsip   3730036 mdsadmin   88u  IPv4 12582124       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46234 (ESTABLISHED)
mdsip   3730036 mdsadmin   89u  IPv4 12582127       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46250 (ESTABLISHED)
mdsip   3730036 mdsadmin   90u  IPv4 12582130       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46266 (ESTABLISHED)
mdsip   3730036 mdsadmin   91u  IPv4 12625488       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46278 (ESTABLISHED)
mdsip   3730036 mdsadmin   92u  IPv4 12582136       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60642 (ESTABLISHED)
mdsip   3730036 mdsadmin   93u  IPv4 12582139       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60654 (ESTABLISHED)
mdsip   3730036 mdsadmin   94u  IPv4 12582142       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60656 (ESTABLISHED)
mdsip   3730036 mdsadmin   95u  IPv4 12582151       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60674 (ESTABLISHED)
mdsip   3730036 mdsadmin   96u  IPv4 12582154       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60694 (ESTABLISHED)
mdsip   3730036 mdsadmin   97u  IPv4 12582159       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60710 (ESTABLISHED)
mdsip   3730036 mdsadmin   98u  IPv4 12582162       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60720 (ESTABLISHED)
mdsip   3730036 mdsadmin   99u  IPv4 12582165       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44726 (ESTABLISHED)
mdsip   3730036 mdsadmin  100u  IPv4 12582168       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44736 (ESTABLISHED)
mdsip   3730036 mdsadmin  101u  IPv4 12625500       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44744 (ESTABLISHED)
mdsip   3730036 mdsadmin  102u  IPv4 12625504       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44760 (ESTABLISHED)
mdsip   3730036 mdsadmin  103u  IPv4 12625510       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44772 (ESTABLISHED)
mdsip   3730036 mdsadmin  104u  IPv4 12625513       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44788 (ESTABLISHED)
mdsip   3730036 mdsadmin  105u  IPv4 12625516       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44804 (ESTABLISHED)
mdsip   3730036 mdsadmin  106u  IPv4 12582173       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44806 (ESTABLISHED)
mdsip   3730036 mdsadmin  107u  IPv4 12582177       0t0       TCP atlas.gat.com:8003->atlas.gat.com:EtherNet/IP-2 (ESTABLISHED)
mdsip   3730036 mdsadmin  108u  IPv4 12625519       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57696 (ESTABLISHED)
mdsip   3730036 mdsadmin  109u  IPv4 12625522       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57708 (ESTABLISHED)
mdsip   3730036 mdsadmin  110u  IPv4 12625525       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57712 (ESTABLISHED)
mdsip   3730036 mdsadmin  111u  IPv4 12625528       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57720 (ESTABLISHED)
mdsip   3730036 mdsadmin  112u  IPv4 12625531       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57734 (ESTABLISHED)
mdsip   3730036 mdsadmin  113u  IPv4 12625534       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57750 (ESTABLISHED)
mdsip   3730036 mdsadmin  114u  IPv4 12625537       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57752 (ESTABLISHED)
mdsip   3730036 mdsadmin  115u  IPv4 12625543       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57768 (ESTABLISHED)
mdsip   3730036 mdsadmin  116u  IPv4 12625546       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57782 (ESTABLISHED)
mdsip   3730036 mdsadmin  117u  IPv4 12514018       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57784 (ESTABLISHED)
mdsip   3730036 mdsadmin  118u  IPv4 12514021       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60928 (ESTABLISHED)
mdsip   3730036 mdsadmin  119u  IPv4 12514024       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60932 (ESTABLISHED)
mdsip   3730036 mdsadmin  120u  IPv4 12514027       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60946 (ESTABLISHED)
mdsip   3730036 mdsadmin  122u  IPv4 12514031       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60948 (ESTABLISHED)
mdsip   3730036 mdsadmin  123u  IPv4 12609797       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60636 (ESTABLISHED)
mdsip   3730036 mdsadmin  124u  IPv4 12514034       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60964 (ESTABLISHED)
mdsip   3730036 mdsadmin  125u  IPv4 12514037       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60966 (ESTABLISHED)
mdsip   3730036 mdsadmin  126u  IPv4 12625549       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60980 (ESTABLISHED)
mdsip   3730036 mdsadmin  127u  IPv4 12625552       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60994 (ESTABLISHED)
mdsip   3730036 mdsadmin  128u  IPv4 12625561       0t0       TCP atlas.gat.com:8003->atlas.gat.com:59374 (ESTABLISHED)
mdsip   3730036 mdsadmin  129u  IPv4 12230206       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57098 (ESTABLISHED)
mdsip   3730036 mdsadmin  130u  IPv4 12625564       0t0       TCP atlas.gat.com:8003->atlas.gat.com:59376 (ESTABLISHED)
mdsip   3730036 mdsadmin  131u  IPv4 12401620       0t0       TCP atlas.gat.com:8003->atlas.gat.com:51018 (ESTABLISHED)
mdsip   3730036 mdsadmin  132u  IPv4 12401626       0t0       TCP atlas.gat.com:8003->atlas.gat.com:51030 (ESTABLISHED)
mdsip   3730036 mdsadmin  138u  IPv4 12715072       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43630 (ESTABLISHED)
mdsip   3730036 mdsadmin  150u  IPv4 12582121       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46228 (ESTABLISHED)
mdsip   3730036 mdsadmin  153u  IPv4 12582145       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60666 (ESTABLISHED)
mdsip   3730036 mdsadmin  155u  IPv4 12625491       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46282 (ESTABLISHED)
mdsip   3730036 mdsadmin  156u  IPv4 12625494       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46292 (ESTABLISHED)
mdsip   3730036 mdsadmin  157u  IPv4 12582133       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46306 (ESTABLISHED)
mdsip   3730036 mdsadmin  160u  IPv4 12582148       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60670 (ESTABLISHED)
mdsip   3730036 mdsadmin  164u  IPv4 12625497       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60684 (ESTABLISHED)
mdsip   3730036 mdsadmin  172u  IPv4 12625507       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44762 (ESTABLISHED)
mdsip   3730036 mdsadmin  178u  IPv4 12582180       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44832 (ESTABLISHED)
mdsip   3730036 mdsadmin  186u  IPv4 12625540       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57760 (ESTABLISHED)
mdsip   3730036 mdsadmin  201u  IPv4 12401623       0t0       TCP atlas.gat.com:8003->atlas.gat.com:51026 (ESTABLISHED)
mdsip   3767135 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3767135 mdsadmin    0u  IPv4 12230301       0t0       TCP atlas.gat.com:mdsip->iris29.gat.com:59836 (ESTABLISHED)
mdsip   3771645 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3771645 mdsadmin    0u  IPv4 12689154       0t0       TCP atlas.gat.com:mdsip->iris29.gat.com:60068 (ESTABLISHED)

sflanagan avatar Mar 26 '24 17:03 sflanagan

Systemd contents:

mds-8002

[Unit]
Description=MDSplus Dispatcher Service
After=network.target

[Service]
User=mdsadmin
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8002 -block

[Install]
WantedBy=multi-user.target

mds-8003:

[Unit]
Description=MDSplus Loader Service
After=network.target

[Service]
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8003 -block

[Install]
WantedBy=multi-user.target

OS (atlas.gat.com):

[root@atlas system]# cat /etc/*-release
NAME="Red Hat Enterprise Linux"
VERSION="8.8 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
Red Hat Enterprise Linux release 8.8 (Ootpa)
Red Hat Enterprise Linux release 8.8 (Ootpa)

MDSplus Version:

[root@atlas system]# rpm -qa | grep mds
mdsplus-alpha-mssql-7.139-59.el8.x86_64
mdsplus-alpha-idl_bin-7.139-59.el8.x86_64
mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
mdsplus-alpha-d3d-7.139-59.el8.noarch
mdsplus-alpha-kernel-7.139-59.el8.noarch
mdsplus-alpha-motif-7.139-59.el8.noarch
mdsplus-alpha-repo-7.139-40.el8.noarch
mdsplus-alpha-motif_bin-7.139-59.el8.x86_64
mdsplus-alpha-python-7.139-59.el8.noarch
mdsplus-alpha-idl-7.139-59.el8.noarch

To explain mdsipd-dispatching:

#!/bin/sh -f
#
# MDSplus server daemon procedure
#
# Parameters:
#
#       $1 = service or portnumber

if [ $# -lt 1  ]
then
  echo ""
  echo "Correct usage is $0 portnumber [-block]"
  echo example: $0 8001
  echo ""
  exit
fi

MDSPLUS_DIR=/usr/local/mdsplus

# Note that we source the setup-server.sh file because this is to be a server
# process, not a regular data server process.
. $MDSPLUS_DIR/setup-server.sh

logfile=/var/log/mdsplus/mdsip-server/$1.log
echo "" >> $logfile 2>> $logfile
echo "Telling any existing server on this port to shut down..." >> $logfile 2>> $logfile
/usr/local/mdsplus/bin/mdstcl stop server localhost:$1 >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "############# END OF OLD LOGFILE #############" >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "############ START OF NEW LOGFILE ############" >> $logfile 2>> $logfile
echo "Starting mdsip -s on port "$1" at "`date` >> $logfile  2>> $logfile
echo "--------------------" >> $logfile  2>> $logfile
echo "Showing ulimit -a" >> $logfile  2>> $logfile
ulimit -a >> $logfile  2>> $logfile
echo "" >> $logfile  2>> $logfile
echo "--------------------" >> $logfile  2>> $logfile
echo "" >> $logfile  2>> $logfile

# if -block is specified, then don't run in the background
if [ $# -gt 1 ]
then
  if [ $2 = "-block" ]
  then
    exec $MDSPLUS_DIR/bin/mdsip -s -p $1 -h /etc/mdsip.hosts -c 0 >> $logfile 2>> $logfile
    exit 0
  fi
fi
exec $MDSPLUS_DIR/bin/mdsip -s -p $1 -h /etc/mdsip.hosts -c 0 >> $logfile 2>> $logfile &

Please note that the error mentioned occurs after ~1,000 action nodes are triggered.

This happens whether it is 1k actions in a single shot or spread across multiple shots.

sflanagan avatar Mar 26 '24 17:03 sflanagan

Thanks for all the info. A quick workaround: add Restart=always to your service files, and after the last dispatch/phase add exit.
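Concretely, that looks something like this (a sketch based on the mds-8002 unit shown above; not a verified production config):

[Service]
User=mdsadmin
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
Restart=always
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8002 -block

and at the end of the dispatching TCL script (the phase name here is a placeholder):

TCL> dispatch/phase STORE
TCL> exit

The exit terminates the dispatching process, which closes its accumulated sockets, and Restart=always lets systemd bring the service back up for the next shot.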

joshStillerman avatar Mar 26 '24 17:03 joshStillerman

Hi @sflanagan -- Josh has succeeded in reproducing the issue with the current alpha on Ubuntu. We are now debugging the problem. And will post additional updates here later this afternoon.

mwinkel-dev avatar Mar 26 '24 19:03 mwinkel-dev

Took some time to configure my dev environment, but I am now also able to reproduce the bug. So will attach the debugger and figure out how the code is leaving dangling sockets.

mwinkel-dev avatar Mar 26 '24 22:03 mwinkel-dev

Progress has been made on debugging the issue, but still more to do to locate the root cause of the dangling sockets. Have provided my initial findings to Josh and Stephen for review and suggestions.

mwinkel-dev avatar Mar 27 '24 06:03 mwinkel-dev

@mwinkel-dev Glad you were able to reproduce. Did the workaround proposed by @joshStillerman work?

smithsp avatar Mar 27 '24 17:03 smithsp

Hi @smithsp and @sflanagan -- I will test Josh's workaround right now. I'm sure it will work.

I have been manually killing the mdsip process between experiments and that of course cleans up the dangling sockets. Nonetheless, you have a very good point that I should also automate that by using Josh's workaround (which kills the mdsip at the end of the shot and relies on systemd to restart the process).

Current investigation shows that the dangling sockets gradually accumulate until the process that launches the actions (such as TCL) exits. When the process exits, the sockets are cleaned up. After I test Josh's workaround, I will resume looking for a fix that closes the sockets when each action thread finishes, instead of waiting until the entire process exits.

mwinkel-dev avatar Mar 27 '24 21:03 mwinkel-dev

Hi @smithsp and @sflanagan -- I have tested Josh's recommended workaround and it is A-OK.

Here are the details of the experiment . . .

  • There are two action nodes in a dinky tree named zzz. Each action node just sleeps for 10 seconds.
  • The mdsip access to the zzz_<shot> trees is on the usual port 8000.
  • The mdsip process for actions (aka action server) is on port 8100 and configured as a systemd service.
  • The systemd service file is similar to GA's, but also has Restart=always in the [Service] section.
  • Then these TCL commands are executed (as per Josh's reproduce case).
TCL> set tree zzz
TCL> create pulse 1
TCL> set tree zzz /shot=1
TCL> dispatch /build
TCL> dispatch/phase/log init
  • After ~20 seconds, control returns to the TCL> prompt.
  • In another terminal window, run lsof and grep for the action server on port 8100. Note the dangling TCP sockets. (They do get cleaned up if you exit TCL; however, for this example we wish to remain in TCL.)
mdsip     3899986                            mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3899986                            mdsplus    3u     IPv4           30142804       0t0        TCP *:8100/tcp (LISTEN)
mdsip     3899986                            mdsplus    6u     IPv4           30155996       0t0        TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip     3899986                            mdsplus    7u     IPv4           30144205       0t0        TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip     3899986                            mdsplus    9u     IPv4           30156000       0t0        TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip     3899986 3900283 mdsip              mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3899986 3900283 mdsip              mdsplus    3u     IPv4           30142804       0t0        TCP *:8100/tcp (LISTEN)
mdsip     3899986 3900283 mdsip              mdsplus    6u     IPv4           30155996       0t0        TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip     3899986 3900283 mdsip              mdsplus    7u     IPv4           30144205       0t0        TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip     3899986 3900283 mdsip              mdsplus    9u     IPv4           30156000       0t0        TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip     3899986 3900284 mdsip              mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3899986 3900284 mdsip              mdsplus    3u     IPv4           30142804       0t0        TCP *:8100/tcp (LISTEN)
mdsip     3899986 3900284 mdsip              mdsplus    6u     IPv4           30155996       0t0        TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip     3899986 3900284 mdsip              mdsplus    7u     IPv4           30144205       0t0        TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip     3899986 3900284 mdsip              mdsplus    9u     IPv4           30156000       0t0        TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip     3899986 3900285 mdsip              mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3899986 3900285 mdsip              mdsplus    3u     IPv4           30142804       0t0        TCP *:8100/tcp (LISTEN)
mdsip     3899986 3900285 mdsip              mdsplus    6u     IPv4           30155996       0t0        TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip     3899986 3900285 mdsip              mdsplus    7u     IPv4           30144205       0t0        TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip     3899986 3900285 mdsip              mdsplus    9u     IPv4           30156000       0t0        TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
  • Then at the TCL> prompt, type this: TCL> stop server localhost:8100
  • That kills the current action server.
  • Whereupon systemd automatically restarts the service. It remains on port 8100, but of course has a new process ID (i.e., compare the following process ID with the one above).
root@mfews-mwinkel:/home/mwinkel/x_markw/config# ps -ef | fgrep mdsip
mdsplus  3900338       1  0 18:57 ?        00:00:00 /usr/local/xmw_mds/mdsplus/bin/mdsip -s -p 8100 -h /tmp/xmw_ga_2731/mdsip.hosts -c 9
  • And as expected, lsof shows that the new process has just a single TCP socket listening on port 8100.
root@mfews-mwinkel:/home/mwinkel/x_markw/config# lsof | fgrep 3900338 | fgrep TCP
mdsip     3900338                            mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3900338                            mdsplus    3u     IPv4           30160993       0t0        TCP *:8100/tcp (LISTEN)
  • While still at the TCL> prompt, run a new shot (i.e., repeat all the steps listed above). And observe the same behavior.

In short, Josh's workaround works fine on my Ubuntu 20 development system.

Let us know if you have any problems with this workaround on GA's RHEL8 server.

mwinkel-dev avatar Mar 27 '24 23:03 mwinkel-dev

Hi @smithsp and @sflanagan -- Root cause of the dangling sockets in mdstcl has been found. When "actions" complete, there is a coordination problem between two threads. Conceptually, one thread attempts to delete the associated socket, but can't because the other thread still has a reference to it. (That is a generalization; the details are a bit different.)

It will take us a few days to fix the issue (i.e., iterate on the design, manual testing, and peer review).

While investigating this issue, I have also been using the systemd workaround (so I have a clean slate for each experiment). The workaround is working fine on my development system.

Let me know if you have any problems with the systemd workaround on GA's server.

Addendum: This post (and many of the following posts) presents initial conjectures that turned out to be wrong. For the final summary of the investigation, refer to the post at this link: https://github.com/MDSplus/mdsplus/issues/2731#issuecomment-2067188950

mwinkel-dev avatar Mar 29 '24 23:03 mwinkel-dev

Hi @sflanagan -- How many "actions" are issued when running a single DIII-D shot? And how many "action servers" are involved? And how many shots per day?

Would like that information so that I can design a test case that is many times more load on the "action server" than the DIII-D load.

mwinkel-dev avatar Apr 01 '24 15:04 mwinkel-dev

Assuming, as an upper bound, that everything is enabled and ops is having a long and/or productive run day...

How many "actions" are issued when running a single DIII-D shot?

~300

And how many "action servers" are involved?

1

And how many shots per day?

~50

sflanagan avatar Apr 01 '24 21:04 sflanagan

Let me know if you have any problems with the systemd workaround on GA's server.

I redesigned my mds-8002 systemd file to add:

User=mdsadmin
Restart=always
ExecStartPost=/usr/local/mdsplus/local/mds-8002-post.sh

Where the mds-8002-post.sh just automatically sets up the dispatch table, shot number(s), opens the relevant shot tree(s), etc... and then triggers a "test" action node.

Seems "fine" as a temporary workaround, but it does leave me open to skipping/missing a shot if the dispatcher service is restarting while the DAQ system is trying to start a new shot cycle. Trying to reduce that risk is next up on my agenda (if I can).

sflanagan avatar Apr 01 '24 22:04 sflanagan

Hi @sflanagan -- Thanks for the information about the number of actions in a DIII-D shot. And for the news regarding the systemd workaround.

mwinkel-dev avatar Apr 02 '24 15:04 mwinkel-dev

Now have a prototype fix. Next step is to do some manual stress testing, followed by peer review. Will take a few more days before there is a final fix that will be merged to alpha and cherry-picked to the GA branch.

mwinkel-dev avatar Apr 03 '24 04:04 mwinkel-dev

Hi @smithsp and @sflanagan,

Summary This post summarizes recent emails with @smithsp and provides additional details.

This problem was introduced with PR #2288 on 19-May-2021. The bug appears to only affect the dispatch command of TCL. Thus, it is likely that the associated fix is low risk. (Testing is presently underway to confirm that is indeed true.)

Details When mdstcl (which is a client) dispatches an action, it also creates a "receiver" thread to receive the eventual status reply regarding that action. (The action is sent to an "action server" that then processes the action and sends a reply back to the mdstcl client.) For each action that is dispatched, mdstcl thus opens up two sockets: one is the "to" socket used to send the action to the action server, and the other is the "from" socket used to receive the reply from the action server.

The issue is that the "connection ID table" in the mdstcl client should be accessible by all threads, but unfortunately isn't. Thus, when the receiver thread receives the SrvJobFINISHED reply, it can close its "from" socket, but is unable to access the main thread's "connection ID table" in order to close the "to" socket. Thus, for each action dispatched, the current code leaves a dangling "to" socket.

The two workarounds clean up the dangling "to" sockets. Adding an "exit" statement at the end of a TCL script that dispatches actions terminates the TCL process thus all of its sockets are closed. Alternatively, using the stop server command of TCL will cause the action server to reboot (if systemd or xinetd is configured correctly) which also closes all dangling sockets.

The proposed fix does not alter any logic regarding sockets and connections. All that it does is allow the "receiver" thread to access the "connection ID table" so that the "to" socket can be closed. Preliminary testing confirms that the fix works and that mdstcl is no longer leaking a "to" socket for each dispatched action. Thus, the workarounds are no longer needed.
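Conceptually, the fix amounts to something like the following (a generic C sketch of a lock-protected connection table shared between the dispatch thread and the receiver threads; the names and layout are hypothetical, not the actual MDSplus data structures):

#include <pthread.h>
#include <unistd.h>

#define MAX_CONNS 1024

/* One entry per dispatched action: the "to" socket sends the action,
   the "from" socket receives the reply. */
typedef struct {
  int in_use;
  int to_sock;
  int from_sock;
} conn_entry;

static conn_entry conn_table[MAX_CONNS];
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by a receiver thread once SrvJobFINISHED arrives.  Because the
   table is now shared (and locked), the receiver can also close the
   "to" socket that was previously leaked. */
void conn_finished(int conid)
{
  pthread_mutex_lock(&conn_lock);
  if (conid >= 0 && conid < MAX_CONNS && conn_table[conid].in_use) {
    close(conn_table[conid].to_sock);
    close(conn_table[conid].from_sock);
    conn_table[conid].in_use = 0;
  }
  pthread_mutex_unlock(&conn_lock);
}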

Low Risk? Whenever fiddling with multi-threaded code, there is always a risk of race conditions being created. Hence the focus on extensive testing of the fix prior to release to GA. The goal of the testing is to confirm that all critical sections are protected with locks. Initial testing is promising; no threading issues have been encountered.

And because the problem appears to be limited to the dispatch command of mdstcl, that also reduces the risk associated with the fix. It is unlikely to affect other aspects of the MDSplus suite of tools.

Note also that this bug has been present in the alpha branch for ~3 years, and that no other customers have reported problems. That tends to confirm that the problem doesn't affect the entire socket communication foundation of MDSplus. And likely also means that the fix will have lower risks than expected. (Regardless, extensive testing is needed to confirm that is so.)

Clarification This issue affects all commands sent to an "action server". That includes the following activities (i.e., the list gives descriptions; the actual TCL statements are likely different).

  • abort server
  • close trees
  • create pulse
  • dispatch action
  • dispatch command
  • monitor check-in
  • monitor
  • logging
  • stop server

mwinkel-dev avatar Apr 04 '24 18:04 mwinkel-dev

Hi @sflanagan -- Questions about your site's single "action server" . . .

  • What version of MDSplus does it run?
  • What Operating System and version?
  • Does the "action server" MDSplus/OS version match that of the "dispatch server" (apparently Atlas)?
  • Is it a dedicated "action server" (i.e., does it only process actions)? (Am curious if it is used for other MDSplus activities such as data analysis and/or to run non-MDSplus applications.)

Answers to the above questions will enable us to set up a cross-version testing environment, if needed.

mwinkel-dev avatar Apr 04 '24 18:04 mwinkel-dev

A stress test of 1,200 actions (400 actions per shot, 3 shots) reveals that there is an unprotected critical section that can cause a segfault. The segfault also occurs in the alpha branch thus was not introduced by the proposed fix.

Adding a delay of 0.01 seconds to each action eliminates the segfault, but causes the proposed fix to leave ~10 dangling sockets (i.e., it closes the other 1,190 "to" sockets). Additional investigation is required to determine if the dangling sockets are caused by a second unprotected critical section. And if so, whether it was introduced with the proposed fix.

Increasing the delay to 0.1 seconds allows the proposed fix to work fine -- no dangling sockets.

mwinkel-dev avatar Apr 05 '24 05:04 mwinkel-dev

The proposed fix successfully handled 20,000 actions without leaking sockets. (However, each action had a 0.1 second delay.) That was 50 back-to-back shots with 400 actions per shot. The test was designed to generate a spike load on the "action server".

An even more extreme stress test does create a segfault in both the current alpha and the proposed fix. Preliminary investigation indicates that the segfault is probably a pre-existing issue and was not introduced by the proposed fix. The segfault investigation continues (and perhaps will be tracked in a new GitHub issue).

Summary is that the proposed fix is now working well enough that it is moving on to the peer review stage. However, it will still be a few days before the fix is ready for release to GA.

mwinkel-dev avatar Apr 05 '24 15:04 mwinkel-dev

Thanks for the update @mwinkel-dev .

smithsp avatar Apr 05 '24 15:04 smithsp

Took all of Saturday to figure out that there is a "race" condition associated with this fix.

When mdstcl dispatches an action to the "action server", the server sends back two messages: "handshake" and "done". And they are received by two different threads.

  • The "handshake" message is when the server acknowledges that it has received an action.
  • The "done" message is sent when the server has processed the action.

Note: The names, "handshake" and "done", are just for clarity in this post. Those names don't appear in the source code.

In practice, there is usually a fair amount of time between the two messages (e.g., it takes time for an action to configure a digitizer). And that time interval enables the code to process the "handshake" before the "done".

However, the spike load test has an action that can be done instantly (i.e., just evaluating the number 1). And thus on a heavily loaded action server, the "done" message can be received and processed by mdstcl before the "handshake" is fully processed. Because processing the "done" message deletes the associated connection, the "handshake" code gets mighty confused when its connection id gets clobbered. Hence the errors and occasional segfaults.

Conceptually, the ideal fix to avoid the "race" condition is to use a lock on every single connection object. However, it will likely be a challenge to retrofit that into the existing code. (There is a risk of subtle bugs creeping in that could result in deadlock.)

The practical fix is probably just to add 100ms delay before processing the "done" message and closing the connection. Experiments prove that this simple approach almost always avoids the race condition. (It leaked 2 or 3 sockets out of 400 actions processed.)

However, it is important that the dispatch cycle roughly match the added delay, so that connections are created and destroyed at approximately the same rate. If the delay is too long, and there are hundreds of actions dispatched at a high rate, then all sockets (allowed by the operating system) can be consumed thereby causing mdstcl and/or the action server to crash.

An alternative concept for matching connection creation / destruction rates is for mdstcl to keep a running tally of open connections. And if the count hits a user specified threshold (e.g., 500 active connections max), then it pauses dispatching actions. As actions are drained from the server's input queue, mdstcl will close connections accordingly. And thus as each action finishes, mdstcl will be able to dispatch another action.

Combining both approaches would allow mdstcl to avoid both problems: race condition and no more sockets.
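The running-tally idea could look roughly like this (a hypothetical C sketch, not MDSplus code; the 500-connection threshold is just the example value from above):

#include <pthread.h>

#define MAX_IN_FLIGHT 500  /* user-specified threshold */

static pthread_mutex_t tally_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tally_cond = PTHREAD_COND_INITIALIZER;
static int in_flight = 0;

/* Main thread: block before dispatching while too many actions are pending. */
void before_dispatch(void)
{
  pthread_mutex_lock(&tally_lock);
  while (in_flight >= MAX_IN_FLIGHT)
    pthread_cond_wait(&tally_cond, &tally_lock);
  in_flight++;
  pthread_mutex_unlock(&tally_lock);
}

/* Receiver thread: called after the "done" message has been fully processed,
   which lets the main thread dispatch the next queued action. */
void after_done(void)
{
  pthread_mutex_lock(&tally_lock);
  in_flight--;
  pthread_cond_signal(&tally_cond);
  pthread_mutex_unlock(&tally_lock);
}

This sketch only addresses the socket-exhaustion half; per the above, the short delay before closing the connection would still be needed to dodge the "handshake"/"done" race.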

mwinkel-dev avatar Apr 07 '24 03:04 mwinkel-dev

There is a fatal flaw in the proposed fix (call it "Option A") described in the preceding posts: deadlock.

The existing architecture of the mdstcl dispatch feature consists of the following:

  • a "connection id" table that contains information about each connection (sockets used, connection state, and so forth),
  • a main thread that also dispatches actions, and
  • many "receiver" threads, one per each action sent to the action server.

Option A treats the "connection id" table as a resource shared by all threads, and thus locks are used to protect access when adding or deleting connections. However, for that to work well, the main thread needs to be throttled so that the receiver threads have an opportunity to access the table too. Note though that if a receiver thread obtains the lock and then dies before releasing it, the main thread is permanently blocked from dispatching more actions. Thus mdstcl deadlocks. (There are likely other edge cases that would also create deadlock.)

Although the principles and architectures for high-performance, robust, multi-threaded, web servers and clients are now well known, the challenge is in retrofitting the existing action server architecture to accommodate those features.

Option B might be a more practical and less risky option.

It has these features . . .

  • the main dispatch thread passes a connection's socket to the associated receiver thread
  • thus the receiver threads can close sockets without having to access the "connection id" table
  • that eliminates the need for locking on the table (because it is only accessed by the main thread)
  • it only allows ~N active connections at a time, where N is chosen to be well below the number of sockets allowed by the operating system,
  • N is user configurable (likely via an environment variable)
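Before weighing pros and cons, here is a minimal C sketch of the socket-handoff idea in the first two bullets (all names hypothetical; real code would pass more state than just the two sockets):

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct {
  int to_sock;   /* owned by the receiver thread from this point on */
  int from_sock;
} receiver_args;

static void *receiver_thread(void *arg)
{
  receiver_args *a = (receiver_args *)arg;
  /* ... wait on a->from_sock for the action's final status ... */
  close(a->to_sock);   /* no access to the "connection id" table needed */
  close(a->from_sock);
  free(a);
  return NULL;
}

/* Dispatch thread: launch one receiver per dispatched action. */
int launch_receiver(int to_sock, int from_sock)
{
  pthread_t tid;
  receiver_args *a = malloc(sizeof(*a));
  if (!a)
    return -1;
  a->to_sock = to_sock;
  a->from_sock = from_sock;
  if (pthread_create(&tid, NULL, receiver_thread, a) != 0) {
    free(a);
    return -1;
  }
  pthread_detach(tid);  /* thread cleans up after itself; no join needed */
  return 0;
}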

Advantages:

  • doesn't immediately consume all available sockets when dispatching hundreds of actions
  • doesn't leak sockets
  • avoids deadlock

Disadvantages:

  • when the action server is finished processing an action, its entry in the "connection id" table becomes obsolete
  • thus the "connection id" table constantly grows

Analysis:

  • suppose that 20,000 actions are dispatched in a day
  • the memory required to store a "connection id" table of 20,000 entries is fairly small -- probably just a few MB
  • the growth of the table is highly unlikely to force a reboot of the server (as is the case if run out of sockets)
  • a cron job can be run each evening (when shots aren't being run) to restart the action server, thus each morning the action server starts with an empty "connection id" table
  • the current shot's connections are added to the head of the table, thus table growth during the day won't have much impact on the speed of table lookups or adding new entries
  • this option involves little change to the existing code, plus avoids deadlock, thus is less risky than Option A

Summary:

  • The long-term solution is to write a new dispatcher using a more robust architecture.
  • However to meet the schedule of DIII-D's upcoming campaign, the fastest and safest approach is to make minimal changes to the existing mdstcl dispatch feature.
  • Option B thus shows more promise than Option A.

mwinkel-dev avatar Apr 07 '24 19:04 mwinkel-dev

Now have a rough prototype of Option B. It dispatched 3,600 actions at full speed and only leaked one socket. The Linux top command shows a negligible memory footprint (i.e., as expected the growth of the "connection id" table is insignificant).

On Monday, will clean up the code and do more extensive testing. And if the results look good, will move Option B to the peer review stage of the development process.

mwinkel-dev avatar Apr 08 '24 03:04 mwinkel-dev

Option B prototype dispatched 21,600 actions with just 35 sockets leaked (a leak rate of ~0.2%). Actions were dispatched at the rate of 10 actions per second. Rough estimate is that the "connection ID" table grew to ~187 MB.

However, if the dispatch throttle of 10 actions/second is removed, then mdstcl can deadlock when dispatching actions. That is a latent bug that has surely existed in the code for years; it is unrelated to the fix for this socket issue. (Nonetheless, it will be investigated before this socket fix is released to GA.)

When mdstcl deadlocked during testing, I usually did not also examine the "action server". However, the one instance when I did check revealed that the "action server" remained active (i.e., I opened a new terminal window, started another mdstcl session, and dispatched more actions). When additional mdstcl deadlocks are observed, more tests will be done to determine if the "action server" is always unaffected.

mwinkel-dev avatar Apr 08 '24 16:04 mwinkel-dev

There is likely another mode that causes mdstcl dispatch to leak sockets. It is associated with the queue length on the "action server".

Here are the details . . .

Because mdstcl dispatches actions and processes the associated "done" messages, it can estimate the number of actions that are queued up on the "action server". (For each queued action, two sockets are used.)

I configured mdstcl so it would maintain ~20 actions in the server's queue, and then dumped 400 actions into the dispatch hopper. It quickly ramped up to ~20 dispatched actions and thereafter maintained a steady state: when an action finished, it would dispatch another action, keeping the tally at ~20 dispatched-and-queued actions. The logs for mdstcl and the "action server" both indicated all 400 actions were processed. So I was surprised to see that lsof reported ~20 leaked sockets.

I repeated the experiment with the limit set at ~100 actions, and observed that ~100 sockets leaked.

On Tuesday, will dig into this mystery. (There is a chance that there is a flaw in the design / implementation of the experiment.)

mwinkel-dev avatar Apr 09 '24 04:04 mwinkel-dev

This problem was introduced with PR #2288 on 19-May-2021. The bug appears to only affect the dispatch command of TCL. Thus, it is likely that the associated fix is low risk. (Testing is presently underway to confirm that is indeed true.)

I vaguely remember the struggle we had sorting out the existing problem at the time while keeping backward compatibility. The issue was that the old protocol required a reconnection on an offered port to send the result of the action.

The original goals that I remember were to support:

  • cancellation of dispatched actions by event (e.g. user intervention, Ctrl+C)
  • timeout
  • parallel execution of actions

zack-vii avatar Apr 09 '24 15:04 zack-vii

Hi @zack-vii -- Thanks for the history of the change. Much appreciated!

I will expand my experiments to see if my proposed fix for the leaking sockets breaks any of the features you listed above.

mwinkel-dev avatar Apr 09 '24 16:04 mwinkel-dev