
Mdstcl's "dispatch" feature can trigger segfaults and other errors if you flood the "action server" with hundreds of actions

mwinkel-dev opened this issue 10 months ago

Affiliation MIT PSFC

Version(s) Affected Found in alpha_7.140.71, but likely has been present for a few years.

Platform Found on Ubuntu 20, but surely exists on other platforms too.

Describe the bug As a stress test of mdstcl and the "action server", I created a spike load of hundreds of actions. When the load was applied at the fastest rate, segfaults were common. When it was applied at a slightly slower rate, various errors appeared in the TCL output regarding failed communication with the "action server".

To Reproduce Steps to reproduce the behavior:

  1. Create a model tree of 400 actions, where each action evaluates a simple numeric expression (e.g., just the number 1).
  2. Make a TCL script file that creates two or three pulses (shots) from the model and dispatches the actions for each pulse.
  3. Run that script three or four times; it will usually segfault.
  4. Adding a wait(0.1); at the start of each action usually allows the script to run without error. (However, see the "Additional context" section below.)
  5. Alternatively, edit the source code and add a 0.1-second delay in the ServerActionDispatch.c file (see the sketch after this list).
  6. Compile the source and run the TCL script again. The delay eliminates the segfault, but mdstcl will occasionally display errors regarding server_connect() and get_bytes_to().
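
For steps 5 and 6, a minimal sketch of the kind of delay involved (the function name and its exact placement inside ServerActionDispatch.c are illustrative, not the actual patch; nanosleep is the standard POSIX call for sub-second sleeps):

```c
#include <time.h>

/* Illustrative throttle: sleep 0.1 s before dispatching each action.
 * The insertion point is hypothetical; any point on the per-action
 * dispatch path has the same effect. */
static void throttle_dispatch(void)
{
  struct timespec delay = { 0, 100000000L }; /* 0.1 s = 100,000,000 ns */
  nanosleep(&delay, NULL);
}
```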

Expected behavior No segfaults or errors, no matter how fast a spike load is applied. The "action server" should throttle / queue the load, and mdstcl should wait patiently until the "action server" is able to accept more actions. A sketch of that back-pressure behavior follows.
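
For illustration only, a minimal sketch of the desired back-pressure, assuming a bounded queue between the dispatcher and the "action server" (none of these names exist in MDSplus; the point is that a full queue blocks the producer instead of overrunning the server):

```c
#include <pthread.h>

#define QUEUE_CAPACITY 64

typedef struct {
  void *items[QUEUE_CAPACITY];
  int head, tail, count;
  pthread_mutex_t lock;
  pthread_cond_t not_full, not_empty;
} action_queue;

/* Producer side (mdstcl): blocks while the queue is full. */
void queue_put(action_queue *q, void *action)
{
  pthread_mutex_lock(&q->lock);
  while (q->count == QUEUE_CAPACITY)          /* back-pressure: wait patiently */
    pthread_cond_wait(&q->not_full, &q->lock);
  q->items[q->tail] = action;
  q->tail = (q->tail + 1) % QUEUE_CAPACITY;
  q->count++;
  pthread_cond_signal(&q->not_empty);
  pthread_mutex_unlock(&q->lock);
}

/* Consumer side (action server): drains entries as it is able. */
void *queue_get(action_queue *q)
{
  pthread_mutex_lock(&q->lock);
  while (q->count == 0)
    pthread_cond_wait(&q->not_empty, &q->lock);
  void *action = q->items[q->head];
  q->head = (q->head + 1) % QUEUE_CAPACITY;
  q->count--;
  pthread_cond_signal(&q->not_full);
  pthread_mutex_unlock(&q->lock);
  return action;
}
```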

Screenshots n/a

Additional context The alpha branch leaks sockets (see Issue #2731), and thus can only process ~800 actions before exceeding the socket limit. If troubleshooting this issue requires more actions than that, the fix for Issue #2731 will be needed.

This problem was found while investigating Issue #2731. Because that issue is a "US Priority" task, this issue is being assigned the same priority.

All programs (mdstcl, the "action server" and the "tree server") were running on the same Ubuntu 20 system, and mdstcl used thick-client mode to access the trees.

mwinkel-dev avatar Apr 05 '24 20:04 mwinkel-dev

Preliminary investigation points to three problems:

  • segfault occurs when dispatching at high rates (e.g., 100+ actions per second) and can be mitigated by throttling to 10 actions per second
  • code is not properly handling the timeout condition when reading the message body from the action server (see the sketch after this list)
  • connection id is being overwritten occasionally
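
Regarding the timeout problem, a minimal sketch of the defensive read pattern (hypothetical names, not the actual get_bytes_to() code): wait on the socket with a deadline, and treat a timeout as a reportable error rather than proceeding with a partially filled buffer.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <errno.h>

/* Read exactly `len` bytes or fail cleanly; names are illustrative. */
int read_body_with_timeout(int sock, char *buf, size_t len, int timeout_sec)
{
  size_t got = 0;
  while (got < len) {
    fd_set readfds;
    struct timeval tv = { timeout_sec, 0 };
    FD_ZERO(&readfds);
    FD_SET(sock, &readfds);
    int ready = select(sock + 1, &readfds, NULL, NULL, &tv);
    if (ready == 0)
      return -1;                     /* timeout: report it, don't use buf */
    if (ready < 0) {
      if (errno == EINTR)
        continue;                    /* interrupted: retry the wait */
      return -1;
    }
    ssize_t n = recv(sock, buf + got, len - got, 0);
    if (n <= 0)
      return -1;                     /* peer closed, or socket error */
    got += (size_t)n;
  }
  return 0;
}
```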

Because these problems haven't been reported by customers, the spike load test appears to be an edge case. Apparently, the above problems rarely arise at customer sites.

Nonetheless, the root causes need to be identified, and also fixed (if practical to do so).

mwinkel-dev avatar Apr 06 '24 20:04 mwinkel-dev

The connection id being overwritten might be a threading problem.

mwinkel-dev avatar Apr 06 '24 22:04 mwinkel-dev

The connection id was indeed being clobbered by another thread. However, that is part of Issue #2731, so it is described there.

And there is a fourth problem:

  • Improper checking of the status returned by functions that use SsINTERNAL (see the sketch below)
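
For illustration, a sketch of the checking pattern at issue, assuming the usual MDSplus convention that odd status values indicate success, and that SsINTERNAL is a sentinel that must be handled as its own case (the placeholder definition below exists only so the sketch compiles; the real macro lives in the servershr headers):

```c
#include <stdio.h>

/* MDSplus convention: odd status values indicate success. */
#define STATUS_OK(s)  (((s) & 1) != 0)

/* Placeholder value for this sketch only; use the real macro
 * from the servershr headers in actual code. */
#define SsINTERNAL 0

static int handle_dispatch_status(int status)
{
  if (status == SsINTERNAL) {
    /* Special case: must not fall through to the generic
     * odd/even success test below. */
    return 1;
  }
  if (!STATUS_OK(status)) {
    fprintf(stderr, "dispatch failed, status = %d\n", status);
    return 0;
  }
  return 1;
}
```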

mwinkel-dev avatar Apr 07 '24 03:04 mwinkel-dev

It turns out that the throttle on the dispatch rate is also part of Issue #2731, so it is described there.

mwinkel-dev avatar Apr 07 '24 18:04 mwinkel-dev

Hm, I have a hunch it may be due to the sheer number of socket connections that the server has to handle in a short period of time. The dispatcher has a main select loop that handles incoming new reply connections as well as answering reply sockets. For one, there may be a limit on how many sockets select can handle. Furthermore, since it also handles the replies (which saves the overhead of keeping the reply_socket list up to date), the backlog of the listening server may simply be exhausted. One could probably scale it by dynamically extending the number of reply threads, and thus hosting additional listening reply sockets. The dispatcher could then do some kind of load balancing, e.g., find the reply_thread with the minimal number of active sockets; if that number is less than X, use that reply_thread's listener, otherwise spawn an additional reply_thread. (A rough sketch of that policy follows.)
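
For reference, select() cannot watch file descriptors numbered at or above FD_SETSIZE (commonly 1024 on Linux), which is consistent with the exhaustion hypothesis. A rough sketch of the suggested load-balancing policy (all names are hypothetical; this shows the policy, not the dispatcher's actual data structures):

```c
#include <stddef.h>

#define MAX_SOCKETS_PER_THREAD 64   /* the "X" in the comment above */

typedef struct reply_thread {
  int active_sockets;               /* sockets currently served by this thread */
  int listen_fd;                    /* this thread's listening reply socket */
  struct reply_thread *next;
} reply_thread;

/* Hypothetical helper that starts a new reply thread with its own listener. */
extern reply_thread *spawn_reply_thread(void);

reply_thread *pick_reply_thread(reply_thread **threads)
{
  reply_thread *best = NULL;
  for (reply_thread *t = *threads; t != NULL; t = t->next)
    if (best == NULL || t->active_sockets < best->active_sockets)
      best = t;
  if (best != NULL && best->active_sockets < MAX_SOCKETS_PER_THREAD)
    return best;                    /* reuse the least-loaded listener */
  reply_thread *fresh = spawn_reply_thread();
  fresh->next = *threads;           /* otherwise, grow the pool */
  *threads = fresh;
  return fresh;
}
```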

zack-vii avatar Apr 16 '24 05:04 zack-vii

Closing this issue because the problem is highly unlikely to arise in normal workflows.

Here are the details...

Issue #2731 was leaking sockets because a new connection was being opened for every action that was dispatched. The initial investigation of that issue incorrectly assumed 1) that a connection per action was normal behavior, and 2) that the cause of the leaked sockets was that connections were never being closed. Thus the test harness described above was simultaneously creating and closing hundreds of network connections. Unsurprisingly, there are some race conditions in that circumstance.

Note, however, that the actual issue was that connections were not being reused. (Now fixed by PR #2740.) During normal operation, mdstcl uses a single connection to dispatch actions to a specific "action service". (If there are N "action services", typically each on a different computer, then mdstcl has N network connections it uses for dispatching actions, and another N connections to receive replies from the "action services" as actions are completed.)
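
A sketch of the reuse idea described above (hypothetical names; the actual fix is in PR #2740): key the dispatch connection by action-service address and hand back the existing connection instead of opening a new one per action.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-server connection cache illustrating the idea:
 * one long-lived dispatch connection per action service. */
typedef struct server_conn {
  char *server;                 /* e.g., "host:port" of the action service */
  int conn_id;
  struct server_conn *next;
} server_conn;

static server_conn *conn_cache = NULL;

/* Hypothetical helper that opens a new connection to a server. */
extern int open_connection(const char *server);

int get_dispatch_connection(const char *server)
{
  for (server_conn *c = conn_cache; c != NULL; c = c->next)
    if (strcmp(c->server, server) == 0)
      return c->conn_id;        /* reuse: no new socket per action */
  server_conn *c = malloc(sizeof *c);
  c->server = strdup(server);
  c->conn_id = open_connection(server);
  c->next = conn_cache;
  conn_cache = c;
  return c->conn_id;
}
```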

The test harness described above is thus an extreme test case that is very unlikely to occur in normal workflows. Furthermore, the other issues spotted during the investigation of Issue #2731 have been fixed by PR #2746.

Therefore, closing this issue.

mwinkel-dev avatar Apr 24 '24 18:04 mwinkel-dev