
tpp connections: change round robin to fixed assignment of threads

Open vchlum opened this issue 2 years ago • 0 comments

Describe Bug or Feature

The pbs_comm daemon can crash when a huge number of requests (new connections) are issued to it. The crash can also be triggered by a few hundred or a few thousand parallel invocations of pbs_rmget running for a few minutes (yes, an unsupported command, but the RM protocol can expose the error in TPP). To reproduce, pbs_comm must be utilizing the CPUs. GSS encryption was also involved.

The main loop in tpp_transport.c:work() assumes that a thread-safely obtained conn is only ever used by one thread at a time.

There is a hidden race condition; honestly, I was not able to find the exact path by which the conn ends up in a different thread.

In the stack trace below, fd 519 has a corrupted conn structure because a second handle_disconnect() was called before the first one finished; by the time of this stack trace, the first handle_disconnect() is already gone. As you can see, the incorrect sock_fd later causes the crash.

(gdb) thread 5
[Switching to thread 5 (Thread 0x7f4afdb506c0 (LWP 835448))]
#0  handle_disconnect (conn=0x7f4af4020e00) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1748
1748	/tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c: No such file or directory.
(gdb) bt
#0  handle_disconnect (conn=0x7f4af4020e00) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1748
#1  0x00007f4aff734b0c in handle_incoming_data (conn=0x7f4af4020e00) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1836
#2  0x00007f4aff73435e in work (v=0x55cec68ae180) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1570
#3  0x00007f4aff579134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007f4aff5f97dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) select-frame 1
(gdb) print *conn
$1 = {sock_fd = -201326448, lasterr = -38470256, net_state = 1, ev_mask = 32586, conn_params = 0x7f4af40266e0, send_mbox = {mbox_name = "Conn_519\000", mbox_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, 
        __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\001", '\000' <repeats 22 times>, __align = 0}, mbox_queue = {head = 0x0, tail = 0x0}, max_size = 640000, 
    mbox_size = 0, mbox_eventfd = 520}, scratch = {chunk_link = {ll_prior = 0x0, ll_next = 0x0, ll_struct = 0x0}, data = 0x7f4aec5b59f0 "`\341S\354J\177", len = 8192, pos = 0x7f4aec5b59f0 "`\341S\354J\177"}, curr_send_pkt = 0x0, 
  td = 0x55cec68ae180, ctx = 0x0, extra = 0x0}

Describe Your Change

I suggest enforcing the expected condition, so that the statement "a conn is always associated with only one thread in work()" actually holds.

Previously, conns were assigned to threads round-robin. Since round-robin does not consider the real utilization of the threads anyway, there is no harm in the following solution: the same fd is always assigned to the same thread.

While investigating, some obvious errors were discovered, so fixes for them are included as well:

  • send_tok was not initialized (based on the GSS documentation, it should be)
  • in the case of PBS_GSS_ERR_CONTEXT_ESTABLISH, the old GSS context remained in the structure; it should be replaced before the return
  • the conn was freed without locking the mutex and without marking the corresponding conns_array slot as free

Link to Design Doc

Attach Test and Valgrind Logs/Output

After the fix, I tested with commands like parallel -j 500 ./pbs_rmget -m torque4.grid.cesnet.cz -p 0 ncpus -- {1..100000} for a few hours with no crash. pbs_comm was utilizing the CPUs during the test.

vchlum avatar Apr 18 '24 09:04 vchlum