The server is unable to respond to the client's second request
Problem description
Run the SCUDA server on a GPU server and run commands from another server. The first command executes normally; the server fails to respond to the second.
Environment information
CUDA_VERSION=12.6.2
DISTRO_VERSION=24.04
OS_DISTRO=ubuntu
CUDNN_TAG=cudnn
Reproduce steps
- Build an image using the example Dockerfile and start the container
- Use the command to start the server:
  ./local.sh server
- Use the same image to start the client container on another server
- Set the environment variable:
  export SCUDA_SERVER=x.x.x.x
- Use the command to start the client:
  LD_PRELOAD=./libscuda_12.6.so python3 -c "print('hello world')"
- Execute the command again
Current behavior
The first run executes normally and prints the result.
root@xxx:/home# LD_PRELOAD=./libscuda_12.6.so python3 -c "print('hello world')"
Opening connection to server
decompression required; start decompress...
decompressed return::: : 44
compared return::: : 44
//
//
//
//
//
.version 8.5
.target sm_52
.address_size 64
hello world
root@xxx:/home#
When the command is executed again, the console hangs waiting for a response from the server.
root@xxx:/home# LD_PRELOAD=./libscuda_12.6.so python3 -c "print('hello world')"
Opening connection to server
decompression required; start decompress...
decompressed return::: : 44
compared return::: : 44
//
//
//
//
//
.version 8.5
.target sm_52
.address_size 64
Only Ctrl+C can exit.
Expected behavior
The server should respond normally to every client request.
Reason for the problem
This issue may be due to incorrect use of the global variable pthread_t tid:
rpc.cpp
#include <sys/socket.h>
#include "rpc.h"
#include <iostream>
#include <string.h>
#include <unistd.h>
pthread_t tid;
void *_rpc_read_id_dispatch(void *p) {
...
}
// rpc_read_start waits for a response with a specific request id on the
// given connection. this function is used to wait for a response to a
// request that was sent with rpc_write_end.
//
// it is not necessary to call rpc_read_start() if it is the first call in
// the sequence because by convention, the handler owns the read lock on
// entry.
int rpc_dispatch(conn_t *conn, int parity) {
if (tid == 0 &&
pthread_create(&tid, nullptr, _rpc_read_id_dispatch, (void *)conn) < 0) {
return -1;
}
...
}
A new thread running _rpc_read_id_dispatch is created to perform the RPC read only when tid == 0.
On the client's first call, tid is 0, so the server responds normally: it creates a child thread to execute the read, and pthread_create sets tid to that thread's ID. tid is never reset afterwards.
When the client calls again, tid is no longer 0, so the server never creates another read thread and the request is never answered.
Repair suggestions
Remove the global variable pthread_t tid, and add a field to the conn_t struct in rpc.h:
typedef struct {
int connfd;
int request_id;
int read_id;
int write_id;
int write_op;
pthread_t read_thread;
pthread_t rpc_tid; // replace the global variable
...
} conn_t;
Then amend rpc.cpp as follows:
#include <sys/socket.h>
#include "rpc.h"
#include <iostream>
#include <string.h>
#include <unistd.h>
// pthread_t tid;
void *_rpc_read_id_dispatch(void *p) {
conn_t *conn = (conn_t *)p;
...
conn->rpc_tid = 0; // reset the thread id
return NULL;
}
// rpc_read_start waits for a response with a specific request id on the
// given connection. this function is used to wait for a response to a
// request that was sent with rpc_write_end.
//
// it is not necessary to call rpc_read_start() if it is the first call in
// the sequence because by convention, the handler owns the read lock on
// entry.
int rpc_dispatch(conn_t *conn, int parity) {
// use conn->rpc_tid instead of global tid
if (conn->rpc_tid == 0 &&
pthread_create(&conn->rpc_tid, nullptr, _rpc_read_id_dispatch, (void *)conn) < 0) {
return -1;
}
...
}
Perhaps I don't fully understand the code yet, so I'd need the author's help to confirm whether this is a bug. @kevmo314
That seems plausible; we haven't tested two simultaneous requests yet, so that may be a necessary change to support them.