
The server is unable to respond to the client's second request

Open James-Leong opened this issue 8 months ago • 1 comment

Problem description

With the SCUDA server running on a GPU server and commands issued from a client on another server, the first command executes normally, but the server does not respond to the second one.

Environment information

CUDA_VERSION=12.6.2
DISTRO_VERSION=24.04
OS_DISTRO=ubuntu
CUDNN_TAG=cudnn

Steps to reproduce

  1. Build an image using the example Dockerfile and start the container.
  2. Start the server: ./local.sh server
  3. Using the same image, start the client container on another server.
  4. Set the environment variable: export SCUDA_SERVER=x.x.x.x
  5. Start the client: LD_PRELOAD=./libscuda_12.6.so python3 -c "print('hello world')"
  6. Run the same command again.

Current behavior

The first run of the command executes normally and prints the result.

root@xxx:/home# LD_PRELOAD=./libscuda_12.6.so python3 -c "print('hello world')"
Opening connection to server
decompression required; start decompress...
decompressed return::: : 44
compared return::: : 44
//
//
//
//
//

.version 8.5
.target sm_52
.address_size 64




hello world
root@xxx:/home# 

When the command is run again, it hangs: the console keeps waiting for a response from the server.

root@xxx:/home# LD_PRELOAD=./libscuda_12.6.so python3 -c "print('hello world')"
Opening connection to server
decompression required; start decompress...
decompressed return::: : 44
compared return::: : 44
//
//
//
//
//

.version 8.5
.target sm_52
.address_size 64




The only way to exit is Ctrl+C.

Expected behavior

The server should respond normally to every client request.

Reason for the problem

This issue may be caused by incorrect use of the global variable pthread_t tid in rpc.cpp:

#include <sys/socket.h>

#include "rpc.h"
#include <iostream>
#include <string.h>
#include <unistd.h>

pthread_t tid;

void *_rpc_read_id_dispatch(void *p) {
  ...
}

// rpc_read_start waits for a response with a specific request id on the
// given connection. this function is used to wait for a response to a
// request that was sent with rpc_write_end.
//
// it is not necessary to call rpc_read_start() if it is the first call in
// the sequence because by convention, the handler owns the read lock on
// entry.
int rpc_dispatch(conn_t *conn, int parity) {
  if (tid == 0 &&
      pthread_create(&tid, nullptr, _rpc_read_id_dispatch, (void *)conn) < 0) {
    return -1;
  }

  ...
}

rpc_dispatch creates a new thread running _rpc_read_id_dispatch only when tid == 0. On the client's first call, tid is 0, so the server creates a child thread to perform the RPC read, and pthread_create stores that thread's ID in tid. When the client calls again, tid is no longer 0, so the server does not create a read thread for the new connection and the request goes unanswered.
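The guard described above can be reproduced in isolation. The following is a minimal sketch, not the SCUDA code: demo_conn, demo_reader, demo_dispatch and second_connection_is_served are hypothetical names, and an explicit boolean stands in for the tid == 0 check. It assumes that one server process handles both client runs, one after the other.

```cpp
#include <pthread.h>
#include <atomic>

// Hypothetical stand-in for conn_t; it only records whether a reader ran.
struct demo_conn {
  std::atomic<bool> reader_ran{false};
};

static pthread_t g_tid;         // process-wide, like `tid` in rpc.cpp
static bool g_tid_set = false;  // plays the role of the `tid == 0` check

static void *demo_reader(void *p) {
  static_cast<demo_conn *>(p)->reader_ran = true;
  return nullptr;
}

// Mirrors the guard in rpc_dispatch: a reader thread is created only the
// first time this is called in the process, whichever connection is passed.
// Returns 1 if a thread was created, 0 if skipped, -1 on error.
int demo_dispatch(demo_conn *conn) {
  if (g_tid_set) return 0;
  if (pthread_create(&g_tid, nullptr, demo_reader, conn) != 0) return -1;
  g_tid_set = true;
  return 1;
}

// Simulates two client runs handled one after the other by the same process
// and reports whether the second connection got a reader thread.
bool second_connection_is_served() {
  demo_conn first, second;
  if (demo_dispatch(&first) == 1) pthread_join(g_tid, nullptr);
  demo_dispatch(&second);  // guard already set: no thread is created
  return second.reader_ran.load();
}
```

In this sketch second_connection_is_served() returns false, which matches the hang seen on the second client run: nothing ever reads from the second connection.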

Suggested fix

Remove the global variable pthread_t tid and add a field to the conn_t struct in rpc.h:

typedef struct {
  int connfd;

  int request_id;
  int read_id;
  int write_id;
  int write_op;

  pthread_t read_thread;
  pthread_t rpc_tid;  // replace the global variable

  ...
} conn_t;

Then amend rpc.cpp as follows:

#include <sys/socket.h>

#include "rpc.h"
#include <iostream>
#include <string.h>
#include <unistd.h>

// pthread_t tid;

void *_rpc_read_id_dispatch(void *p) {
  conn_t *conn = (conn_t *)p;

  ...

  conn->rpc_tid = 0;  // reset the thread id
  return NULL;
}

// rpc_read_start waits for a response with a specific request id on the
// given connection. this function is used to wait for a response to a
// request that was sent with rpc_write_end.
//
// it is not necessary to call rpc_read_start() if it is the first call in
// the sequence because by convention, the handler owns the read lock on
// entry.
int rpc_dispatch(conn_t *conn, int parity) {
  // use conn->rpc_tid instead of global tid;
  // pthread_create returns 0 on success and a positive error number on failure
  if (conn->rpc_tid == 0 &&
      pthread_create(&conn->rpc_tid, nullptr, _rpc_read_id_dispatch, (void *)conn) != 0) {
    return -1;
  }

  ...
}
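A variation on the per-connection idea, offered as a sketch rather than as the project's implementation: POSIX treats pthread_t as an opaque type, so comparing it with 0 is not portable, and a separate flag on the connection avoids that. All names below (sketch_conn, sketch_reader, sketch_dispatch, sketch_close, served_connections) are hypothetical, and the real conn_t has more fields than shown here.

```cpp
#include <pthread.h>
#include <atomic>

// Hypothetical, reduced connection struct for illustration only.
struct sketch_conn {
  pthread_t rpc_tid;
  bool rpc_started = false;   // explicit flag instead of rpc_tid == 0
  std::atomic<int> reads{0};  // counts how often the reader thread ran
};

static void *sketch_reader(void *p) {
  static_cast<sketch_conn *>(p)->reads.fetch_add(1);
  return nullptr;
}

// Start the reader at most once per connection. pthread_create returns 0 on
// success and a positive error number on failure, so the check is != 0.
int sketch_dispatch(sketch_conn *conn) {
  if (conn->rpc_started) return 0;
  if (pthread_create(&conn->rpc_tid, nullptr, sketch_reader, conn) != 0) {
    return -1;
  }
  conn->rpc_started = true;
  return 0;
}

// Join the reader when the connection is torn down.
void sketch_close(sketch_conn *conn) {
  if (conn->rpc_started) {
    pthread_join(conn->rpc_tid, nullptr);
    conn->rpc_started = false;
  }
}

// Two connections in one process: each gets exactly one reader thread.
int served_connections() {
  sketch_conn a, b;
  sketch_dispatch(&a);
  sketch_dispatch(&a);  // repeated dispatch on the same connection is a no-op
  sketch_dispatch(&b);
  sketch_close(&a);
  sketch_close(&b);
  return a.reads.load() + b.reads.load();
}
```

Here served_connections() returns 2: the state lives on the connection, so a second connection is not blocked by the first. The flag is only touched by the dispatching thread in this sketch; if the reader thread itself resets the state on exit, as in the suggestion above, that write would need synchronization.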

I may not fully understand the code yet, so I need the author's help to confirm whether this is a bug. @kevmo314

James-Leong, May 07 '25 08:05

That seems plausible. We haven't tested two simultaneous requests yet, so that may be a necessary change to support them.

kevmo314, May 07 '25 11:05