[Bug] CUDA runtime error: an illegal memory access was encountered
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
Describe the bug
Model: llama2-70B. Device: A100/40G × 4. lmdeploy version: 0.0.13
The allocator object seems to have a bug. During continuous operation, errors are thrown in two situations:

- When the LlamaTritonModelInstance object is destructed, the call to allocator->free() segfaults.
- When an internal thread runs ContextDecode, the call to allocator->malloc() raises a CUDA runtime error.

Both errors appear at random while the program is running; some requests are handled normally before they occur.
The program was compiled on a machine with CUDA 11.7 and then moved to a machine with CUDA 11.3 to run. Could the mismatch in CUDA versions be the cause?
Reproduction
```cpp
void function() {
    std::vector<std::unique_ptr<AbstractTransformerModelInstance>> model_instances;
    std::vector<cudaStream_t> cuda_streams;
    std::vector<std::thread> threads;

    // Create model_instances
    model_instances.resize((size_t)gpu_count);
    cuda_streams.resize((size_t)gpu_count);
    threads.clear();
    for (int device_id = 0; device_id < gpu_count; device_id++) {
        const int rank = node_id * gpu_count + device_id;
        threads.emplace_back([this, device_id, rank, &model_instances, &cuda_streams]() {
            ft::check_cuda_error(cudaSetDevice(device_id));
            cudaStream_t stream;
            ft::check_cuda_error(cudaStreamCreate(&stream));
            cuda_streams.at(device_id) = stream;
            auto model_instance = this->model->createModelInstance(device_id, rank, stream, this->nccl_comms, nullptr);
            model_instances.at(device_id) = std::move(model_instance);
            printf("model instance %d is created \n", device_id);
            ft::print_mem_usage();
        });
    }
    for (auto& t : threads) {
        t.join();
    }

    // Build the requests

    // Inference
    threads.clear();
    for (int device_id = 0; device_id < gpu_count; device_id++) {
        threads.push_back(std::thread(threadForward,
                                      &model_instances[device_id],
                                      request_list[device_id],
                                      &output_tensors_lists[device_id],
                                      device_id,
                                      instance_comm.get(),
                                      node_id,
                                      (void*)(&lmDeployRequest)));
    }
    for (auto& t : threads) {
        t.join();
    }

    // Release model_instances
    model_instances.clear();

    // Destroy the stream handles
    for (int device_id = 0; device_id < gpu_count; device_id++) {
        ft::check_cuda_error(cudaSetDevice(device_id));
        cudaStream_t stream = cuda_streams.at(device_id);
        ft::check_cuda_error(cudaStreamDestroy(stream));
    }

    // Release the requests
}
```
The above is roughly how AbstractTransformerModelInstance is called for inference. Please take a look and tell me whether anything is wrong with it.
Environment
ubuntu-16.04
cuda-11.4
Error traceback
No response
With FT_DEBUG_LEVEL=DEBUG enabled, the error seems to occur in invokeInputIdsEmbeddingLookupPosEncoding() while executing LlamaV2::ContextDecode().
The model initialization parameters are as follows:
Here a llama2-13B model is used, on a single GPU.
The error appears when a long input sequence is fed in (input_length = 1769).
1. Error 1 seems to be caused by destroying the cudaStream handle immediately after releasing the LlamaModelInstance. But looking at the code in allocator.h:

```cpp
void free(void** ptr, bool _ = false) const {
    ...
    check_cuda_error(cudaFreeAsync(*ptr, stream_));
    cudaStreamSynchronize(stream_);
    ...
}
```

there is a synchronization wait after the device memory is freed. I am not familiar with CUDA programming, so an explanation would be appreciated.
2. Error 2 is caused by illegal input token ids (token_id < 0 or token_id >= vocab_size).
Hi @RytonLi Could you try https://github.com/InternLM/lmdeploy/releases/tag/v0.2.3, which includes https://github.com/InternLM/lmdeploy/pull/1100? If the issue is still reproducible, could you please update the reproduction steps? Thanks.