
[Bug] CUDA runtime error: an illegal memory access was encountered

Open RytonLi opened this issue 2 years ago • 3 comments

Checklist

  • [ ] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.

Describe the bug

Model: llama2-70B. Device: A100/40G × 4. lmdeploy version: 0.0.13

The allocator object seems to have a bug. During sustained operation, errors are thrown in two situations:

  1. When the LlamaTritonModelInstance object is destructed, the call to allocator->free() triggers a segmentation fault. (screenshot attached)

  2. When an internal thread executes ContextDecode and calls allocator->malloc(), a CUDA runtime error occurs. (screenshot attached)

Both errors occur randomly during operation; some requests are handled normally before they appear.

The program was compiled on a machine with CUDA 11.7 and then moved to another machine with CUDA 11.3 to run. Could the CUDA version mismatch be the cause?

Reproduction


void function() {
    std::vector<std::unique_ptr<AbstractTransformerModelInstance>> model_instances;
    std::vector<cudaStream_t> cuda_streams;
    std::vector<std::thread>  threads;

    // Create model_instances, one per GPU
    model_instances.resize((size_t)gpu_count);
    cuda_streams.resize((size_t)gpu_count);
    threads.clear();
    for (int device_id = 0; device_id < gpu_count; device_id++) {
        const int rank = node_id * gpu_count + device_id;
        threads.emplace_back([this, device_id, rank, &model_instances, &cuda_streams]() {
            ft::check_cuda_error(cudaSetDevice(device_id));
            cudaStream_t stream;
            ft::check_cuda_error(cudaStreamCreate(&stream));
            cuda_streams.at(device_id) = stream;

            auto model_instance = this->model->createModelInstance(device_id, rank, stream, this->nccl_comms, nullptr);
            model_instances.at(device_id) = std::move(model_instance);
            printf("model instance %d is created\n", device_id);
            ft::print_mem_usage();
        });
    }
    for (auto& t : threads) {
        t.join();
    }

    // Build requests

    // Inference
    threads.clear();
    for (int device_id = 0; device_id < gpu_count; device_id++) {
        threads.push_back(std::thread(threadForward,
                                      &model_instances[device_id],
                                      request_list[device_id],
                                      &output_tensors_lists[device_id],
                                      device_id,
                                      instance_comm.get(),
                                      node_id,
                                      (void*)(&lmDeployRequest)));
    }
    for (auto& t : threads) {
        t.join();
    }

    // Release model_instances
    model_instances.clear();

    // Destroy stream handles
    for (int device_id = 0; device_id < gpu_count; device_id++) {
        ft::check_cuda_error(cudaSetDevice(device_id));
        cudaStream_t stream = cuda_streams.at(device_id);
        ft::check_cuda_error(cudaStreamDestroy(stream));
    }

    // Release requests
}

The above is roughly how AbstractTransformerModelInstance is called for inference. Could you please take a look and tell me whether anything is wrong with it?

Environment

ubuntu-16.04
cuda-11.4

Error traceback

No response

RytonLi avatar Nov 16 '23 02:11 RytonLi


With FT_DEBUG_LEVEL=DEBUG enabled, the error seems to occur in invokeInputIdsEmbeddingLookupPosEncoding() during LlamaV2::ContextDecode().

The model initialization parameters are as follows (screenshot attached). This run used a llama2-13B model on a single GPU.

The error appears when a long sequence is input (input_length = 1769).

RytonLi avatar Nov 16 '23 08:11 RytonLi

  1. Error 1 seems to be caused by destroying the cudaStream handle immediately after releasing the LlamaModelInstance. But looking at the code in allocator.h:
void free(void** ptr, bool _ = false) const {
    ...
    check_cuda_error(cudaFreeAsync(*ptr, stream_));
    cudaStreamSynchronize(stream_);
    ...
}

     There is a synchronizing wait after the GPU memory is freed. I am not familiar with CUDA programming, so an explanation would be appreciated.

  2. Error 2 is caused by invalid input token ids (token_id < 0 or token_id >= vocab_size).

RytonLi avatar Nov 17 '23 01:11 RytonLi

Hi @RytonLi Could you try https://github.com/InternLM/lmdeploy/releases/tag/v0.2.3, which includes https://github.com/InternLM/lmdeploy/pull/1100? If the issue is still reproducible, could you please update the reproduction steps? Thanks.

zhyncs avatar Feb 22 '24 09:02 zhyncs