[Bug] CUDA runtime error: an illegal memory access was encountered
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
Describe the bug
Model: llama2-70B. Device: A100/40G × 4. lmdeploy version: 0.0.13
The allocator object seems to have a bug. During continuous operation, errors are thrown in two situations:

- When the LlamaTritonModelInstance object is destructed, the call to allocator->free() segfaults.
- When an internal thread runs ContextDecode, the call to allocator->malloc() raises a CUDA runtime error.

Both errors appear at random while the program is running; some requests are handled normally before they occur.
The program was compiled on a machine with CUDA 11.7 and then moved to a machine with CUDA 11.3 to run. Could the mismatch in CUDA versions be the cause?
Reproduction
```cpp
void function() {
    std::vector<std::unique_ptr<AbstractTransformerModelInstance>> model_instances;
    std::vector<cudaStream_t> cuda_streams;
    std::vector<std::thread> threads;

    // Create model_instances
    model_instances.resize((size_t)gpu_count);
    cuda_streams.resize((size_t)gpu_count);
    threads.clear();
    for (int device_id = 0; device_id < gpu_count; device_id++) {
        const int rank = node_id * gpu_count + device_id;
        threads.emplace_back([this, device_id, rank, &model_instances, &cuda_streams]() {
            ft::check_cuda_error(cudaSetDevice(device_id));
            cudaStream_t stream;
            ft::check_cuda_error(cudaStreamCreate(&stream));
            cuda_streams.at(device_id) = stream;
            auto model_instance = this->model->createModelInstance(device_id, rank, stream, this->nccl_comms, nullptr);
            model_instances.at(device_id) = std::move(model_instance);
            printf("model instance %d is created \n", device_id);
            ft::print_mem_usage();
        });
    }
    for (auto& t : threads) {
        t.join();
    }

    // Build the requests

    // Inference
    threads.clear();
    for (int device_id = 0; device_id < gpu_count; device_id++) {
        threads.push_back(std::thread(threadForward,
                                      &model_instances[device_id],
                                      request_list[device_id],
                                      &output_tensors_lists[device_id],
                                      device_id,
                                      instance_comm.get(),
                                      node_id,
                                      (void*)(&lmDeployRequest)));
    }
    for (auto& t : threads) {
        t.join();
    }

    // Release model_instances
    model_instances.clear();

    // Destroy the stream handles
    for (int device_id = 0; device_id < gpu_count; device_id++) {
        ft::check_cuda_error(cudaSetDevice(device_id));
        cudaStream_t stream = cuda_streams.at(device_id);
        ft::check_cuda_error(cudaStreamDestroy(stream));
    }

    // Release the requests
}
```
The above is roughly how AbstractTransformerModelInstance is called for inference. Please take a look and tell me whether anything is wrong with it.
Environment
ubuntu-16.04
cuda-11.4
Error traceback
No response
With FT_DEBUG_LEVEL=DEBUG enabled, the error seems to occur in invokeInputIdsEmbeddingLookupPosEncoding() while executing LlamaV2::ContextDecode().
The model initialization parameters are as follows:
Here a llama2-13B model is used, on a single GPU.
The error appears when a long input sequence is fed in (input_length = 1769).
1. Error 1 seems to be caused by destroying the cudaStream handle immediately after releasing the LlamaModelInstance. But looking at the code in allocator.h:

```cpp
void free(void** ptr, bool _ = false) const {
    ...
    check_cuda_error(cudaFreeAsync(*ptr, stream_));
    cudaStreamSynchronize(stream_);
    ...
}
```

there is a synchronization wait after the device memory is freed. I am not familiar with CUDA programming, so an explanation would be appreciated.
2. Error 2 is caused by illegal input token ids (token_id < 0 or token_id >= vocab_size).
Hi @RytonLi Could you try https://github.com/InternLM/lmdeploy/releases/tag/v0.2.3, which includes https://github.com/InternLM/lmdeploy/pull/1100? If the issue is still reproducible, could you please update the reproduction steps? Thanks.