lmdeploy icon indicating copy to clipboard operation
lmdeploy copied to clipboard

[Bug] 2卡internvl2-26b推理,卡间通信是pcie会失败,nvlink会成功,这是为啥

Open chestnut111 opened this issue 5 months ago • 12 comments

Checklist

  • [X] 1. I have searched related issues but cannot get the expected help.
  • [X] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

代码; backend_config = TurbomindEngineConfig(session_len=8192, tp=2, max_batch_size=2) pipe = pipeline(model, backend_config=backend_config) 报错; RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:246

Reproduction

python run.py

Environment

A800 80g
2卡

Error traceback

x

chestnut111 avatar Aug 30 '24 03:08 chestnut111