[Bug] Cannot start Qwen2-72B-Instruct on 8x 2080 Ti
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
8x 2080 Ti (22 GB VRAM modded cards). Qwen2-72B-Instruct runs fine with vLLM, but with this system (LMDeploy) startup always fails with an out-of-memory error.
Reproduction
The command line is as follows: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmdeploy serve api_server --model-name qwen2-72b-instruct --allow-origins * --tp 8 --log-level INFO --session-len 24000 --cache-max-entry-count 0.01 --model-format hf /root/.cache/Qwen2-72B-Instruct/
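For reference, the same engine settings can also be expressed through LMDeploy's Python API. The field names below are taken from the TurbomindEngineConfig printed in the log further down; the pipeline call itself is only a sketch for reproducing the HF-to-turbomind conversion step where the OOM occurs, not the full api_server setup.

```python
# Minimal sketch: build the same turbomind engine that the CLI command configures.
# Field names mirror the TurbomindEngineConfig shown in the log output below.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=8,                        # --tp 8
    session_len=24000,           # --session-len 24000
    cache_max_entry_count=0.01,  # --cache-max-entry-count 0.01
    model_format='hf',           # --model-format hf
)

# Building the pipeline triggers the HF -> turbomind weight conversion on GPU,
# which is the code path that fails in the traceback below.
pipe = pipeline('/root/.cache/Qwen2-72B-Instruct/', backend_config=engine_config)
print(pipe(['Hello']))
```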
Environment
root@8075578dbbe6:/opt/lmdeploy# lmdeploy check_env
sys.platform: linux
Python: 3.8.10 (default, Mar 25 2024, 10:42:49) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 11.8
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.7
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.16.0+cu118
LMDeploy: 0.5.1+9cdce39
transformers: 4.42.4
gradio: 4.38.1
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.1.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PIX NV2 PIX NODE NODE NODE NODE 0-19,40-59 0 N/A
GPU1 PIX X PIX NV2 NODE NODE NODE NODE 0-19,40-59 0 N/A
GPU2 NV2 PIX X PIX NODE NODE NODE NODE 0-19,40-59 0 N/A
GPU3 PIX NV2 PIX X NODE NODE NODE NODE 0-19,40-59 0 N/A
GPU4 NODE NODE NODE NODE X NV2 PIX PIX 0-19,40-59 0 N/A
GPU5 NODE NODE NODE NODE NV2 X PIX PIX 0-19,40-59 0 N/A
GPU6 NODE NODE NODE NODE PIX PIX X NV2 0-19,40-59 0 N/A
GPU7 NODE NODE NODE NODE PIX PIX NV2 X 0-19,40-59 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hardware environment
2x 6133 CPUs, 512 GB RAM, 10x 2080 Ti (22 GB modded)
rocky linux 8.8
docker 26.1.3
openmmlab/lmdeploy v0.5.1
Error traceback
root@8075578dbbe6:/opt/lmdeploy# CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmdeploy serve api_server --model-name qwen2-72b-instruct --allow-origins * --tp 8 --log-level INFO --session-len 24000 --cache-max-entry-count 0.01 --model-format hf /root/.cache/Qwen2-72B-Instruct/
2024-07-24 09:53:28,598 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='qwen2-72b-instruct', model_format='hf', tp=8, session_len=24000, max_batch_size=128, cache_max_entry_count=0.01, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-07-24 09:53:28,598 - lmdeploy - INFO - input chat_template_config=None
2024-07-24 09:53:29,359 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-24 09:53:29,359 - lmdeploy - INFO - model_source: hf_model
2024-07-24 09:53:29,359 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device does not support bfloat16. Set float16 forcefully
2024-07-24 09:53:29,695 - lmdeploy - INFO - model_config:
[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 8
head_num = 64
kv_head_num = 8
vocab_size = 152064
num_layer = 80
inter_size = 29568
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 24000
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.01
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 3
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 24000.
2024-07-24 09:53:33,319 - lmdeploy - WARNING - get 4643 model params
Convert to turbomind format: 0%| | 0/80 [00:00<?, ?it/s]Traceback (most recent call last):
File "/opt/py38/bin/lmdeploy", line 33, in <module>
sys.exit(load_entry_point('lmdeploy', 'console_scripts', 'lmdeploy')())
File "/opt/lmdeploy/lmdeploy/cli/entrypoint.py", line 36, in run
args.run(args)
File "/opt/lmdeploy/lmdeploy/cli/serve.py", line 315, in api_server
run_api_server(args.model_path,
File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 1291, in serve
VariableInterface.async_engine = pipeline_class(
File "/opt/lmdeploy/lmdeploy/serve/async_engine.py", line 189, in __init__
self._build_turbomind(model_path=model_path,
File "/opt/lmdeploy/lmdeploy/serve/async_engine.py", line 234, in _build_turbomind
self.engine = tm.TurboMind.from_pretrained(
File "/opt/lmdeploy/lmdeploy/turbomind/turbomind.py", line 342, in from_pretrained
return cls(model_path=pretrained_model_name_or_path,
File "/opt/lmdeploy/lmdeploy/turbomind/turbomind.py", line 144, in __init__
self.model_comm = self._from_hf(model_source=model_source,
File "/opt/lmdeploy/lmdeploy/turbomind/turbomind.py", line 259, in _from_hf
output_model.export()
File "/opt/lmdeploy/lmdeploy/turbomind/deploy/target_model/base.py", line 283, in export
self.export_misc(bin)
File "/opt/lmdeploy/lmdeploy/turbomind/deploy/target_model/base.py", line 314, in export_misc
self.export_weight(emb, 'tok_embeddings.weight')
File "/opt/lmdeploy/lmdeploy/turbomind/deploy/target_model/base.py", line 229, in export_weight
torch_tensor = param.cuda().contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB. GPU 7 has a total capacty of 21.66 GiB of which 438.81 MiB is free. Process 219629 has 21.23 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
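For reference, the 2.32 GiB allocation matches the size of the fp16 token-embedding table implied by the model_config above (a back-of-the-envelope check, assuming hidden size = head_num * size_per_head):

```python
# Back-of-the-envelope check: the tensor exported in
# export_weight(emb, 'tok_embeddings.weight') is the full fp16 embedding table.
vocab_size = 152064        # from model_config
hidden_size = 64 * 128     # head_num * size_per_head (assumed to be the hidden size)
bytes_per_fp16 = 2

emb_bytes = vocab_size * hidden_size * bytes_per_fp16
print(emb_bytes / 1024**3)  # ~2.32 GiB, matching "Tried to allocate 2.32 GiB"
```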
You may try "--max-batch-size 1". If it doesn't work, you may go for vLLM. It will take a while to optimize memory in LMDeploy. Don't let it block your work.
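As a quick sanity check before retrying, the free memory on each visible GPU can be printed with the standard torch.cuda.mem_get_info call (a diagnostic sketch, not something LMDeploy runs itself), to see how much headroom each card has when the conversion starts:

```python
# Diagnostic sketch: print free/total memory per visible GPU before launching.
import torch

for dev in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(dev)  # returned in bytes
    print(f"GPU {dev}: {free / 1024**3:.2f} GiB free / {total / 1024**3:.2f} GiB total")
```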
2024-08-01 03:24:58,259 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='qwen2-72b-instruct', model_format='hf', tp=8, session_len=24000, max_batch_size=1, cache_max_entry_count=0.01, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-08-01 03:24:58,260 - lmdeploy - INFO - input chat_template_config=None
2024-08-01 03:24:58,540 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-08-01 03:24:58,540 - lmdeploy - INFO - model_source: hf_model
2024-08-01 03:24:58,540 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device does not support bfloat16. Set float16 forcefully
2024-08-01 03:24:58,882 - lmdeploy - INFO - model_config:
[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 8
head_num = 64
kv_head_num = 8
vocab_size = 152064
num_layer = 80
inter_size = 29568
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 24000
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 1
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.01
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 3
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
[TM][WARNING] [LlamaTritonModel] max_context_token_num = 24000.
2024-08-01 03:25:02,417 - lmdeploy - WARNING - get 4643 model params
Convert to turbomind format: 0%| | 0/80 [00:00<?, ?it/s]Traceback (most recent call last):
File "/opt/py38/bin/lmdeploy", line 33, in