[Bug] Cannot start Qwen2-72B-Instruct on 8x 2080 Ti
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
8x 2080 Ti (22 GB VRAM modded cards). Qwen2-72B-Instruct runs fine with vLLM, but with this system (LMDeploy) startup always fails with an out-of-memory error.
Reproduction
The command line is as follows: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmdeploy serve api_server --model-name qwen2-72b-instruct --allow-origins * --tp 8 --log-level INFO --session-len 24000 --cache-max-entry-count 0.01 --model-format hf /root/.cache/Qwen2-72B-Instruct/
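For reference, the same engine settings can also be expressed through LMDeploy's Python API. The field names below are taken from the TurbomindEngineConfig printed in the log further down; the pipeline call itself is only a sketch for reproducing the HF-to-turbomind conversion step where the OOM occurs, not the full api_server setup.

```python
# Minimal sketch: build the same turbomind engine that the CLI command configures.
# Field names mirror the TurbomindEngineConfig shown in the log output below.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=8,                        # --tp 8
    session_len=24000,           # --session-len 24000
    cache_max_entry_count=0.01,  # --cache-max-entry-count 0.01
    model_format='hf',           # --model-format hf
)

# Building the pipeline triggers the HF -> turbomind weight conversion on GPU,
# which is the code path that fails in the traceback below.
pipe = pipeline('/root/.cache/Qwen2-72B-Instruct/', backend_config=engine_config)
print(pipe(['Hello']))
```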
Environment
root@8075578dbbe6:/opt/lmdeploy# lmdeploy check_env
sys.platform: linux
Python: 3.8.10 (default, Mar 25 2024, 10:42:49) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 11.8
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.7
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.16.0+cu118
LMDeploy: 0.5.1+9cdce39
transformers: 4.42.4
gradio: 4.38.1
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.1.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PIX NV2 PIX NODE NODE NODE NODE 0-19,40-59 0 N/A
GPU1 PIX X PIX NV2 NODE NODE NODE NODE 0-19,40-59 0 N/A
GPU2 NV2 PIX X PIX NODE NODE NODE NODE 0-19,40-59 0 N/A
GPU3 PIX NV2 PIX X NODE NODE NODE NODE 0-19,40-59 0 N/A
GPU4 NODE NODE NODE NODE X NV2 PIX PIX 0-19,40-59 0 N/A
GPU5 NODE NODE NODE NODE NV2 X PIX PIX 0-19,40-59 0 N/A
GPU6 NODE NODE NODE NODE PIX PIX X NV2 0-19,40-59 0 N/A
GPU7 NODE NODE NODE NODE PIX PIX NV2 X 0-19,40-59 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hardware environment
2x 6133 CPUs, 512 GB RAM, 10x 2080 Ti (22 GB modded)
rocky linux 8.8
docker 26.1.3
openmmlab/lmdeploy v0.5.1
Error traceback
root@8075578dbbe6:/opt/lmdeploy# CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmdeploy serve api_server --model-name qwen2-72b-instruct --allow-origins * --tp 8 --log-level INFO --session-len 24000 --cache-max-entry-count 0.01 --model-format hf /root/.cache/Qwen2-72B-Instruct/
2024-07-24 09:53:28,598 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='qwen2-72b-instruct', model_format='hf', tp=8, session_len=24000, max_batch_size=128, cache_max_entry_count=0.01, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-07-24 09:53:28,598 - lmdeploy - INFO - input chat_template_config=None
2024-07-24 09:53:29,359 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-24 09:53:29,359 - lmdeploy - INFO - model_source: hf_model
2024-07-24 09:53:29,359 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device does not support bfloat16. Set float16 forcefully
2024-07-24 09:53:29,695 - lmdeploy - INFO - model_config:
[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 8
head_num = 64
kv_head_num = 8
vocab_size = 152064
num_layer = 80
inter_size = 29568
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 24000
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.01
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 3
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 24000.
2024-07-24 09:53:33,319 - lmdeploy - WARNING - get 4643 model params
Convert to turbomind format: 0%| | 0/80 [00:00<?, ?it/s]Traceback (most recent call last):
File "/opt/py38/bin/lmdeploy", line 33, in <module>
sys.exit(load_entry_point('lmdeploy', 'console_scripts', 'lmdeploy')())
File "/opt/lmdeploy/lmdeploy/cli/entrypoint.py", line 36, in run
args.run(args)
File "/opt/lmdeploy/lmdeploy/cli/serve.py", line 315, in api_server
run_api_server(args.model_path,
File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 1291, in serve
VariableInterface.async_engine = pipeline_class(
File "/opt/lmdeploy/lmdeploy/serve/async_engine.py", line 189, in __init__
self._build_turbomind(model_path=model_path,
File "/opt/lmdeploy/lmdeploy/serve/async_engine.py", line 234, in _build_turbomind
self.engine = tm.TurboMind.from_pretrained(
File "/opt/lmdeploy/lmdeploy/turbomind/turbomind.py", line 342, in from_pretrained
return cls(model_path=pretrained_model_name_or_path,
File "/opt/lmdeploy/lmdeploy/turbomind/turbomind.py", line 144, in __init__
self.model_comm = self._from_hf(model_source=model_source,
File "/opt/lmdeploy/lmdeploy/turbomind/turbomind.py", line 259, in _from_hf
output_model.export()
File "/opt/lmdeploy/lmdeploy/turbomind/deploy/target_model/base.py", line 283, in export
self.export_misc(bin)
File "/opt/lmdeploy/lmdeploy/turbomind/deploy/target_model/base.py", line 314, in export_misc
self.export_weight(emb, 'tok_embeddings.weight')
File "/opt/lmdeploy/lmdeploy/turbomind/deploy/target_model/base.py", line 229, in export_weight
torch_tensor = param.cuda().contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB. GPU 7 has a total capacty of 21.66 GiB of which 438.81 MiB is free. Process 219629 has 21.23 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
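For reference, the 2.32 GiB allocation matches the size of the fp16 token-embedding table implied by the model_config above (a back-of-the-envelope check, assuming hidden size = head_num * size_per_head):

```python
# Back-of-the-envelope check: the tensor exported in
# export_weight(emb, 'tok_embeddings.weight') is the full fp16 embedding table.
vocab_size = 152064        # from model_config
hidden_size = 64 * 128     # head_num * size_per_head (assumed to be the hidden size)
bytes_per_fp16 = 2

emb_bytes = vocab_size * hidden_size * bytes_per_fp16
print(emb_bytes / 1024**3)  # ~2.32 GiB, matching "Tried to allocate 2.32 GiB"
```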
You may try "--max-batch-size 1". If it doesn't work, you may go for vLLM. It will take a while to optimize memory in LMDeploy. Don't let it block your work.
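As a quick sanity check before retrying, the free memory on each visible GPU can be printed with the standard torch.cuda.mem_get_info call (a diagnostic sketch, not something LMDeploy runs itself), to see how much headroom each card has when the conversion starts:

```python
# Diagnostic sketch: print free/total memory per visible GPU before launching.
import torch

for dev in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(dev)  # returned in bytes
    print(f"GPU {dev}: {free / 1024**3:.2f} GiB free / {total / 1024**3:.2f} GiB total")
```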
2024-08-01 03:24:58,259 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='qwen2-72b-instruct', model_format='hf', tp=8, session_len=24000, max_batch_size=1, cache_max_entry_count=0.01, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-08-01 03:24:58,260 - lmdeploy - INFO - input chat_template_config=None
2024-08-01 03:24:58,540 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-08-01 03:24:58,540 - lmdeploy - INFO - model_source: hf_model
2024-08-01 03:24:58,540 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device does not support bfloat16. Set float16 forcefully
2024-08-01 03:24:58,882 - lmdeploy - INFO - model_config:
[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 8
head_num = 64
kv_head_num = 8
vocab_size = 152064
num_layer = 80
inter_size = 29568
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 24000
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 1
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.01
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 3
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
[TM][WARNING] [LlamaTritonModel] max_context_token_num = 24000.
2024-08-01 03:25:02,417 - lmdeploy - WARNING - get 4643 model params
Convert to turbomind format: 0%| | 0/80 [00:00<?, ?it/s]Traceback (most recent call last):
File "/opt/py38/bin/lmdeploy", line 33, in