[Bug] Error when running glm4-9b
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
[1] 2771397 floating point exception lmdeploy serve api_server --backend turbomind --model-name chatglm4 --tp 4
Reproduction
lmdeploy serve api_server /home/mingqiang/model/model_file/origin_model/glm-4-9b-chat --backend turbomind --model-name chatglm4 --tp 4 --server-port 10000 --cache-max-entry-count 0.1
Environment
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA A800 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Debian 10.2.1-6) 10.2.1 20210110
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.17.2+cu121
LMDeploy: 0.5.1+
transformers: 4.40.1
gradio: Not Found
fastapi: 0.110.2
pydantic: 2.7.1
triton: 2.2.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PIX PIX PIX 0-27,56-83 0 N/A
GPU1 PIX X PIX PIX 0-27,56-83 0 N/A
GPU2 PIX PIX X PIX 0-27,56-83 0 N/A
GPU3 PIX PIX PIX X 0-27,56-83 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Error traceback
No response
In the glm4 model, there are only 2 key-value (KV) heads available, making it impossible to evenly partition among 4 GPUs. Please set tp=2 or tp=1.
The chat template name is supposed to be glm4 in the latest version.
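For context, the crash mode makes sense: with 2 KV heads split across 4 ranks, each rank gets 2 // 4 = 0 heads, and a later integer division by that count presumably raises SIGFPE (the "floating point exception"). Below is a minimal pre-flight sketch of the divisibility constraint (my own helper, not lmdeploy's actual validation code; it assumes glm-4-9b-chat stores its KV-head count as `multi_query_group_num` in config.json):

```python
import json

def check_tp(model_dir: str, tp: int) -> None:
    """Hypothetical check: KV heads must be evenly divisible by the TP degree."""
    with open(f"{model_dir}/config.json") as f:
        cfg = json.load(f)
    # glm-4-9b-chat uses `multi_query_group_num` (= 2); many other HF configs
    # expose the same quantity as `num_key_value_heads`.
    kv_heads = cfg.get("multi_query_group_num") or cfg.get("num_key_value_heads")
    if not kv_heads or kv_heads % tp != 0:
        raise ValueError(
            f"kv_head_num={kv_heads} is not divisible by tp={tp}; use tp=2 or tp=1")

check_tp("/home/mingqiang/model/model_file/origin_model/glm-4-9b-chat", tp=4)
```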
Excuse me, in vllm there is a parameter called 'tensor-parallel-size' that can be set to 4 or 8 to run glm-9b. What is the difference between that and 'tp'?
Sorry, I don't know how vllm implements it.
Understood. I currently have four 3080 Ti graphics cards with 12GB of VRAM each, and I want to launch the model with the lmdeploy serve api_server method. If I use --tp 4, it reports a floating point exception, and if I use --tp 2, it reports insufficient VRAM. Are there any solutions? Thank you
Have you tried "--cache-max-entry-count 0.1" when using "--tp 2"?
I tried --cache-max-entry-count 0.01 and it still failed, but "lmdeploy chat /app/models/glm-4-9b-chat --tp 2" actually works; I just can't use lmdeploy serve api_server. Thanks
Could you add --log-level INFO when you launch the server and share the error log?
CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server /app/models/glm-4-9b-chat --server-port 11434 --model-name glm4 --tp 2 --cache-max-entry-count 0.01 --log-level INFO
log.txt
Why did this 9b model use two cards, each with 70GB of VRAM = =
I used the default --cache-max-entry-count 0.8
It's an A100 80G, and your GPU is an A800 80G. The memory is quite enough to launch the service with the default value. I have no idea why it doesn't work on your side. I'd better add INFO logs when allocating memory.
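For anyone puzzled by the ~70 GB per card: with the default cache ratio, most of the free memory is deliberately reserved for the KV cache, so high usage on a big card is expected. A rough back-of-envelope sketch (my own numbers, not measured; it assumes cache_max_entry_count is treated as a fraction of the memory left after loading the weights):

```python
# Rough per-GPU memory estimate for glm-4-9b with tp=2 on 80 GB cards and the
# default cache_max_entry_count=0.8. Purely illustrative, not a measurement.
params_billion = 9.4        # approximate parameter count
bytes_per_param = 2         # bf16 weights
tp = 2
gpu_mem_gb = 80

weights_per_gpu = params_billion * bytes_per_param / tp    # ~9.4 GB
free_after_weights = gpu_mem_gb - weights_per_gpu          # ~70.6 GB
kv_cache_budget = 0.8 * free_after_weights                 # ~56.5 GB reserved up front

print(f"weights/GPU ~ {weights_per_gpu:.1f} GB, KV cache budget ~ {kv_cache_budget:.1f} GB")
```

Weights plus the pre-allocated KV cache plus activation/workspace buffers roughly add up to the ~70 GB shown per card.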
Mine is 3080 Ti 12G * 2. I suppose it's enough for a 9b model, as I can use "lmdeploy chat" to launch the model and chat. It's quite strange that "lmdeploy serve" needs so much memory.
Oh, you are not the user who opened this issue :joy:
Can you try "--max-batch-size 1" at your side? "lmdeploy chat" sets "--max-batch-size" to 1 by default, while "lmdeploy serve" makes it 128.
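If the CLI keeps fighting you, the same knobs can also be set through the Python API; a minimal sketch using the values already discussed in this thread (the path is the one from your command, everything else is just a starting point to tune):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Mirror what `lmdeploy chat` effectively does: a single-request batch and a
# small KV-cache budget, much lighter than the server's default max_batch_size=128.
engine_cfg = TurbomindEngineConfig(
    tp=2,                       # two 3080 Ti 12 GB cards
    max_batch_size=1,           # `serve` defaults to 128, `chat` to 1
    cache_max_entry_count=0.1,  # fraction of free GPU memory kept for the KV cache
)

pipe = pipeline("/app/models/glm-4-9b-chat", backend_config=engine_cfg)
print(pipe(["Hello, who are you?"]))
```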
(lmdeploy) (base) root@172-16-103-221:/app/code# CUDA_VISIBLE_DEVICES=5,6 lmdeploy serve api_server /app/models/glm-4-9b-chat --server-port 11434 --model-name glm4 --tp 2 \
--max-batch-size 1 --cache-max-entry-count 0.1 --log-level INFO
2024-07-26 12:13:24,537 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='glm4', model_format=None, tp=2, session_len=None, max_batch_size=1, cache_max_entry_count=0.1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-07-26 12:13:24,537 - lmdeploy - INFO - input chat_template_config=None
2024-07-26 12:13:24,599 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='glm4', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-26 12:13:24,599 - lmdeploy - INFO - model_source: hf_model
2024-07-26 12:13:24,599 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
2024-07-26 12:13:25,948 - lmdeploy - INFO - model_config:
[llama]
model_name = glm4
model_arch = ChatGLMModel
tensor_para_size = 2
head_num = 32
kv_head_num = 2
vocab_size = 151552
num_layer = 40
inter_size = 13696
norm_eps = 1.5625e-07
attn_bias = 1
start_id = 0
end_id = 151329
session_len = 131080
weight_type = bf16
rotary_embedding = 64
rope_theta = 5000000.0
size_per_head = 128
group_size = 0
max_batch_size = 1
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.1
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 17
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 131072
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
[TM][WARNING] [LlamaTritonModel] max_context_token_num = 131080.
2024-07-26 12:13:27,045 - lmdeploy - WARNING - get 643 model params
2024-07-26 12:13:35,806 - lmdeploy - INFO - updated backend_config=TurbomindEngineConfig(model_name='glm4', model_format=None, tp=2, session_len=None, max_batch_size=1, cache_max_entry_count=0.1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[TM][WARNING] Device 0 peer access Device 1 is not available.
[TM][WARNING] Device 1 peer access Device 0 is not available.
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [BlockManager] block_size = 1 MB
[TM][INFO] [BlockManager] block_size = 1 MB
[TM][INFO] [BlockManager] max_block_count = 115
[TM][INFO] [BlockManager] max_block_count = 115
[TM][INFO] [BlockManager] chunk_size = 115
[TM][INFO] [BlockManager] chunk_size = 115
[TM][WARNING] No enough blocks for session_len (131080), session_len truncated to 7360.
Exception in thread Thread-6:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Exception in thread Thread-7:
self.run()
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 870, in run
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/turbomind.py", line 398, in _create_model_instance
model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/turbomind.py", line 398, in _create_model_instance model_inst = self.tm_model.model_comm.create_model_instance( RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231
= =, just can't make it work
It looks like we really need to put more effort into memory management. Sorry for the inconvenience.
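Until then, a workaround that may be worth trying on the 2 x 12 GB setup (untested here; every value is a guess to tune, not a verified recipe) is to cap the context length explicitly instead of inheriting the model's 131k default, so far fewer KV-cache blocks and a smaller prefill workspace get reserved:

```python
from lmdeploy import TurbomindEngineConfig

# Hypothetical low-memory configuration; field names match the backend_config
# printed in the log above, but the numbers are guesses.
low_mem_cfg = TurbomindEngineConfig(
    tp=2,
    max_batch_size=1,
    session_len=8192,            # don't inherit the 131k default context window
    cache_max_entry_count=0.1,   # keep only a small slice of free memory for KV cache
    max_prefill_token_num=2048,  # shrink the prefill workspace as well
)
```

The corresponding CLI flags (--session-len, --max-batch-size, --cache-max-entry-count) should have the same effect when passed to `lmdeploy serve api_server`.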
May be fixed by #2201
