ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
System Info
CUDA 11.7, Python 3.10.12, GPU: Tesla V100 with 32 GB VRAM, vllm 0.5.4, vllm-flash-attn 2.6.1; everything else was installed from the requirements in basic_demo.
Who can help?
@wwewwt @Sengxian @davidlvxin @codazzy
Information
- [X] The official example scripts
- [ ] My own modified scripts
Reproduction
python openai_api_server.py
WARNING 08-27 13:59:59 _custom_ops.py:15] Failed to import from vllm._C with ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
INFO 08-27 14:00:04 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/data/ChatGLM-6B/conf/models/glm-4-9b-chat', speculative_config=None, tokenizer='/data/ChatGLM-6B/conf/models/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data/ChatGLM-6B/conf/models/glm-4-9b-chat, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-27 14:00:05 tokenizer.py:129] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 08-27 14:00:05 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-27 14:00:05 selector.py:54] Using XFormers backend.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.4.0+cu121 with CUDA 1201 (you have 2.4.0)
Python 3.10.14 (you have 3.10.12)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
/data/soft/anaconda3/envs/langchain/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/data/t/anaconda3/envs/langchain/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
Traceback (most recent call last):
File "/data/GLM-4/basic_demo/openai_api_server.py", line 683, in dtype flag in CLI, for example: --dtype=half.
Expected behavior
Hoping this bug can be fixed, thank you very much!
Hello, this bug can also be reproduced with a simple script:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4
# If you run into OOM, reduce max_model_len or increase tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # For GLM-4-9B-Chat-1M, enable the following if you run into OOM
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)

Could someone explain what is going on?
The V100 does not support bf16.
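Since Volta has no bf16 support, the offline example above can be switched to fp16 by passing dtype explicitly. A minimal sketch, assuming the same model and settings as in the reproduction code:

from vllm import LLM

llm = LLM(
    model="THUDM/glm-4-9b-chat",
    tensor_parallel_size=1,
    max_model_len=131072,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",  # skip the bf16 capability check on V100; "half" also works
)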
Is bf8 supported?
After changing dtype to float16, I get: AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
Take a look at the README and set up the environment following the requirements listed there. fp16 can run inference but is not recommended, as it may cause minor issues; bf16 is the better choice.
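For reference, the rms_norm AttributeError usually means vLLM's compiled extension (vllm._C) did not load, which is consistent with the libcudart.so.12 warning at startup (a CUDA 12 wheel running against a CUDA 11.7 toolkit). A quick sanity-check sketch, assuming a standard pip-installed environment:

import torch
print(torch.__version__, torch.version.cuda)  # CUDA version the torch wheel was built with

try:
    import vllm._C  # compiled kernels (rms_norm, etc.)
    print("vllm._C loaded")
except ImportError as err:
    # If this fails with libcudart.so.12 missing, the installed vLLM wheel
    # targets CUDA 12 and should be reinstalled to match the local toolkit.
    print("vllm._C failed to load:", err)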