ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
System Info
CUDA 11.7, Python 3.10.12, GPU: Tesla V100 with 32 GB VRAM, vllm 0.5.4, vllm-flash-attn 2.6.1; everything else was installed from the requirements in basic_demo.
Who can help?
@wwewwt @Sengxian @davidlvxin @codazzy
Information
- [X] The official example scripts
- [ ] My own modified scripts
Reproduction
python openai_api_server.py
WARNING 08-27 13:59:59 _custom_ops.py:15] Failed to import from vllm._C with ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
INFO 08-27 14:00:04 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/data/ChatGLM-6B/conf/models/glm-4-9b-chat', speculative_config=None, tokenizer='/data/ChatGLM-6B/conf/models/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data/ChatGLM-6B/conf/models/glm-4-9b-chat, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-27 14:00:05 tokenizer.py:129] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 08-27 14:00:05 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-27 14:00:05 selector.py:54] Using XFormers backend.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.4.0+cu121 with CUDA 1201 (you have 2.4.0)
Python 3.10.14 (you have 3.10.12)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
/data/soft/anaconda3/envs/langchain/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/data/t/anaconda3/envs/langchain/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
Traceback (most recent call last):
File "/data/GLM-4/basic_demo/openai_api_server.py", line 683, in dtype flag in CLI, for example: --dtype=half.
Expected behavior
Hoping this bug can be fixed, thank you very much!
Hello, this bug can also be reproduced with a simple script:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4
# If you run into OOM, reduce max_model_len or increase tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # For GLM-4-9B-Chat-1M, enable the following if you run into OOM
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)

Could someone explain what is going on?
The V100 does not support bf16.
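Since Volta has no bf16 support, the offline example above can be switched to fp16 by passing dtype explicitly. A minimal sketch, assuming the same model and settings as in the reproduction code:

from vllm import LLM

llm = LLM(
    model="THUDM/glm-4-9b-chat",
    tensor_parallel_size=1,
    max_model_len=131072,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",  # skip the bf16 capability check on V100; "half" also works
)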
Is bf8 supported?
After changing dtype to float16, I get: AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
Take a look at the README and set up the environment following the requirements listed there. fp16 can run inference but is not recommended, as it may cause minor issues; bf16 is the better choice.
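For reference, the rms_norm AttributeError usually means vLLM's compiled extension (vllm._C) did not load, which is consistent with the libcudart.so.12 warning at startup (a CUDA 12 wheel running against a CUDA 11.7 toolkit). A quick sanity-check sketch, assuming a standard pip-installed environment:

import torch
print(torch.__version__, torch.version.cuda)  # CUDA version the torch wheel was built with

try:
    import vllm._C  # compiled kernels (rms_norm, etc.)
    print("vllm._C loaded")
except ImportError as err:
    # If this fails with libcudart.so.12 missing, the installed vLLM wheel
    # targets CUDA 12 and should be reinstalled to match the local toolkit.
    print("vllm._C failed to load:", err)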