transformerlab-app
Error when running the vLLM server on a V100
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-SXM2-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half
Detailed log:
Starting VLLM Server
WARNING 04-16 12:12:41 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 04-16 12:12:41 config.py:820] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 04-16 12:12:41 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 04-16 12:12:43 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-16 12:12:43 selector.py:54] Using XFormers backend.
2025-04-16 12:12:43 | ERROR | stderr | /.transformerlab/envs/transformerlab/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2025-04-16 12:12:43 | ERROR | stderr | @torch.library.impl_abstract("xformers_flash::flash_fwd")
2025-04-16 12:12:43 | ERROR | stderr | /.transformerlab/envs/transformerlab/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2025-04-16 12:12:43 | ERROR | stderr | @torch.library.impl_abstract("xformers_flash::flash_bwd")
2025-04-16 12:12:43 | ERROR | stderr | Traceback (most recent call last):
2025-04-16 12:12:43 | ERROR | stderr |   File "<frozen runpy>", line 198, in _run_module_as_main
2025-04-16 12:12:43 | ERROR | stderr |   File "<frozen runpy>", line 88, in _run_code
2025-04-16 12:12:43 | ERROR | stderr |   File "/.transformerlab/envs/transformerlab/lib/python3.11/site-packages/fastchat/serve/vllm_worker.py", line 290, in <module>
2025-04-16 12:12:43 | ERROR | stderr |     engine = AsyncLLMEngine.from_engine_args(engine_args)
2025-04-16 12:12:43 | ERROR | stderr |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-04-16 12:12:43 | ERROR | stderr |   File "/.transformerlab/envs/transformerlab/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
2025-04-16 12:12:43 | ERROR | stderr |     engine = cls(
2025-04-16 12:12:43 | ERROR | stderr |              ^^^^
2025-04-16 12:12:43 | ERROR | stderr |   File "/.transformerlab/envs/transformerlab/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
2025-04-16 12:12:43 | ERROR | stderr |     self.engine = self._init_engine(*args, **kwargs)
2025-04-16 12:12:43 | ERROR | stderr |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-04-16 12:12:43 | ERROR | stderr |   File "/.transformerlab/envs/transformerlab/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
2025-04-16 12:12:43 | ERROR | stderr |     return engine_class(*args, **kwargs)
2025-04-16 12:12:43 | ERROR | stderr |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-04-16 12:12:43 | ERROR | stderr |   File "/.transformerlab/envs/transformerlab/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
2025-04-16 12:12:43 | ERROR | stderr |     self.model_executor = executor_class(
2025-04-16 12:12:43 | ERROR | stderr |                           ^^^^^^^^^^^^^^^
2025-04-16 12:12:43 | ERROR | stderr |   File "/.transformerlab/envs/transformerlab/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
2025-04-16 12:12:43 | ERROR | stderr |     self._init_executor()
2025-04-16 12:12:43 | ERROR | stderr |   File "/.transformerlab/envs/transformerlab/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
2025-04-16 12:12:43 | ERROR | stderr |     self.driver_worker.init_device()
2025-04-16 12:12:43 | ERROR | stderr |   File "/.transformerlab/envs/transformerlab/lib/python3.11/site-packages/vllm/worker/worker.py", line 125, in init_device
2025-04-16 12:12:43 | ERROR | stderr |     _check_if_gpu_supports_dtype(self.model_config.dtype)
2025-04-16 12:12:43 | ERROR | stderr |   File "/.transformerlab/envs/transformerlab/lib/python3.11/site-packages/vllm/worker/worker.py", line 358, in _check_if_gpu_supports_dtype
2025-04-16 12:12:43 | ERROR | stderr |     raise ValueError(
2025-04-16 12:12:43 | ERROR | stderr | ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-SXM2-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
VLLM Server exited
['/.transformerlab/envs/transformerlab/bin/python3', '-m', 'fastchat.serve.vllm_worker', '--model-path', 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B']
Error loading model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with exit code 1
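
As the traceback shows, the failure is vLLM's dtype capability check: Volta GPUs (compute capability 7.0) have no bfloat16 support, so the model has to be loaded in float16. Below is a minimal sketch of the same worker launch with the extra arguments the error message recommends; it assumes fastchat.serve.vllm_worker forwards vLLM engine flags such as --dtype, and that the Transformer Lab plugin could be patched or configured to pass them.

# Sketch only: the logged launch command plus --dtype half, which makes
# vLLM load the model in float16 and avoids the bfloat16 capability check.
import subprocess

cmd = [
    "/.transformerlab/envs/transformerlab/bin/python3",
    "-m", "fastchat.serve.vllm_worker",
    "--model-path", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "--dtype", "half",  # float16 instead of the default bfloat16
]
subprocess.run(cmd, check=True)

Float16 gives up bfloat16's wider dynamic range, but it is the standard workaround for pre-Ampere GPUs such as the V100.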
GPU info (nvidia-smi):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000000:65:01.0 Off | 0 |
| N/A 40C P0 40W / 300W | 2MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
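
For reference, the check that vLLM performs can be reproduced directly with PyTorch. A short sketch, assuming torch is available in the same transformerlab environment:

# Confirms why vLLM rejects bfloat16 on this machine.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")                   # V100 reports 7.0
print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")  # False on Volta

Bfloat16 needs compute capability 8.0 or higher (Ampere and newer), so on this GPU only float16 or float32 models will load under vLLM.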