vllm
[Bug]: Phi3 still not supported
Your current environment
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35
Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1041-nvidia-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 525.147.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7742 64-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4491.63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 64 MiB (128 instances)
L3 cache: 512 MiB (32 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] torchaudio==2.1.2
[pip3] torchvision==0.16.2
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.26.4 py310h5f9d8c6_0
[conda] numpy-base 1.26.4 py310hb5e798b_0
[conda] nvidia-nccl-cu12 2.19.3 pypi_0 pypi
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.2.1 pypi_0 pypi
[conda] torchaudio 2.1.2 py310_cu121 pytorch
[conda] torchvision 0.16.2 py310_cu121 pytorch
[conda] triton 2.2.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS
NIC3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS
NIC7 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS
NIC8 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
NIC9 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
🐛 Describe the bug
Phi-3 still does not seem to be supported after installing the latest vLLM release.
from vllm import LLM

model_id = "microsoft/Phi-3-mini-4k-instruct"
llm = LLM(model=model_id, trust_remote_code=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 11
---> 11 llm = LLM(model=model_id, trust_remote_code=True)
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py:118, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, disable_custom_all_reduce, **kwargs)
98 kwargs["disable_log_stats"] = True
99 engine_args = EngineArgs(
100 model=model,
101 tokenizer=tokenizer,
(...)
116 **kwargs,
117 )
--> 118 self.llm_engine = LLMEngine.from_engine_args(
119 engine_args, usage_context=UsageContext.LLM_CLASS)
120 self.request_counter = Counter()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py:277, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
274 executor_class = GPUExecutor
276 # Create the LLM engine.
--> 277 engine = cls(
278 **engine_config.to_dict(),
279 executor_class=executor_class,
280 log_stats=not engine_args.disable_log_stats,
281 usage_context=usage_context,
282 )
283 return engine
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py:148, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config, decoding_config, executor_class, log_stats, usage_context)
144 self.seq_counter = Counter()
145 self.generation_config_fields = _load_generation_config_dict(
146 model_config)
--> 148 self.model_executor = executor_class(
149 model_config=model_config,
150 cache_config=cache_config,
151 parallel_config=parallel_config,
152 scheduler_config=scheduler_config,
153 device_config=device_config,
154 lora_config=lora_config,
155 vision_language_config=vision_language_config,
156 speculative_config=speculative_config,
157 load_config=load_config,
158 )
160 self._initialize_kv_caches()
162 # If usage stat is enabled, collect relevant info.
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py:41, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config)
38 self.vision_language_config = vision_language_config
39 self.speculative_config = speculative_config
---> 41 self._init_executor()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:22, in GPUExecutor._init_executor(self)
16 """Initialize the worker and load the model.
17
18 If speculative decoding is enabled, we instead create the speculative
19 worker.
20 """
21 if self.speculative_config is None:
---> 22 self._init_non_spec_worker()
23 else:
24 self._init_spec_worker()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:51, in GPUExecutor._init_non_spec_worker(self)
36 self.driver_worker = Worker(
37 model_config=self.model_config,
38 parallel_config=self.parallel_config,
(...)
48 is_driver_worker=True,
49 )
50 self.driver_worker.init_device()
---> 51 self.driver_worker.load_model()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py:117, in Worker.load_model(self)
116 def load_model(self):
--> 117 self.model_runner.load_model()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py:162, in ModelRunner.load_model(self)
160 def load_model(self) -> None:
161 with CudaMemoryProfiler() as m:
--> 162 self.model = get_model(
163 model_config=self.model_config,
164 device_config=self.device_config,
165 load_config=self.load_config,
166 lora_config=self.lora_config,
167 vision_language_config=self.vision_language_config,
168 parallel_config=self.parallel_config,
169 scheduler_config=self.scheduler_config,
170 )
172 self.model_memory_usage = m.consumed_memory
173 logger.info(f"Loading model weights took "
174 f"{self.model_memory_usage / float(2**30):.4f} GB")
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, vision_language_config)
13 def get_model(
14 *, model_config: ModelConfig, load_config: LoadConfig,
15 device_config: DeviceConfig, parallel_config: ParallelConfig,
16 scheduler_config: SchedulerConfig, lora_config: Optional[LoRAConfig],
17 vision_language_config: Optional[VisionLanguageConfig]) -> nn.Module:
18 loader = get_model_loader(load_config)
---> 19 return loader.load_model(model_config=model_config,
20 device_config=device_config,
21 lora_config=lora_config,
22 vision_language_config=vision_language_config,
23 parallel_config=parallel_config,
24 scheduler_config=scheduler_config)
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:222, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, vision_language_config, parallel_config, scheduler_config)
220 with set_default_torch_dtype(model_config.dtype):
221 with torch.device(device_config.device):
--> 222 model = _initialize_model(model_config, self.load_config,
223 lora_config, vision_language_config)
224 model.load_weights(
225 self._get_weights_iterator(model_config.model,
226 model_config.revision,
(...)
229 "fall_back_to_pt_during_load",
230 True)), )
231 for _, module in model.named_modules():
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:87, in _initialize_model(model_config, load_config, lora_config, vision_language_config)
82 def _initialize_model(
83 model_config: ModelConfig, load_config: LoadConfig,
84 lora_config: Optional[LoRAConfig],
85 vision_language_config: Optional[VisionLanguageConfig]) -> nn.Module:
86 """Initialize a model with the given configurations."""
---> 87 model_class = get_model_architecture(model_config)[0]
88 linear_method = _get_linear_method(model_config, load_config)
90 return model_class(config=model_config.hf_config,
91 linear_method=linear_method,
92 **_get_model_initialization_kwargs(
93 model_class, lora_config, vision_language_config))
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py:35, in get_model_architecture(model_config)
33 if model_cls is not None:
34 return (model_cls, arch)
---> 35 raise ValueError(
36 f"Model architectures {architectures} are not supported for now. "
37 f"Supported architectures: {ModelRegistry.get_supported_archs()}")
ValueError: Model architectures ['Phi3ForCausalLM'] are not supported for now.
The release version 0.4.1 does not yet support Phi-3. You can build vLLM from source from the main branch, which does support Phi-3.
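After installing from main, a quick way to confirm the build actually registers Phi-3 before loading any weights is to query the model registry that the error message refers to. A minimal sketch (the ModelRegistry import path follows the 0.4.x source layout):

# Check whether the installed vLLM build registers the Phi-3 architecture.
from vllm.model_executor.models import ModelRegistry

print("Phi3ForCausalLM" in ModelRegistry.get_supported_archs())

If this prints False, you are still running a build without Phi-3 support and loading the model will fail with the same ValueError as above.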
When do you plan to release a 0.4.1.post version with official Phi-3 support?
Also, thanks a lot for the amazing work you’ve been doing!
+1
Phi-3 is still not supported in the main branch even though #4298 is merged. Is there an estimate for the official release date? Thanks
+1
+1
+1
+1, Phi-3 is still not supported in the main branch even though https://github.com/vllm-project/vllm/pull/4298 is merged
+1, there are still bugs with Phi-3: generation does not stop.
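If the runaway generation is caused by Phi-3's <|end|> turn terminator not being treated as an EOS token (a guess based on the model's chat template), passing it explicitly as a stop string works around it until this is fixed properly. A minimal sketch:

from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)

# Workaround, not a fix: stop on Phi-3's turn terminator and EOS text
# so generation ends at the end of the assistant turn.
params = SamplingParams(max_tokens=256, stop=["<|end|>", "<|endoftext|>"])

# Prompt formatted with the Phi-3 chat template.
prompt = "<|user|>\nWhat is vLLM?<|end|>\n<|assistant|>\n"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)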
I cannot run Phi-3 with the vllm/vllm-openai:v0.4.2 image. I am using this command:
docker run -it --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8081:8000 \
--ipc=host \
--name vllm-openai-phi3 \
vllm/vllm-openai:v0.4.2 \
--model microsoft/Phi-3-mini-128k-instruct \
--max-model-len 128000 \
--dtype float16
I have tried building the image from source (main branch), but it takes a long time (45 minutes and still not finished).
Are you still encountering this issue in the latest version?
In my setup, vllm/vllm-openai:v0.4.3 does not work at all, let alone Phi-3 :(
nvidia-smi
Fri Jun 14 07:06:10 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:06:00.0 Off | N/A |
| 54% 23C P8 12W / 350W | 3MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:07:00.0 Off | N/A |
| 54% 22C P8 13W / 350W | 3MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Logs
docker compose up
WARN[0000] Found orphan containers ([backend-vllm-openai-phi3-1]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
[+] Running 2/2
✔ Network backend_default Created 0.1s
✔ Container backend-vllm-openai-1 Created 0.1s
Attaching to backend-vllm-openai-1
backend-vllm-openai-1 | /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
backend-vllm-openai-1 | warnings.warn(
backend-vllm-openai-1 | WARNING 06-14 07:04:02 config.py:1155] Casting torch.bfloat16 to torch.float16.
backend-vllm-openai-1 | 2024-06-14 07:04:06,104 INFO worker.py:1749 -- Started a local Ray instance.
backend-vllm-openai-1 | INFO 06-14 07:04:07 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='cognitivecomputations/dolphin-2.9-llama3-8b', speculative_config=None, tokenizer='cognitivecomputations/dolphin-2.9-llama3-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=cognitivecomputations/dolphin-2.9-llama3-8b)
backend-vllm-openai-1 | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] Traceback (most recent call last):
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] return executor(*args, **kwargs)
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] self.worker = worker_class(*args, **kwargs)
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] self.model_runner = ModelRunnerClass(
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] self.attn_backend = get_attn_backend(
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] if torch.cuda.get_device_capability()[0] < 8:
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] prop = get_device_properties(device)
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] _lazy_init() # will define _get_device_properties
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] torch._C._cuda_init()
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
backend-vllm-openai-1 | Traceback (most recent call last):
backend-vllm-openai-1 | File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
backend-vllm-openai-1 | return _run_code(code, main_globals, None,
backend-vllm-openai-1 | File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
backend-vllm-openai-1 | exec(code, run_globals)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
backend-vllm-openai-1 | engine = AsyncLLMEngine.from_engine_args(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
backend-vllm-openai-1 | engine = cls(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
backend-vllm-openai-1 | self.engine = self._init_engine(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
backend-vllm-openai-1 | return engine_class(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 222, in __init__
backend-vllm-openai-1 | self.model_executor = executor_class(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 317, in __init__
backend-vllm-openai-1 | super().__init__(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
backend-vllm-openai-1 | super().__init__(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
backend-vllm-openai-1 | self._init_executor()
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
backend-vllm-openai-1 | self._init_workers_ray(placement_group)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 169, in _init_workers_ray
backend-vllm-openai-1 | self._run_workers("init_worker", all_kwargs=init_worker_all_kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
backend-vllm-openai-1 | driver_worker_output = self.driver_worker.execute_method(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
backend-vllm-openai-1 | raise e
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
backend-vllm-openai-1 | return executor(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
backend-vllm-openai-1 | self.worker = worker_class(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
backend-vllm-openai-1 | self.model_runner = ModelRunnerClass(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
backend-vllm-openai-1 | self.attn_backend = get_attn_backend(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
backend-vllm-openai-1 | backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
backend-vllm-openai-1 | if torch.cuda.get_device_capability()[0] < 8:
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
backend-vllm-openai-1 | prop = get_device_properties(device)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
backend-vllm-openai-1 | _lazy_init() # will define _get_device_properties
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
backend-vllm-openai-1 | torch._C._cuda_init()
backend-vllm-openai-1 | RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] Traceback (most recent call last):
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] return executor(*args, **kwargs)
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] return method(self, *_args, **_kwargs)
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] self.worker = worker_class(*args, **kwargs)
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] self.model_runner = ModelRunnerClass(
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] self.attn_backend = get_attn_backend(
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] if torch.cuda.get_device_capability()[0] < 8:
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] prop = get_device_properties(device)
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] _lazy_init() # will define _get_device_properties
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] torch._C._cuda_init()
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
backend-vllm-openai-1 exited with code 1
Can you try installing the latest vLLM version (0.5.0.post1)?
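To make sure the environment is actually picking up that release rather than an older install, a quick check (assuming a standard pip install):

import vllm
print(vllm.__version__)  # should print 0.5.0.post1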
@RasoulNik your issue is a driver issue; see https://github.com/vllm-project/vllm/issues/4940#issuecomment-2145117095
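For anyone hitting the same Error 804, a minimal check run inside the same container can confirm it is the host driver rather than vLLM. A sketch:

# Minimal CUDA init check: if this already fails with the same Error 804,
# the fix is on the host driver side, not in vLLM.
import torch

print("torch CUDA build:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
torch.cuda.init()  # raises the forward-compatibility error if the host driver is too old
print("device:", torch.cuda.get_device_name(0))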
Version 0.5.0.post1 does not work for me either. I am going to try @youkaichao's suggestion.
Update Confirmation
I have updated my drivers, and now everything works properly. I can use Phi-3. Thanks!
nvidia-smi
Fri Jun 14 09:00:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:06:00.0 Off | N/A |
| 53% 47C P2 149W / 350W | 21756MiB / 24576MiB | 29% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:07:00.0 Off | N/A |
| 54% 23C P8 13W / 350W | 4MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 11138 C python3 21742MiB |
+-----------------------------------------------------------------------------------------+
Docker Compose Configuration
services:
vllm-openai-phi3:
image: vllm/vllm-openai:v0.5.0.post1
environment:
- HUGGING_FACE_HUB_TOKEN=<>
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
ports:
- 8081:8000
ipc: host
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
command:
- "--model"
- "microsoft/Phi-3-mini-128k-instruct"
- "--max-model-len"
- "20000"
- "--dtype"
- "float16"