[Bug]: RuntimeError: Unknown layout
Your current environment
PyTorch version: 2.2.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.27
Python version: 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090
Nvidia driver version: 535.146.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7302 16-Core Processor
Stepping: 0
CPU MHz: 1486.662
CPU max MHz: 3000.0000
CPU min MHz: 1500.0000
BogoMIPS: 5988.92
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-3,16-19
NUMA node1 CPU(s): 4-7,20-23
NUMA node2 CPU(s): 8-11,24-27
NUMA node3 CPU(s): 12-15,28-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.2
[pip3] triton==2.2.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.2.2 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS 12-15,28-31 3 N/A
GPU1 SYS X SYS SYS 8-11,24-27 2 N/A
GPU2 SYS SYS X SYS 4-7,20-23 1 N/A
GPU3 SYS SYS SYS X 0-3,16-19 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
(vllm) root@4090:/DATA4T/text-generation-webui/vllm# python -m vllm.entrypoints.openai.api_server --model /DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ --tensor-parallel-size 4 --enforce-eager
INFO 04-15 07:27:04 pynccl.py:58] Loading nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 04-15 07:27:05 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-15 07:27:05 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, tensorizer_uri=None, verify_hash=False, encryption_keyfile=None, num_readers=1, s3_access_key_id=None, s3_secret_access_key=None, s3_endpoint=None, vllm_tensorized=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 04-15 07:27:05 config.py:225] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-04-15 07:27:07,765 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-15 07:27:08 llm_engine.py:82] Initializing an LLM engine (v0.4.0.post1) with config: model='/DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ', speculative_config=None, tokenizer='/DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(pid=1812) INFO 04-15 07:27:10 pynccl.py:58] Loading nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(pid=2303) INFO 04-15 07:27:16 pynccl.py:58] Loading nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
INFO 04-15 07:27:16 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 04-15 07:27:16 selector.py:33] Using XFormers backend.
(RayWorkerVllm pid=1969) INFO 04-15 07:27:16 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(RayWorkerVllm pid=1969) INFO 04-15 07:27:16 selector.py:33] Using XFormers backend.
(RayWorkerVllm pid=1969) [rank1]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 04-15 07:27:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1969) INFO 04-15 07:27:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
INFO 04-15 07:27:19 custom_all_reduce.py:152] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
WARNING 04-15 07:27:19 custom_all_reduce.py:58] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=1969) INFO 04-15 07:27:19 custom_all_reduce.py:152] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
(RayWorkerVllm pid=1969) WARNING 04-15 07:27:19 custom_all_reduce.py:58] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-15 07:27:23 model_runner.py:169] Loading model weights took 14.3474 GB
(RayWorkerVllm pid=1969) INFO 04-15 07:27:25 model_runner.py:169] Loading model weights took 14.3474 GB
(RayWorkerVllm pid=2303) INFO 04-15 07:27:16 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance. [repeated 2x across cluster]
(RayWorkerVllm pid=2303) INFO 04-15 07:27:16 selector.py:33] Using XFormers backend. [repeated 2x across cluster]
(RayWorkerVllm pid=2303) INFO 04-15 07:27:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
(RayWorkerVllm pid=2303) INFO 04-15 07:27:19 custom_all_reduce.py:152] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped [repeated 2x across cluster]
(RayWorkerVllm pid=2303) WARNING 04-15 07:27:19 custom_all_reduce.py:58] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 2x across cluster]
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] Traceback (most recent call last):
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/engine/ray_utils.py", line 43, in execute_method
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return executor(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/worker/worker.py", line 134, in determine_num_available_blocks
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] self.model_runner.profile_run()
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/worker/model_runner.py", line 918, in profile_run
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] self.execute_model(seqs, kv_caches)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/worker/model_runner.py", line 839, in execute_model
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states = model_executable(**execute_model_kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 320, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states = self.model(input_ids, positions, kv_caches,
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 286, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states, residual = layer(
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 243, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states_attention = self.self_attn(
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 208, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] qkv, _ = self.qkv_proj(hidden_states)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/layers/linear.py", line 218, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] output_parallel = self.linear_method.apply_weights(self, input, bias)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/layers/quantization/gptq.py", line 214, in apply_weights
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] output = ops.gptq_gemm(reshaped_x, layer.qweight, layer.qzeros,
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/_custom_ops.py", line 133, in gptq_gemm
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return vllm_ops.gptq_gemm(a, b_q_weight, b_gptq_qzeros, b_gptq_scales,
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] RuntimeError: Unknown layout
Traceback (most recent call last):
File "/root/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/DATA4T/text-generation-webui/vllm/vllm/entrypoints/openai/api_server.py", line 157, in
@zzlgreat same error for me too.
I get the same first line of the error above: "Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution". If I set the environment variable NCCL_SOCKET_IFNAME=eth0, vLLM loads with no errors.
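For reference, a minimal way to try that workaround is to set the variable on the same launch command shown above (this assumes eth0 is the correct network interface on your host; substitute your own interface name):

NCCL_SOCKET_IFNAME=eth0 python -m vllm.entrypoints.openai.api_server \
    --model /DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ \
    --tensor-parallel-size 4 \
    --enforce-eager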
Same error for me on amd64. Maybe some CUDA toolkit component is needed? Or maybe amd64 is simply not supported...