[Bug]: RuntimeError: Unknown layout
Your current environment
PyTorch version: 2.2.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.27
Python version: 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090
Nvidia driver version: 535.146.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7302 16-Core Processor
Stepping: 0
CPU MHz: 1486.662
CPU max MHz: 3000.0000
CPU min MHz: 1500.0000
BogoMIPS: 5988.92
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-3,16-19
NUMA node1 CPU(s): 4-7,20-23
NUMA node2 CPU(s): 8-11,24-27
NUMA node3 CPU(s): 12-15,28-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.2
[pip3] triton==2.2.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.2.2 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS 12-15,28-31 3 N/A
GPU1 SYS X SYS SYS 8-11,24-27 2 N/A
GPU2 SYS SYS X SYS 4-7,20-23 1 N/A
GPU3 SYS SYS SYS X 0-3,16-19 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
(vllm) root@4090:/DATA4T/text-generation-webui/vllm# python -m vllm.entrypoints.openai.api_server --model /DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ --tensor-parallel-size 4 --enforce-eager
INFO 04-15 07:27:04 pynccl.py:58] Loading nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 04-15 07:27:05 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-15 07:27:05 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, tensorizer_uri=None, verify_hash=False, encryption_keyfile=None, num_readers=1, s3_access_key_id=None, s3_secret_access_key=None, s3_endpoint=None, vllm_tensorized=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 04-15 07:27:05 config.py:225] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-04-15 07:27:07,765 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-15 07:27:08 llm_engine.py:82] Initializing an LLM engine (v0.4.0.post1) with config: model='/DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ', speculative_config=None, tokenizer='/DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(pid=1812) INFO 04-15 07:27:10 pynccl.py:58] Loading nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(pid=2303) INFO 04-15 07:27:16 pynccl.py:58] Loading nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
INFO 04-15 07:27:16 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 04-15 07:27:16 selector.py:33] Using XFormers backend.
(RayWorkerVllm pid=1969) INFO 04-15 07:27:16 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(RayWorkerVllm pid=1969) INFO 04-15 07:27:16 selector.py:33] Using XFormers backend.
(RayWorkerVllm pid=1969) [rank1]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 04-15 07:27:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1969) INFO 04-15 07:27:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
INFO 04-15 07:27:19 custom_all_reduce.py:152] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
WARNING 04-15 07:27:19 custom_all_reduce.py:58] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=1969) INFO 04-15 07:27:19 custom_all_reduce.py:152] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
(RayWorkerVllm pid=1969) WARNING 04-15 07:27:19 custom_all_reduce.py:58] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-15 07:27:23 model_runner.py:169] Loading model weights took 14.3474 GB
(RayWorkerVllm pid=1969) INFO 04-15 07:27:25 model_runner.py:169] Loading model weights took 14.3474 GB
(RayWorkerVllm pid=2303) INFO 04-15 07:27:16 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance. [repeated 2x across cluster]
(RayWorkerVllm pid=2303) INFO 04-15 07:27:16 selector.py:33] Using XFormers backend. [repeated 2x across cluster]
(RayWorkerVllm pid=2303) INFO 04-15 07:27:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
(RayWorkerVllm pid=2303) INFO 04-15 07:27:19 custom_all_reduce.py:152] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped [repeated 2x across cluster]
(RayWorkerVllm pid=2303) WARNING 04-15 07:27:19 custom_all_reduce.py:58] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 2x across cluster]
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] Traceback (most recent call last):
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/engine/ray_utils.py", line 43, in execute_method
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return executor(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/worker/worker.py", line 134, in determine_num_available_blocks
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] self.model_runner.profile_run()
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/worker/model_runner.py", line 918, in profile_run
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] self.execute_model(seqs, kv_caches)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/worker/model_runner.py", line 839, in execute_model
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states = model_executable(**execute_model_kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 320, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states = self.model(input_ids, positions, kv_caches,
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 286, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states, residual = layer(
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 243, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states_attention = self.self_attn(
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 208, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] qkv, _ = self.qkv_proj(hidden_states)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/layers/linear.py", line 218, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] output_parallel = self.linear_method.apply_weights(self, input, bias)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/layers/quantization/gptq.py", line 214, in apply_weights
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] output = ops.gptq_gemm(reshaped_x, layer.qweight, layer.qzeros,
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/_custom_ops.py", line 133, in gptq_gemm
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return vllm_ops.gptq_gemm(a, b_q_weight, b_gptq_qzeros, b_gptq_scales,
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] RuntimeError: Unknown layout
Traceback (most recent call last):
File "/root/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/DATA4T/text-generation-webui/vllm/vllm/entrypoints/openai/api_server.py", line 157, in
@zzlgreat same error for me too.
I get the same first line of the error above: "Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution". If I set the environment variable NCCL_SOCKET_IFNAME=eth0, vLLM loads with no errors.
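For reference, a minimal way to try that workaround is to set the variable on the same launch command shown above (this assumes eth0 is the correct network interface on your host; substitute your own interface name):

NCCL_SOCKET_IFNAME=eth0 python -m vllm.entrypoints.openai.api_server \
    --model /DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ \
    --tensor-parallel-size 4 \
    --enforce-eager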
Same error for me on amd64. Maybe some CUDA toolkit component is needed? Or maybe amd64 is simply not supported...