vllm [Bug]: `facebook/chameleon-30b` triggers assertion error while loading weights

Your current environment

The output of `python collect_env.py`

PyTorch version: 2.4.0+cu121                                                                                                           
Is debug build: False                                                                                                                  
CUDA used to build PyTorch: 12.1                                                                                                       
ROCM used to build PyTorch: N/A                                    
                                                                   
OS: Ubuntu 20.04.6 LTS (x86_64)                                                                                                        
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0                                                                                     
Clang version: Could not collect                                                                                                       
CMake version: version 3.30.2                                                                                                          
Libc version: glibc-2.31                                                                                                               
                                                                                                                                       
Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-1023-aws-x86_64-with-glibc2.31
Is CUDA available: True                                            
CUDA runtime version: 12.4.131                                     
CUDA_MODULE_LOADING set to: LAZY                                   
GPU models and configuration:                                      
GPU 0: NVIDIA A100-SXM4-40GB                                       
GPU 1: NVIDIA A100-SXM4-40GB                                       
                                                                   
Nvidia driver version: 550.90.07                                   
cuDNN version: Could not collect                                   
HIP runtime version: N/A                                                                                                               
MIOpen runtime version: N/A                                                                                                            
Is XNNPACK available: True                                         
                                                                                                                                       
CPU:                                                               
Architecture:                       x86_64                                                                                             
CPU op-mode(s):                     32-bit, 64-bit         
Byte Order:                         Little Endian          
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             96                                                                                                 
On-line CPU(s) list:                0-95                                                                                               
Thread(s) per core:                 2           
Core(s) per socket:                 24                                                                                                 
Socket(s):                          2                                                                                                  
NUMA node(s):                       2                                                                                                  
Vendor ID:                          GenuineIntel                                                                                       
CPU family:                         6                                                                                                  
Model:                              85                                                                                                 
Model name:                         Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz                                                     
Stepping:                           7
CPU MHz:                            1257.578
BogoMIPS:                           5999.99
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1.5 MiB
L1i cache:                          1.5 MiB
L2 cache:                           48 MiB
L3 cache:                           71.5 MiB
NUMA node0 CPU(s):                  0-23,48-71
NUMA node1 CPU(s):                  24-47,72-95
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status                                                            
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported                                                                   
Vulnerability L1tf:                 Mitigation; PTE Inversion                                                                          
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI                
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Vulnerable                     
Vulnerability Spec rstack overflow: Not affected                   
Vulnerability Spec store bypass:    Vulnerable                     
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                Not affected                   
Vulnerability Tsx async abort:      Not affected                   
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
                                                                                                                                       
Versions of relevant libraries:                                    
[pip3] flashinfer==0.1.2+cu121torch2.4                     
[pip3] numpy==1.26.4                                                                                                                   
[pip3] nvidia-nccl-cu12==2.20.5                                                                                                        
[pip3] pyzmq==26.1.0                                                                                                                   
[pip3] torch==2.4.0                                                
[pip3] torchvision==0.19.0                                                                                                             
[pip3] transformers==4.44.0                                                                                                            
[pip3] triton==3.0.0                                                                                                                   
[conda] Could not collect                                                                                                              
ROCM Version: Could not collect                                                                                                        
Neuron SDK Version: N/A                                                                                                                
vLLM Version: 0.5.4                                                                                                                    
vLLM Build Flags:                                                  
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:                                                      
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID        
GPU0     X      NV12    0-23,48-71      0               N/A        
GPU1    NV12     X      0-23,48-71      0               N/A        
                                                                   
Legend:                                                            
                                                                   
  X    = Self                                                      
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node                           
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)                                                  
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge                                                                            
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I am running vLLM v0.5.4 with tensor parallelism (TP) degree 2. `facebook/chameleon-7b` loads and runs fine (that one with TP 1), but `facebook/chameleon-30b` fails during weight loading while reading the first safetensors shard. The parameter that triggers the assertion is `model.layers.3.self_attn.k_norm.bias`.

Traceback (most recent call last):                                                                                                     
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap                                                       
    self.run()                                                                                                                         
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run                                                              
    self._target(*self._args, **self._kwargs)                                                                                          
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server                    
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)                                                              
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__                           
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,                                                                   
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 484, in from_engine_args                        
    engine = cls(                                                                                                                      
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 394, in __init__                                
    self.engine = self._init_engine(*args, **kwargs)                                                                                   
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 565, in _init_engine
    return engine_class(*args, **kwargs)                                                                                               
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 274, in __init__                                      
    self.model_executor = executor_class(                                                                                              
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__                        
    super().__init__(*args, **kwargs)                                                                                                  
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)                                                                                                  
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()                                                                                                              
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 138, in _init_executor                  
    self._run_workers("load_model",                                                                                                    
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers                    
    driver_worker_output = driver_worker_method(*args, **kwargs)                                                                       
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()                                 
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model                   
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model                   
    model.load_weights(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chameleon.py", line 1048, in load_weights
    weight_loader(param, loaded_weight)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 470, in default_weight_loader
    assert param.size() == loaded_weight.size()
AssertionError
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
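
The bare `assert param.size() == loaded_weight.size()` in the traceback does not say which two sizes differ. A minimal sketch of a shape comparison that reports both sides is shown below. The helper `find_shape_mismatches` is hypothetical and not part of vLLM, and the shapes in the example are made up for illustration; they are not the real `chameleon-30b` shapes.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    param_shapes: Dict[str, Shape],
    checkpoint_shapes: Dict[str, Shape],
) -> List[str]:
    """Describe every checkpoint tensor whose shape differs from the model's.

    Hypothetical helper for debugging; both arguments map parameter names
    to shapes (for example, collected from the model and from a shard).
    """
    problems = []
    for name, loaded in checkpoint_shapes.items():
        expected = param_shapes.get(name)
        if expected is None:
            problems.append(f"{name}: not found in model")
        elif expected != loaded:
            problems.append(
                f"{name}: model expects {expected}, checkpoint has {loaded}"
            )
    return problems


# Illustrative shapes only (assumed, not measured).
model_side = {"model.layers.3.self_attn.k_norm.bias": (4, 128)}
checkpoint_side = {"model.layers.3.self_attn.k_norm.bias": (8, 128)}

for line in find_shape_mismatches(model_side, checkpoint_side):
    print(line)
```

Running a comparison like this on the parameter named above would show whether the mismatch is consistent with the parameter being sized per tensor-parallel rank while the checkpoint tensor covers all ranks, which is only one possible explanation.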

Aug 10 '24 08:08 jaywonchung

cc @ywang96

Aug 10 '24 15:08 DarkLight1337

Please check if #7410 works on your end.

Aug 12 '24 11:08 DarkLight1337