vllm
open_llama: ValueError: head_size (100) is not supported. Supported head sizes: [64, 80, 96, 128]
I start vLLM with the "openlm-research/open_llama_3b" model:
from vllm import LLM, SamplingParams
import os
prompts = [
"The president of the United States is"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="/home/ubuntu/wangjibo/models/openlm-research_open_llama_3b/", gpu_memory_utilization=0.3)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The model directory contents:
total 6.4G
drwxr-xr-x 2 ubuntu ubuntu 4.0K Jul 3 12:56 ./
drwxrwxr-x 6 ubuntu ubuntu 4.0K Jul 3 12:14 ../
-rw-r--r-- 1 ubuntu ubuntu 506 Jul 3 12:14 config.json
-rw-r--r-- 1 ubuntu ubuntu 137 Jul 3 12:14 generation_config.json
-rw-r--r-- 1 ubuntu ubuntu 289 Jul 3 12:56 huggingface-metadata.txt
-rw-r--r-- 1 ubuntu ubuntu 6.4G Jul 3 12:56 pytorch_model.bin
-rw-r--r-- 1 ubuntu ubuntu 11K Jul 3 12:14 README.md
-rw-r--r-- 1 ubuntu ubuntu 330 Jul 3 12:14 special_tokens_map.json
-rw-r--r-- 1 ubuntu ubuntu 593 Jul 3 12:14 tokenizer_config.json
-rw-r--r-- 1 ubuntu ubuntu 522K Jul 3 12:14 tokenizer.model
It fails with the following error:
(myenv) ubuntu@ubuntu:~/wangjibo/vllm-py$ python batch-inference.py
INFO 07-03 15:24:14 llm_engine.py:59] Initializing an LLM engine with config: model='/home/ubuntu/wangjibo/models/openlm-research_open_llama_3b/', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 07-03 15:24:14 tokenizer_utils.py:22] OpenLLaMA models do not support the fast tokenizer. Using the slow tokenizer instead.
Traceback (most recent call last):
File "batch-inference.py", line 10, in <module>
llm = LLM(model="/home/ubuntu/wangjibo/models/openlm-research_open_llama_3b/", gpu_memory_utilization=0.3)
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 55, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 151, in from_engine_args
engine = cls(*engine_configs, distributed_init_method, devices,
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 93, in __init__
worker = worker_cls(
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/worker/worker.py", line 45, in __init__
self.model = get_model(model_config)
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 39, in get_model
model = model_class(model_config.hf_config)
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 215, in __init__
self.model = LlamaModel(config)
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in __init__
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in <listcomp>
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 132, in __init__
self.self_attn = LlamaAttention(
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 107, in __init__
self.attn = PagedAttentionWithRoPE(self.num_heads, self.head_dim,
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/model_executor/layers/attention.py", line 179, in __init__
super().__init__(num_heads, head_size, scale)
File "/home/ubuntu/miniconda3/envs/myenv/lib/python3.8/site-packages/vllm/model_executor/layers/attention.py", line 52, in __init__
raise ValueError(f"head_size ({self.head_size}) is not supported. "
ValueError: head_size (100) is not supported. Supported head sizes: [64, 80, 96, 128].
How can I fix it?
Duplicate issue: #302
Having the same problem with openllama-3b.
Hi @jibowang, thanks for raising the issue and apologies for the late response. Unfortunately, head size 100 is not supported by xformers. The library requires the head size to be a multiple of 8.
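For reference, the unsupported head size comes directly from the model config: open_llama_3b uses hidden_size=3200 with 32 attention heads, so the per-head dimension is 3200 / 32 = 100. A minimal sketch to verify this yourself (assuming the standard Hugging Face LLaMA config fields hidden_size and num_attention_heads):

# Sketch: compute the head size vLLM derives from the HF config.
# Values for open_llama_3b are assumed from its config.json
# (hidden_size=3200, num_attention_heads=32).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("openlm-research/open_llama_3b")
head_dim = config.hidden_size // config.num_attention_heads
print(head_dim)  # 3200 // 32 = 100 -> not in vLLM's supported sizes [64, 80, 96, 128]

Any 3B checkpoint with the same hidden_size/head-count combination will hit the same check.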
I had the same issue when trying to load psmathur/orca_mini_3b. I don't think the 3B LLaMA models are currently supported.