[WIP] Add Falcon
Only works for Falcon-7B for now. The Falcon-40B model generates garbage outputs. Needs debugging.
@WoosukKwon Thank you for your work on this!
I've been trying it out and am getting some weird output; perhaps I'm doing something wrong, but I thought it was worth reporting.
I'm running with: python3 -m vllm.entrypoints.api_server --model tiiuae/falcon-7b
And get:
curl localhost:8000/generate -d '{
"prompt": "The future of AI is",
"max_tokens":128,
"temperature":0.5,
"top_p":0.95
}'
{"text": ["The future of AI is here.\nAnd it\u2019s time to make sure you\u2019re ready.\nReady for the future is now.\nFrom the business leaders around the world\u2019s top AI is the biggest revolutionizing the way we\u2019s a big data-driven world of the future.\nThe future is now.\nThe future is here.\nThe future of the\nThe AI\u2019s.\nThe future.\nThe future.\nThe future is here.\nThe future is now.\nThe future is here.\nThe future is now.\n
I have tried different SamplingParams but couldn't get much more sensible output.
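For anyone who wants to reproduce this outside the HTTP server, here is a rough sketch of the same experiment through vLLM's offline Python API (the prompt and sampling values are just the ones from the curl example above, nothing tuned):

from vllm import LLM, SamplingParams

# Same checkpoint and sampling settings as the api_server / curl example above.
llm = LLM(model="tiiuae/falcon-7b")
params = SamplingParams(temperature=0.5, top_p=0.95, max_tokens=128)

outputs = llm.generate(["The future of AI is"], params)
for output in outputs:
    print(output.outputs[0].text)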
@WoosukKwon One more thing I wanted to check in on: I'm unable to run with tensor_parallel_size > 1; not sure if this will be addressed at a later time:
python3 -m vllm.entrypoints.api_server --model tiiuae/falcon-7b --tensor-parallel-size 4
2023-07-03 21:25:47,225 INFO worker.py:1636 -- Started a local Ray instance.
INFO 07-03 21:25:47 llm_engine.py:60] Initializing an LLM engine with config: model='tiiuae/falcon-7b', tokenizer='tiiuae/falcon-7b', tokenizer_mode=auto, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
Traceback (most recent call last):
File "/opt/conda/envs/vllm-f/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/vllm-f/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/git/vllm/vllm/entrypoints/api_server.py", line 82, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/opt/git/vllm/vllm/engine/async_llm_engine.py", line 212, in from_engine_args
engine = cls(engine_args.worker_use_ray,
File "/opt/git/vllm/vllm/engine/async_llm_engine.py", line 49, in __init__
self.engine = engine_class(*args, **kwargs)
File "/opt/git/vllm/vllm/engine/llm_engine.py", line 79, in __init__
self._verify_args()
File "/opt/git/vllm/vllm/engine/llm_engine.py", line 112, in _verify_args
self.model_config.verify_with_parallel_config(self.parallel_config)
File "/opt/git/vllm/vllm/config.py", line 72, in verify_with_parallel_config
raise ValueError(
ValueError: Total number of attention heads (71) must be divisible by tensor parallel size (4).
Hi @ImranL, thanks for trying it out. Currently this PR is not ready and may have some bugs. We are working on this (and other MQA models).
That being said, the error message "Total number of attention heads (71) must be divisible by tensor parallel size (4)." is correct. vLLM currently requires the number of attention heads to be divisible by the number of GPUs you use (the tensor parallel size). This may be fixed later, but not in this PR.
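For context, the check that raises this error is roughly the following (a simplified sketch of ModelConfig.verify_with_parallel_config in vllm/config.py; argument names here are approximate, not the exact code):

# Simplified sketch of the divisibility check in vllm/config.py.
# Argument names are approximate, not the exact implementation.
def verify_with_parallel_config(total_num_attention_heads: int,
                                tensor_parallel_size: int) -> None:
    if total_num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({total_num_attention_heads}) "
            f"must be divisible by tensor parallel size "
            f"({tensor_parallel_size}).")

So the tensor parallel size you pass has to evenly divide the model's head count.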
@WoosukKwon Any update on when this will be available? I've tried it for Falcon-7B, and the output doesn't seem right.
@AbdurNawaz @ImranL Please check out #592, a new PR that includes correct implementation of Falcon models.
Thanks @zhuohan123. I've tried #592 for falcon-7b-instruct and it's working well.
Closed as #592 implements Falcon in a nicer way.