[WIP] Add Falcon
Only works for Falcon-7B for now. The Falcon-40B model generates garbage outputs. Needs debugging.
@WoosukKwon Thank you for your work on this!
I've been trying it out and am getting some weird output; perhaps I'm doing something wrong, but I thought it was worth reporting.
I'm running with: python3 -m vllm.entrypoints.api_server --model tiiuae/falcon-7b
And get:
curl localhost:8000/generate -d '{
"prompt": "The future of AI is",
"max_tokens":128,
"temperature":0.5,
"top_p":0.95
}'
{"text": ["The future of AI is here.\nAnd it\u2019s time to make sure you\u2019re ready.\nReady for the future is now.\nFrom the business leaders around the world\u2019s top AI is the biggest revolutionizing the way we\u2019s a big data-driven world of the future.\nThe future is now.\nThe future is here.\nThe future of the\nThe AI\u2019s.\nThe future.\nThe future.\nThe future is here.\nThe future is now.\nThe future is here.\nThe future is now.\n
I have tried different SamplingParams but couldn't get much more sensible output.
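For anyone who wants to reproduce this outside the HTTP server, here is a rough sketch of the same experiment through vLLM's offline Python API (the prompt and sampling values are just the ones from the curl example above, nothing tuned):

from vllm import LLM, SamplingParams

# Same checkpoint and sampling settings as the api_server / curl example above.
llm = LLM(model="tiiuae/falcon-7b")
params = SamplingParams(temperature=0.5, top_p=0.95, max_tokens=128)

outputs = llm.generate(["The future of AI is"], params)
for output in outputs:
    print(output.outputs[0].text)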
@WoosukKwon One more thing I wanted to check in on: I'm unable to run with tensor_parallel_size > 1; not sure if this will be addressed at a later time:
python3 -m vllm.entrypoints.api_server --model tiiuae/falcon-7b --tensor-parallel-size 4
2023-07-03 21:25:47,225 INFO worker.py:1636 -- Started a local Ray instance.
INFO 07-03 21:25:47 llm_engine.py:60] Initializing an LLM engine with config: model='tiiuae/falcon-7b', tokenizer='tiiuae/falcon-7b', tokenizer_mode=auto, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
Traceback (most recent call last):
File "/opt/conda/envs/vllm-f/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/vllm-f/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/git/vllm/vllm/entrypoints/api_server.py", line 82, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/opt/git/vllm/vllm/engine/async_llm_engine.py", line 212, in from_engine_args
engine = cls(engine_args.worker_use_ray,
File "/opt/git/vllm/vllm/engine/async_llm_engine.py", line 49, in __init__
self.engine = engine_class(*args, **kwargs)
File "/opt/git/vllm/vllm/engine/llm_engine.py", line 79, in __init__
self._verify_args()
File "/opt/git/vllm/vllm/engine/llm_engine.py", line 112, in _verify_args
self.model_config.verify_with_parallel_config(self.parallel_config)
File "/opt/git/vllm/vllm/config.py", line 72, in verify_with_parallel_config
raise ValueError(
ValueError: Total number of attention heads (71) must be divisible by tensor parallel size (4).
Hi @ImranL, thanks for trying it out. Currently this PR is not ready and may have some bugs. We are working on this (and other MQA models).
That being said, the error message "Total number of attention heads (71) must be divisible by tensor parallel size (4)." is correct. vLLM currently requires the number of attention heads to be divisible by the number of GPUs you use (the tensor parallel size). This may be fixed later, but not in this PR.
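For context, the check that raises this error is roughly the following (a simplified sketch of ModelConfig.verify_with_parallel_config in vllm/config.py; argument names here are approximate, not the exact code):

# Simplified sketch of the divisibility check in vllm/config.py.
# Argument names are approximate, not the exact implementation.
def verify_with_parallel_config(total_num_attention_heads: int,
                                tensor_parallel_size: int) -> None:
    if total_num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({total_num_attention_heads}) "
            f"must be divisible by tensor parallel size "
            f"({tensor_parallel_size}).")

So the tensor parallel size you pass has to evenly divide the model's head count.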
@WoosukKwon Any update on when this will be available? I've tried it for Falcon-7B, and the output doesn't seem right.
@AbdurNawaz @ImranL Please check out #592, a new PR that includes correct implementation of Falcon models.
Thanks @zhuohan123. I've tried #592 for falcon-7b-instruct and it's working well.
Closed as #592 implements Falcon in a nicer way.