[Model] Add support for 360zhinao
Add support for 360zhinao model
We released the 360Zhinao model series:
- 360Zhinao-7B-Base
- 360Zhinao-7B-Chat-4K
- 360Zhinao-7B-Chat-32K
- 360Zhinao-7B-Chat-360K
Notable features of our 360Zhinao models are:
- Base Model: Leveraging a high-quality corpus of 3.4 trillion tokens consisting of mainly Chinese, English and code, we achieved competitive performance on relevant benchmarks against other 7B models.
- Chat Models: Powerful chat capabilities and three context lengths of 4K, 32K and 360K. 360K (around 500k Chinese characters) is the longest context length among open-source Chinese models at the time of release (Apr. 11, 2024).
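For reference, once this lands the model should work with vLLM's usual entry points. A minimal usage sketch (not part of this PR; it assumes the checkpoint loads with trust_remote_code, and the prompt/stop strings follow the ChatML format discussed below):

from vllm import LLM, SamplingParams

# Usage sketch; model name and stop strings are taken from this PR discussion.
llm = LLM(model="qihoo360/360Zhinao-7B-Chat-4K", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=128, stop=["<|im_end|>", "<eod>"])
prompt = "<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\n"
print(llm.generate([prompt], params)[0].outputs[0].text)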
@simon-mo can you help us review the code?
Getting
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/xmo/vllm/vllm/entrypoints/openai/api_server.py", line 157, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 347, in from_engine_args
engine = cls(
File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 421, in _init_engine
return engine_class(*args, **kwargs)
File "/home/xmo/vllm/vllm/engine/llm_engine.py", line 121, in __init__
self.model_executor = executor_class(
File "/home/xmo/vllm/vllm/executor/gpu_executor.py", line 39, in __init__
self._init_worker()
File "/home/xmo/vllm/vllm/executor/gpu_executor.py", line 66, in _init_worker
self.driver_worker.load_model()
File "/home/xmo/vllm/vllm/worker/worker.py", line 113, in load_model
self.model_runner.load_model()
File "/home/xmo/vllm/vllm/worker/model_runner.py", line 158, in load_model
self.model = get_model(
File "/home/xmo/vllm/vllm/model_executor/model_loader.py", line 58, in get_model
model_class = _get_model_architecture(model_config)[0]
File "/home/xmo/vllm/vllm/model_executor/model_loader.py", line 41, in _get_model_architecture
model_cls = ModelRegistry.load_model_cls(arch)
File "/home/xmo/vllm/vllm/model_executor/models/__init__.py", line 99, in load_model_cls
module = importlib.import_module(
File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/home/xmo/vllm/vllm/model_executor/models/zhinao.py", line 43, in <module>
from vllm.model_executor.parallel_utils.parallel_state import (
ModuleNotFoundError: No module named 'vllm.model_executor.parallel_utils.parallel_state'
On
python -m vllm.entrypoints.openai.api_server --model qihoo360/360Zhinao-7B-Chat-4K --trust-remote-code
This branch works with vLLM 0.4.0. I will merge these two new refactors:
- [Core] Refactor model loading code (https://github.com/vllm-project/vllm/pull/4097) by Yard1
- [Core][Refactor] move parallel_utils into vllm/distributed (https://github.com/vllm-project/vllm/pull/3950) by youkaichao
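Concretely, after #3950 the old parallel_utils imports in zhinao.py need to move to vllm.distributed; roughly (a sketch, exact names depend on what zhinao.py actually imports):

# Before (fails on current main, see the traceback above):
# from vllm.model_executor.parallel_utils.parallel_state import (
#     get_tensor_model_parallel_world_size)

# After merging #3950:
from vllm.distributed import get_tensor_model_parallel_world_size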
Finished merging #4097 and #3950.
I'm running into the following issues:
- Completion not working
- Chat template is missing in tokenizer config, the default one will just keep the generation going forever without EOS.
$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"prompt": "Who are you?"
}'
{"id":"cmpl-480c0d4beba84d43a9474e4b83615800","object":"text_completion","created":1713430511,"model":"qihoo360/360Zhinao-7B-Chat-4K","choices":[{"index":0,"text":"<|im_end|>\n<|im_start|><|im_start|><|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":4,"total_tokens":20,"completion_tokens":16}}
It is a chat model, so we use the chat API.
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"messages": [
{
"role": "user",
"content": "who are you"
}
],
"stream": false,
"messages": [
{
"role": "user",
"content": "who are you"
}
],
"stop_token_ids": [
158326,
158333,
158332
],
"stop": [
"<eod>",
"<|im_end|>",
"<|im_start|>"
]
}'
The result is:
{
"id": "cmpl-afab46b914ac40c192cde2c1d4870b92",
"object": "chat.completion",
"created": 12789567,
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I am an AI trained to assist with a wide range of tasks and questions. I can help with information on a variety of topics, such as answering questions, setting reminders, and providing news updates."
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 21,
"total_tokens": 62,
"completion_tokens": 41
}
}
We will add this config to tokenizer_config.json later:
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
The tokenizer_config.json should also include the following so it doesn't need to be specified by the client each time. Please let me know once the HF or ModelScope version is updated.
"stop_token_ids": [
158326,
158333,
158332
],
"stop": [
"<eod>",
"<|im_end|>",
"<|im_start|>"
]
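Until the config is updated, clients can pass these per request. With the OpenAI Python client that would look roughly like this (a sketch; stop_token_ids is a vLLM extension passed via extra_body):

from openai import OpenAI

# Sketch: per-request stop settings until tokenizer_config.json carries them.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="qihoo360/360Zhinao-7B-Chat-4K",
    messages=[{"role": "user", "content": "Who are you?"}],
    stop=["<eod>", "<|im_end|>", "<|im_start|>"],
    extra_body={"stop_token_ids": [158326, 158333, 158332]},
)
print(resp.choices[0].message.content)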
Thanks, we will fix it.
@simon-mo Noticed that vLLM 0.4.1 now has generation_config.get("eos_token_id"), so we use the generation_config eos_token_id as the default stop_token_ids. We also added the default chat template. It works now:
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"messages": [
{
"role": "user",
"content": "Who are you?"
}
]
}'
{
"id": "cmpl-5be15427a5ad4562b5a1aa792fe12c7e",
"object": "chat.completion",
"created": 1714274507,
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I am an AI, a computer program designed to assist users with various tasks."
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": 158333
}
],
"usage": {
"prompt_tokens": 22,
"total_tokens": 39,
"completion_tokens": 17
}
}
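For reference, the idea behind the default is roughly this: read eos_token_id from the model's generation_config.json and use it as the default stop_token_ids. A hypothetical helper (not the exact vLLM code):

import json

# Hypothetical sketch of the default-stop idea described above:
# generation_config.json's eos_token_id may be a single id or a list of ids.
def default_stop_token_ids(generation_config_path: str) -> list[int]:
    with open(generation_config_path) as f:
        eos = json.load(f).get("eos_token_id")
    if eos is None:
        return []
    return eos if isinstance(eos, list) else [eos]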
Looks good, please fix lint by running ./format.sh
OK
$ ./format.sh
vLLM yapf: Done
vLLM mypy:
Success: no issues found in 3 source files
Success: no issues found in 7 source files
Success: no issues found in 4 source files
Success: no issues found in 3 source files
Success: no issues found in 6 source files
Success: no issues found in 2 source files
Success: no issues found in 10 source files
Success: no issues found in 4 source files
vLLM codespell: Done
vLLM ruff:
vLLM isort: Done
https://github.com/vllm-project/vllm/actions/runs/8720404543/job/23921845033?pr=4078#step:5:1
Run yapf --diff --recursive .
--- ./vllm/model_executor/models/zhinao.py (original)
+++ ./vllm/model_executor/models/zhinao.py (reformatted)
@@ -327,7 +327,9 @@
super().__init__()
self.config = config
self.linear_method = linear_method
- self.model = ZhinaoModel(config, linear_method, lora_config=lora_config)
+ self.model = ZhinaoModel(config,
+ linear_method,
+ lora_config=lora_config)
self.unpadded_vocab_size = config.vocab_size
if lora_config:
self.unpadded_vocab_size += lora_config.lora_extra_vocab_size
Did you push the changes?
Also feel free to add to https://github.com/vllm-project/vllm/blob/main/docs/source/models/supported_models.rst and README https://github.com/vllm-project/vllm?tab=readme-ov-file#about
Updated, thanks.
Merged the v0.4.2 change [Misc][Refactor] Generalize linear_method to be quant_method (https://github.com/vllm-project/vllm/pull/4373).
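Concretely, that refactor replaces the linear_method argument with a quant_config throughout the model classes, so the Zhinao constructor ends up roughly like this (a sketch of the expected shape, not the exact diff; import paths as of vLLM ~0.4.2):

from typing import Optional

from torch import nn

from vllm.model_executor.layers.quantization.base_config import QuantizationConfig


class ZhinaoForCausalLM(nn.Module):
    # Sketch: after #4373, quant_config replaces linear_method.
    def __init__(self, config, quant_config: Optional[QuantizationConfig] = None) -> None:
        super().__init__()
        self.config = config
        self.quant_config = quant_config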
@simon-mo Is it ready to merge?