[Model] Add support for 360zhinao
Add support for 360zhinao model
We released the 360Zhinao model series:
- 360Zhinao-7B-Base
- 360Zhinao-7B-Chat-4K
- 360Zhinao-7B-Chat-32K
- 360Zhinao-7B-Chat-360K
Notable features of our 360Zhinao models are:
- Base Model: Leveraging a high-quality corpus of 3.4 trillion tokens consisting of mainly Chinese, English and code, we achieved competitive performance on relevant benchmarks against other 7B models.
- Chat Models: Powerful chat capabilities and three context lengths of 4K, 32K and 360K. 360K (around 500k Chinese characters) is the longest context length among open-source Chinese models at the time of release (Apr. 11, 2024).
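For reference, once this lands the model should work with vLLM's usual entry points. A minimal usage sketch (not part of this PR; it assumes the checkpoint loads with trust_remote_code, and the prompt/stop strings follow the ChatML format discussed below):

from vllm import LLM, SamplingParams

# Usage sketch; model name and stop strings are taken from this PR discussion.
llm = LLM(model="qihoo360/360Zhinao-7B-Chat-4K", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=128, stop=["<|im_end|>", "<eod>"])
prompt = "<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\n"
print(llm.generate([prompt], params)[0].outputs[0].text)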
@simon-mo can you help us review the code?
Getting
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/xmo/vllm/vllm/entrypoints/openai/api_server.py", line 157, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 347, in from_engine_args
engine = cls(
File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 421, in _init_engine
return engine_class(*args, **kwargs)
File "/home/xmo/vllm/vllm/engine/llm_engine.py", line 121, in __init__
self.model_executor = executor_class(
File "/home/xmo/vllm/vllm/executor/gpu_executor.py", line 39, in __init__
self._init_worker()
File "/home/xmo/vllm/vllm/executor/gpu_executor.py", line 66, in _init_worker
self.driver_worker.load_model()
File "/home/xmo/vllm/vllm/worker/worker.py", line 113, in load_model
self.model_runner.load_model()
File "/home/xmo/vllm/vllm/worker/model_runner.py", line 158, in load_model
self.model = get_model(
File "/home/xmo/vllm/vllm/model_executor/model_loader.py", line 58, in get_model
model_class = _get_model_architecture(model_config)[0]
File "/home/xmo/vllm/vllm/model_executor/model_loader.py", line 41, in _get_model_architecture
model_cls = ModelRegistry.load_model_cls(arch)
File "/home/xmo/vllm/vllm/model_executor/models/__init__.py", line 99, in load_model_cls
module = importlib.import_module(
File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/home/xmo/vllm/vllm/model_executor/models/zhinao.py", line 43, in <module>
from vllm.model_executor.parallel_utils.parallel_state import (
ModuleNotFoundError: No module named 'vllm.model_executor.parallel_utils.parallel_state'
On
python -m vllm.entrypoints.openai.api_server --model qihoo360/360Zhinao-7B-Chat-4K --trust-remote-code
This branch works with vLLM 0.4.0. I will merge these two new refactors:
- [Core] Refactor model loading code (https://github.com/vllm-project/vllm/pull/4097) by Yard1
- [Core][Refactor] move parallel_utils into vllm/distributed (https://github.com/vllm-project/vllm/pull/3950) by youkaichao
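Concretely, after #3950 the old parallel_utils imports in zhinao.py need to move to vllm.distributed; roughly (a sketch, exact names depend on what zhinao.py actually imports):

# Before (fails on current main, see the traceback above):
# from vllm.model_executor.parallel_utils.parallel_state import (
#     get_tensor_model_parallel_world_size)

# After merging #3950:
from vllm.distributed import get_tensor_model_parallel_world_size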
Finished merging #4097 and #3950.
I'm running into the following issues:
- Completion not working
- Chat template is missing in tokenizer config, the default one will just keep the generation going forever without EOS.
$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"prompt": "Who are you?"
}'
{"id":"cmpl-480c0d4beba84d43a9474e4b83615800","object":"text_completion","created":1713430511,"model":"qihoo360/360Zhinao-7B-Chat-4K","choices":[{"index":0,"text":"<|im_end|>\n<|im_start|><|im_start|><|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":4,"total_tokens":20,"completion_tokens":16}}
It is a chat model, so we use the chat API.
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"messages": [
{
"role": "user",
"content": "who are you"
}
],
"stream": false,
"messages": [
{
"role": "user",
"content": "who are you"
}
],
"stop_token_ids": [
158326,
158333,
158332
],
"stop": [
"<eod>",
"<|im_end|>",
"<|im_start|>"
]
}'
The result is:
{
"id": "cmpl-afab46b914ac40c192cde2c1d4870b92",
"object": "chat.completion",
"created": 12789567,
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I am an AI trained to assist with a wide range of tasks and questions. I can help with information on a variety of topics, such as answering questions, setting reminders, and providing news updates."
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 21,
"total_tokens": 62,
"completion_tokens": 41
}
}
We will add this config to tokenizer_config.json later:
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
The tokenizer_config.json should also include the following so it doesn't need to be specified by the client each time. Please let me know once the HF or ModelScope version is updated.
"stop_token_ids": [
158326,
158333,
158332
],
"stop": [
"<eod>",
"<|im_end|>",
"<|im_start|>"
]
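Until the config is updated, clients can pass these per request. With the OpenAI Python client that would look roughly like this (a sketch; stop_token_ids is a vLLM extension passed via extra_body):

from openai import OpenAI

# Sketch: per-request stop settings until tokenizer_config.json carries them.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="qihoo360/360Zhinao-7B-Chat-4K",
    messages=[{"role": "user", "content": "Who are you?"}],
    stop=["<eod>", "<|im_end|>", "<|im_start|>"],
    extra_body={"stop_token_ids": [158326, 158333, 158332]},
)
print(resp.choices[0].message.content)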
Thanks, we will fix it.
@simon-mo Noticed that vLLM 0.4.1 now has generation_config.get("eos_token_id"), so we use the generation_config eos_token_id as the default stop_token_ids. We also added the default chat template. It works now:
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"messages": [
{
"role": "user",
"content": "Who are you?"
}
]
}'
{
"id": "cmpl-5be15427a5ad4562b5a1aa792fe12c7e",
"object": "chat.completion",
"created": 1714274507,
"model": "qihoo360/360Zhinao-7B-Chat-4K",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I am an AI, a computer program designed to assist users with various tasks."
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": 158333
}
],
"usage": {
"prompt_tokens": 22,
"total_tokens": 39,
"completion_tokens": 17
}
}
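For reference, the idea behind the default is roughly this: read eos_token_id from the model's generation_config.json and use it as the default stop_token_ids. A hypothetical helper (not the exact vLLM code):

import json

# Hypothetical sketch of the default-stop idea described above:
# generation_config.json's eos_token_id may be a single id or a list of ids.
def default_stop_token_ids(generation_config_path: str) -> list[int]:
    with open(generation_config_path) as f:
        eos = json.load(f).get("eos_token_id")
    if eos is None:
        return []
    return eos if isinstance(eos, list) else [eos]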
Looks good, please fix lint by running ./format.sh
OK
$ ./format.sh
vLLM yapf: Done
vLLM mypy:
Success: no issues found in 3 source files
Success: no issues found in 7 source files
Success: no issues found in 4 source files
Success: no issues found in 3 source files
Success: no issues found in 6 source files
Success: no issues found in 2 source files
Success: no issues found in 10 source files
Success: no issues found in 4 source files
vLLM codespell: Done
vLLM ruff:
vLLM isort: Done
https://github.com/vllm-project/vllm/actions/runs/8720404543/job/23921845033?pr=4078#step:5:1
Run yapf --diff --recursive .
--- ./vllm/model_executor/models/zhinao.py (original)
+++ ./vllm/model_executor/models/zhinao.py (reformatted)
@@ -327,7 +327,9 @@
super().__init__()
self.config = config
self.linear_method = linear_method
- self.model = ZhinaoModel(config, linear_method, lora_config=lora_config)
+ self.model = ZhinaoModel(config,
+ linear_method,
+ lora_config=lora_config)
self.unpadded_vocab_size = config.vocab_size
if lora_config:
self.unpadded_vocab_size += lora_config.lora_extra_vocab_size
Did you push the changes?
Also feel free to add to https://github.com/vllm-project/vllm/blob/main/docs/source/models/supported_models.rst and README https://github.com/vllm-project/vllm?tab=readme-ov-file#about
Updated, thanks.
Merged the v0.4.2 change [Misc][Refactor] Generalize linear_method to be quant_method (https://github.com/vllm-project/vllm/pull/4373).
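Concretely, that refactor replaces the linear_method argument with a quant_config throughout the model classes, so the Zhinao constructor ends up roughly like this (a sketch of the expected shape, not the exact diff; import paths as of vLLM ~0.4.2):

from typing import Optional

from torch import nn

from vllm.model_executor.layers.quantization.base_config import QuantizationConfig


class ZhinaoForCausalLM(nn.Module):
    # Sketch: after #4373, quant_config replaces linear_method.
    def __init__(self, config, quant_config: Optional[QuantizationConfig] = None) -> None:
        super().__init__()
        self.config = config
        self.quant_config = quant_config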
@simon-mo Is it ready to merge?