
BUG about using the generate function of client.get_model(uid) with a Baichuan2-13B-Chat model registered via the custom model JSON model-baichuan2-13b-chat.json

Open JiangDonglai98 opened this issue 2 years ago • 7 comments

Describe the bug

A clear and concise description of what the bug is.

To Reproduce

To help us to reproduce this bug, please provide information below:

  1. Your Python version. 3.11
  2. The version of xinference you use. 0.6.5
  3. Versions of crucial packages. xinference 0.6.5
  4. Full stack of the error:

2023-12-05 17:50:05,743 xinference.core.restful_api 222 ERROR Completion stream got an error: [address=0.0.0.0:41457, pid=257] index -1 is out of bounds for dimension 1 with size 0
Traceback (most recent call last):
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/restful_api.py", line 535, in stream_results
    async for item in iterator:
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/model.py", line 64, in __anext__
    return await self._model_actor_ref.next(self._uid)
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/backends/pool.py", line 657, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/api.py", line 306, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in on_receive
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/model.py", line 238, in next
    r = await self._call_wrapper(_wrapper)
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/model.py", line 136, in _call_wrapper
    return await asyncio.to_thread(_wrapper)
  File "/home/miniconda/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/home/miniconda/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/model.py", line 227, in _wrapper
    return next(gen)
  File "/home/miniconda/lib/python3.11/site-packages/xinference/model/llm/pytorch/core.py", line 258, in generator_wrapper
    for completion_chunk, _ in generate_stream(
  File "/home/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/home/miniconda/lib/python3.11/site-packages/xinference/model/llm/pytorch/utils.py", line 196, in generate_stream
    last_token_logits = logits_processor(tmp_output_ids, logits[:, -1, :])[0]
IndexError: [address=0.0.0.0:41457, pid=257] index -1 is out of bounds for dimension 1 with size 0
  5. Minimized code to reproduce the error. The registered JSON is below:

{
    "version": 1,
    "context_length": 4096,
    "model_name": "baichuan-2-chat-custom",
    "model_lang": ["en", "zh"],
    "model_ability": ["chat"],
    "model_description": "Baichuan2-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting.",
    "model_specs": [
        {
            "model_format": "pytorch",
            "model_size_in_billions": 13,
            "quantizations": ["4-bit", "8-bit", "none"],
            "model_id": "baichuan-inc/Baichuan2-13B-Chat",
            "model_uri": "/export/xinference/cache/baichuan-2-chat-pytorch-13b"
        }
    ],
    "prompt_style": {
        "style_name": "NO_COLON_TWO",
        "system_prompt": "",
        "roles": ["<reserved_106>", "<reserved_107>"],
        "intra_message_sep": "",
        "inter_message_sep": "",
        "stop_token_ids": [2, 195]
    }
}

The code is below:

from xinference.client import Client

client = Client("http://0.0.0.0:9997")
uid = client.launch_model(
    model_name='baichuan-2-chat-custom',
    model_size_in_billions=13,
    model_format='pytorch',
    quantization='8-bit',
)
model_baichuan = client.get_model(uid)

chat_history = []  # not defined in the original snippet
model_baichuan.chat(
    "能介绍一下你自己吗?",  # "Can you introduce yourself?"
    chat_history=chat_history,
    generate_config={"stream": True, "max_tokens": 4096, "repetition_penalty": 2},
)

or

model_baichuan.generate(
    "能介绍一下你自己吗?",
    generate_config={"stream": True, "max_tokens": 4096, "repetition_penalty": 2},
)

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.

JiangDonglai98 avatar Dec 05 '23 10:12 JiangDonglai98

It seems something is wrong here, in xinference/model/llm/pytorch/utils.py around line 124:

if model.config.is_encoder_decoder:
    max_src_len = context_len
else:
    max_src_len = context_len - max_new_tokens - 8

input_ids = input_ids[-max_src_len:]
input_echo_len = len(input_ids)

input_ids is set to [] by input_ids = input_ids[-max_src_len:]: with context_length=4096 and "max_tokens": 4096, max_src_len = 4096 - 4096 - 8 = -8, so the slice becomes input_ids[8:], which is empty for a short prompt like this one.
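A minimal sketch of that behaviour, plugging in the values from the config above (context_length=4096, "max_tokens": 4096); the 6-token prompt is a hypothetical stand-in:

# Mirrors the slicing logic quoted above from xinference/model/llm/pytorch/utils.py.
# context_len and max_new_tokens come from this report; the prompt length is assumed.
context_len = 4096
max_new_tokens = 4096              # generate_config={"max_tokens": 4096}
input_ids = list(range(6))         # stand-in for a short tokenized prompt

max_src_len = context_len - max_new_tokens - 8   # -> -8
input_ids = input_ids[-max_src_len:]             # input_ids[8:] -> []
print(max_src_len, input_ids)                    # -8 []
# With no input tokens, the model returns logits whose sequence dimension has
# size 0, and logits[:, -1, :] raises the IndexError shown in the trace above.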

JiangDonglai98 avatar Dec 05 '23 12:12 JiangDonglai98

Oh, I found that generate_config={"max_tokens": 4096} with "max_tokens" introduces the problem. But why does the example given here show that we should pass a max_tokens key as input?

import sys
from typing import List

# ChatCompletionMessage is assumed to come from xinference.types, as in the
# client examples; endpoint, model_uid and max_tokens are defined elsewhere.
from xinference.client import RESTfulClient
from xinference.types import ChatCompletionMessage

client = RESTfulClient(base_url=endpoint)
model = client.get_model(model_uid=model_uid)

prompt = input("User: ")
chat_history: "List[ChatCompletionMessage]" = []
chat_history.append(ChatCompletionMessage(role="user", content=prompt))
print("Assistant: ", end="", file=sys.stdout)
response_content = ""
for chunk in model.chat(
    prompt=prompt,
    chat_history=chat_history,
    generate_config={"stream": True, "max_tokens": max_tokens},
):
    delta = chunk["choices"][0]["delta"]
    if "content" not in delta:
        continue
    else:
        response_content += delta["content"]
        print(delta["content"], end="", flush=True, file=sys.stdout)
print("\n", file=sys.stdout)
chat_history.append(
    ChatCompletionMessage(role="assistant", content=response_content)
)
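
Until this is fixed, one sketch of a workaround is to leave headroom for the prompt rather than requesting the whole context window for generation; the 512-token budget below is an arbitrary assumption, not a value from the docs:

# Workaround sketch (assumed numbers): keep prompt tokens + max_tokens below
# context_length (4096 for this custom model) so max_src_len in
# generate_stream stays positive.
prompt_budget = 512                      # rough allowance for the prompt
safe_max_tokens = 4096 - prompt_budget   # 3584
for chunk in model.chat(
    prompt=prompt,
    chat_history=chat_history,
    generate_config={"stream": True, "max_tokens": safe_max_tokens},
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True, file=sys.stdout)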

JiangDonglai98 avatar Dec 05 '23 12:12 JiangDonglai98

So it raises an error if max_tokens is specified, but works well without max_tokens?

aresnow1 avatar Dec 05 '23 14:12 aresnow1

I took a look at the code, and it seems that Baichuan has special handling. We have a class BaichuanPytorchChatModel for it, but currently we can't specify the use of this class when customizing. If you want to test it, you can hack the match method of BaichuanPytorchChatModel by adding your model name at line 73 of baichuan.py.
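For reference, the kind of hack being suggested would look roughly like the sketch below; the exact signature of match in 0.6.5 and the attribute names are assumptions on my part, and baichuan-2-chat-custom is the model name registered above:

# Sketch only -- not the actual diff. In xinference/model/llm/pytorch/baichuan.py,
# extend BaichuanPytorchChatModel.match() so it also accepts the custom model name.
@classmethod
def match(cls, llm_family, llm_spec, quantization) -> bool:
    if llm_spec.model_format != "pytorch":
        return False
    model_name = llm_family.model_name
    if "baichuan" not in model_name and model_name != "baichuan-2-chat-custom":
        return False
    if "chat" not in llm_family.model_ability:
        return False
    return True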

aresnow1 avatar Dec 05 '23 14:12 aresnow1

OK, thanks a lot, I will try your suggestion!

JiangDonglai98 avatar Dec 06 '23 02:12 JiangDonglai98

OK, thanks a lot, I will try your suggestion!

If it works, I will create a pull request to fix this.

aresnow1 avatar Dec 06 '23 09:12 aresnow1

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Aug 07 '24 19:08 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions[bot] avatar Aug 12 '24 19:08 github-actions[bot]