BUG: IndexError when using the generate/chat functions of client.get_model(uid) with a Baichuan2-13B-Chat model registered as a custom model via model-baichuan2-13b-chat.json
Describe the bug
Calling generate or chat with streaming on a Baichuan2-13B-Chat model registered as a custom model fails with IndexError: index -1 is out of bounds for dimension 1 with size 0.
To Reproduce
To help us reproduce this bug, please provide the information below:
- Your Python version: 3.11
- The version of xinference you use: 0.6.5
- Versions of crucial packages: xinference 0.6.5
- Full stack of the error:

2023-12-05 17:50:05,743 xinference.core.restful_api 222 ERROR Completion stream got an error: [address=0.0.0.0:41457, pid=257] index -1 is out of bounds for dimension 1 with size 0
Traceback (most recent call last):
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/restful_api.py", line 535, in stream_results
    async for item in iterator:
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/model.py", line 64, in __anext__
    return await self._model_actor_ref.next(self._uid)
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/backends/pool.py", line 657, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/home/miniconda/lib/python3.11/site-packages/xoscar/api.py", line 306, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in on_receive
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/model.py", line 238, in next
    r = await self._call_wrapper(_wrapper)
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/model.py", line 136, in _call_wrapper
    return await asyncio.to_thread(_wrapper)
  File "/home/miniconda/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/home/miniconda/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/miniconda/lib/python3.11/site-packages/xinference/core/model.py", line 227, in _wrapper
    return next(gen)
  File "/home/miniconda/lib/python3.11/site-packages/xinference/model/llm/pytorch/core.py", line 258, in generator_wrapper
    for completion_chunk, _ in generate_stream(
  File "/home/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/home/miniconda/lib/python3.11/site-packages/xinference/model/llm/pytorch/utils.py", line 196, in generate_stream
    last_token_logits = logits_processor(tmp_output_ids, logits[:, -1, :])[0]
IndexError: [address=0.0.0.0:41457, pid=257] index -1 is out of bounds for dimension 1 with size 0
- Minimized code to reproduce the error. The registered JSON is below:

{
  "version": 1,
  "context_length": 4096,
  "model_name": "baichuan-2-chat-custom",
  "model_lang": ["en", "zh"],
  "model_ability": ["chat"],
  "model_description": "Baichuan2-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting.",
  "model_specs": [
    {
      "model_format": "pytorch",
      "model_size_in_billions": 13,
      "quantizations": ["4-bit", "8-bit", "none"],
      "model_id": "baichuan-inc/Baichuan2-13B-Chat",
      "model_uri": "/export/xinference/cache/baichuan-2-chat-pytorch-13b"
    }
  ],
  "prompt_style": {
    "style_name": "NO_COLON_TWO",
    "system_prompt": "",
    "roles": ["<reserved_106>", "<reserved_107>"],
    "intra_message_sep": "",
    "inter_message_sep": "",
    "stop_token_ids": [2, 195]
  }
}
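The issue does not show how this JSON was registered; for completeness, here is a minimal sketch using the Python client's register_model call (the registration call and file name are assumptions based on the issue title, not quoted from the report):

# Sketch: registering the custom model definition with a running xinference endpoint.
# The file name comes from the issue title; the endpoint matches the repro code below.
from xinference.client import Client

client = Client("http://0.0.0.0:9997")
with open("model-baichuan2-13b-chat.json") as f:
    model_json = f.read()
client.register_model(model_type="LLM", model=model_json, persist=False)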
The client code that triggers the error is below:

from xinference.client import Client

client = Client("http://0.0.0.0:9997")
uid = client.launch_model(
    model_name='baichuan-2-chat-custom',
    model_size_in_billions=13,
    model_format='pytorch',
    quantization='8-bit',
)
model_baichuan = client.get_model(uid)

chat_history = []
model_baichuan.chat(
    "能介绍一下你自己吗?",  # "Can you introduce yourself?"
    chat_history=chat_history,
    generate_config={"stream": True, "max_tokens": 4096, "repetition_penalty": 2},
)

or

model_baichuan.generate(
    "能介绍一下你自己吗?",  # "Can you introduce yourself?"
    generate_config={"stream": True, "max_tokens": 4096, "repetition_penalty": 2},
)
Expected behavior
The model should stream a completion for the prompt instead of raising an IndexError.
Additional context
It seems something is wrong here, at xinference/model/llm/pytorch/utils.py line 124:

if model.config.is_encoder_decoder:
    max_src_len = context_len
else:
    max_src_len = context_len - max_new_tokens - 8

input_ids = input_ids[-max_src_len:]
input_echo_len = len(input_ids)

With context_length=4096 and max_tokens=4096, max_src_len becomes 4096 - 4096 - 8 = -8, so input_ids = input_ids[-max_src_len:] evaluates to input_ids[8:], which drops the leading tokens of the short prompt. input_ids ends up as [], and logits[:, -1, :] later fails with the IndexError above.
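To make the failure mode concrete, here is a standalone sketch of that slicing arithmetic with illustrative token IDs (not xinference code):

# Standalone sketch of the truncation arithmetic in generate_stream.
context_len = 4096        # "context_length" from the custom model JSON
max_new_tokens = 4096     # "max_tokens" passed in generate_config
input_ids = [101, 102, 103, 104, 105]  # a short prompt of 5 tokens (illustrative values)

max_src_len = context_len - max_new_tokens - 8   # = -8
input_ids = input_ids[-max_src_len:]             # same as input_ids[8:] -> []
print(max_src_len, input_ids)                    # prints: -8 []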
Oh, I found that it is the "max_tokens": 4096 entry in generate_config that introduces the problem. But why does the example given here show that we should pass a max_tokens key as input? Here is that example:
# Imports added for completeness; endpoint, model_uid and max_tokens are assumed
# to be defined elsewhere in the original example.
import sys
from typing import List

from xinference.client import RESTfulClient
from xinference.types import ChatCompletionMessage

client = RESTfulClient(base_url=endpoint)
model = client.get_model(model_uid=model_uid)

prompt = input("User: ")
chat_history: "List[ChatCompletionMessage]" = []
chat_history.append(ChatCompletionMessage(role="user", content=prompt))
print("Assistant: ", end="", file=sys.stdout)
response_content = ""
for chunk in model.chat(
    prompt=prompt,
    chat_history=chat_history,
    generate_config={"stream": True, "max_tokens": max_tokens},
):
    delta = chunk["choices"][0]["delta"]
    if "content" not in delta:
        continue
    else:
        response_content += delta["content"]
        print(delta["content"], end="", flush=True, file=sys.stdout)
print("\n", file=sys.stdout)
chat_history.append(
    ChatCompletionMessage(role="assistant", content=response_content)
)
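Based on the analysis above, one hedged workaround is to keep max_tokens small enough that context_length - max_tokens - 8 still leaves room for the prompt; the value below is only illustrative:

# Workaround sketch: leave headroom for the prompt inside the 4096-token context.
# 512 is an arbitrary example; any value with context_length - max_tokens - 8 >= prompt length should avoid the bad slice.
model_baichuan.chat(
    "能介绍一下你自己吗?",  # "Can you introduce yourself?"
    chat_history=[],
    generate_config={"stream": True, "max_tokens": 512, "repetition_penalty": 2},
)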
So it raises an error if max_tokens is specified and works well without it?
I took a look at the code, and it seems that Baichuan has special handling. We have a class BaichuanPytorchChatModel for it, but currently we can't specify the use of this class when customizing. If you want to test it, you can hack the match method of BaichuanPytorchChatModel and add your model name at line 73 of baichuan.py.
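For reference, a rough sketch of what that hack could look like; the exact signature of match and its field names are assumptions based on the maintainer's hint, not verified against the installed version:

# Hypothetical sketch: patching BaichuanPytorchChatModel.match in baichuan.py so the
# custom model name is also routed to the Baichuan-specific chat model class.
@classmethod
def match(cls, llm_family, llm_spec, quantization: str) -> bool:
    if llm_spec.model_format != "pytorch":
        return False
    # Add the custom model name so it matches this class as well.
    if llm_family.model_name not in ("baichuan-chat", "baichuan-2-chat", "baichuan-2-chat-custom"):
        return False
    return True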
OK, thanks a lot. I will try your suggestion!
If it works, I will create a pull request to fix this.
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.