Glm4-9b inference output error
Verifying the output of the glm-4-9b-chat model with the following request produces an error on the serving side:
curl --request POST \
  --url http://127.0.0.1:8000/v1/chat/completions \
  --header 'content-type: application/json' \
  --data '{
    "model": "glm-4-9b-chat",
    "temperature": 0.7,
    "top_p": 0.8,
    "messages": [
      {
        "role": "system",
        "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n"
      },
      {
        "role": "user",
        "content": "你是谁"
      }
    ],
    "max_tokens": 1024,
    "repetition_penalty": 1.0
  }'
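For reference, the same request can also be reproduced with the openai Python client. This is a minimal sketch, not part of the original report: it assumes `pip install openai`, the base_url points at the server above, and the api_key value is an arbitrary placeholder since a local server does not check it.

# Equivalent reproduction via the openai client (sketch; assumes the openai package is installed).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="glm-4-9b-chat",
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
    extra_body={"repetition_penalty": 1.0},  # non-standard field, passed through to vLLM
    messages=[
        {"role": "system",
         "content": "Below is an instruction that describes a task. "
                    "Write a response that appropriately completes the request.\n"},
        {"role": "user", "content": "你是谁"},
    ],
)
print(response.choices[0].message.content)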
Serving-side launch script:

#!/bin/bash
model="/llm/models/glm-4-9b-chat"
served_model_name="glm-4-9b-chat"

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# Tensor parallel related arguments:
export CCL_WORKER_COUNT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
source /opt/intel/oneapi/setvars.sh
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit fp8 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4000 \
  --tensor-parallel-size 1
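Since the failure reported below originates in transformers' tokenizer base classes, the library versions inside the serving container are relevant context. A quick check sketch (package names taken from the traceback paths; run in the same Python environment as the server):

# Version check for the serving environment; the traceback below points at
# transformers' tokenization_utils_base.py, so its version is the key detail.
import transformers
import vllm

print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)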
Serving-side error:

INFO:     127.0.0.1:58348 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 191, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 132, in create_chat_completion
    prompt_inputs = self._tokenize_prompt_input(
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/entrypoints/openai/serving_engine.py", line 291, in _tokenize_prompt_input
    return next(
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/entrypoints/openai/serving_engine.py", line 314, in _tokenize_prompt_inputs
    yield self._normalize_prompt_text_to_input(
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/entrypoints/openai/serving_engine.py", line 206, in _normalize_prompt_text_to_input
    encoded = tokenizer(prompt, add_special_tokens=add_special_tokens)
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3024, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3134, in _call_one
    return self.encode_plus(
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3210, in encode_plus
    return self._encode_plus(
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils.py", line 801, in _encode_plus
    return self.prepare_for_model(
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3706, in prepare_for_model
    encoded_inputs = self.pad(
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3508, in pad
    encoded_inputs = self._pad(
TypeError: ChatGLM4Tokenizer._pad() got an unexpected keyword argument 'padding_side'
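The request fails before any generation happens: newer transformers releases pass a padding_side keyword down into _pad(), and the custom ChatGLM4Tokenizer shipped with the GLM-4 checkpoint (loaded via --trust-remote-code) overrides _pad() without accepting that keyword. Assuming this signature mismatch is the root cause, a minimal workaround sketch is to wrap _pad() so the extra keyword is dropped before the custom implementation runs; the helper name below is hypothetical, not part of ipex-llm or vLLM:

# Workaround sketch (assumption: the custom tokenizer only lacks the newer
# `padding_side` parameter and otherwise pads correctly on its own).
import inspect
from functools import wraps

def make_pad_tolerant(tokenizer_cls):
    """Wrap a remote-code tokenizer's _pad() to ignore `padding_side`."""
    orig_pad = tokenizer_cls._pad
    if "padding_side" in inspect.signature(orig_pad).parameters:
        return  # tokenizer already accepts the keyword; nothing to patch

    @wraps(orig_pad)
    def _pad(self, *args, padding_side=None, **kwargs):
        # Discard the keyword; the custom _pad() keeps using the tokenizer's
        # own self.padding_side, as it did on older transformers releases.
        return orig_pad(self, *args, **kwargs)

    tokenizer_cls._pad = _pad

Since the api_server entrypoint constructs the tokenizer internally, the practical place for this change is the checkpoint's own tokenization_chatglm.py (adding a padding_side parameter to _pad() there has the same effect). Alternatively, pinning transformers to a release that predates the padding_side keyword should avoid the error; which version the vllm-0.5.4+xpu build expects is worth confirming against its requirements.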