
[Bug] llama3.1 70B v1/chat/completions error on Huawei Ascend 910B

Open nullxjx opened this issue 4 months ago • 4 comments

Checklist

  • [X] 1. I have searched related issues but cannot get the expected help.
  • [X] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Running llama3.1 70B inference on a 910B, the chat endpoint v1/chat/completions fails with TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]].

The workaround described in https://github.com/InternLM/lmdeploy/issues/2117#issuecomment-2249349074 avoids the error, but the inference quality with that approach seems degraded, so native chat_template support would be appreciated.
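
For illustration of what native chat_template support would look like: a minimal sketch, assuming the llama3.1 70B checkpoint ships a chat template in its tokenizer_config.json and using the model path from the reproduction steps below. Rendering the messages to a plain string before tokenization is the step the failing code path appears to skip.

from transformers import AutoTokenizer

# Model path taken from the reproduction steps below.
tok = AutoTokenizer.from_pretrained("/data/models/llama3.1_70B")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I use nginx for load balancing?"},
]

# Rendering the messages through the checkpoint's own chat template produces a
# plain string, which the fast tokenizer then encodes without issue.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tok.encode(prompt)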

Reproduction

  1. Build the inference image for the 910B from the latest code
  2. Start the Docker container
cmd=bash

docker run -it -e ASCEND_VISIBLE_DEVICES=0 --privileged -u root --name lmdeploy --network eval \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /data/models:/data/models \
-p 23333:23333 \
lmdeploy-aarch64-ascend:latest \
$cmd
  3. Start the server: ASCEND_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server --backend pytorch --device ascend --tp 4 --model-name codewise-chat /data/models/llama3.1_70B
  4. Call the v1/chat/completions endpoint
curl lmdeploy:23333/v1/chat/completions \
 -X POST \
 -H "Content-Type: application/json" \
 -d '{
   "model": "codewise-chat",
   "messages": [
       {"role": "system", "content": "You are a helpful assistant."},
       {"role": "user", "content": "如何使用nginx进行负载均衡?"}
   ],
   "max_tokens": 128,
   "temperature": 1,
   "stream": true,
   "skip_special_tokens": false
 }'
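
For reference, the same request can also be sent from Python with the OpenAI client, since the lmdeploy api_server exposes an OpenAI-compatible endpoint; a minimal sketch assuming the host, port, and model name from the commands above (skip_special_tokens is passed via extra_body because it is not a standard OpenAI field):

from openai import OpenAI

# The API key is unused by the lmdeploy server but required by the client.
client = OpenAI(base_url="http://lmdeploy:23333/v1", api_key="none")

stream = client.chat.completions.create(
    model="codewise-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "如何使用nginx进行负载均衡?"},
    ],
    max_tokens=128,
    temperature=1,
    stream=True,
    extra_body={"skip_special_tokens": False},
)

# Print the streamed deltas as they arrive.
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)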

Environment

root@b8eb9e7a22fa:/opt/lmdeploy# lmdeploy check_env
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
sys.platform: linux
Python: 3.10.5 (main, Sep 24 2024, 08:21:42) [GCC 9.4.0]
CUDA available: False
MUSA available: False
numpy_random_seed: 2147483648
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
  - GCC 10.2
  - C++ Version: 201703
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-10/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=open, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=OFF, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.0
LMDeploy: 0.6.0+0fbb2ee
transformers: 4.44.2
gradio: Not Found
fastapi: 0.115.0
pydantic: 2.9.2
triton: Not Found
root@b8eb9e7a22fa:/opt/lmdeploy#

Error traceback

INFO:     172.18.0.2:39038 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 475, in chat_completions_v1
    async for res in result_generator:
  File "/opt/lmdeploy/lmdeploy/serve/async_engine.py", line 514, in generate
    prompt_input = await self._get_prompt_input(prompt,
  File "/opt/lmdeploy/lmdeploy/serve/async_engine.py", line 458, in _get_prompt_input
    input_ids = self.tokenizer.encode(prompt, add_bos=sequence_start)
  File "/opt/lmdeploy/lmdeploy/tokenizer.py", line 600, in encode
    return self.model.encode(s, add_bos, add_special_tokens, **kwargs)
  File "/opt/lmdeploy/lmdeploy/tokenizer.py", line 366, in encode
    encoded = self.model.encode(s,
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825, in encode
    encoded_inputs = self.encode_plus(
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3237, in encode_plus
    return self._encode_plus(
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 601, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 528, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
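
Per the traceback, the failure is raised inside the fast tokenizer's encode_batch, which only accepts plain strings or string pairs. A minimal sketch of how the same TypeError surfaces at the transformers level, assuming the model path from the reproduction and that the value reaching tokenizer.encode is not a rendered string (for example the raw message list, or None when no chat template matches the given --model-name):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/data/models/llama3.1_70B")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "如何使用nginx进行负载均衡?"},
]

# Feeding anything other than a plain string into a fast tokenizer's encode ends
# up in tokenizers' encode_batch and raises the same error as above:
# TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
tok.encode(messages)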

nullxjx · Sep 25 '24 12:09