build(ascend): add Dockerfile for ascend aarch64 910B
Motivation
Provide a Dockerfile for running the ascend backend with the pytorch engine. Currently only the Dockerfile for the aarch64 platform is prepared.
Modification
Add Dockerfile for ascend aarch64 910B
Checklist
- Pre-commit or other linting tools are used to fix the potential lint issues.
- The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
- The documentation has been modified accordingly, like docstring or example tutorials.
I got this error when trying to import torch_dipu inside the container:
ImportError: /deeplink/deeplink.framework/dipu/torch_dipu/libtorch_dipu.so: undefined symbol: aclprofSetStampCallStack
The CANN version used in the container is 8.0.RC3.alpha001.
deeplink.framework supports 8.0.RC1.alpha003; other versions have not been tested for now.
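As a quick sanity check inside the container, the snippet below reproduces the failure mode described above. It is a minimal sketch and assumes torch_dipu is provided by the deeplink.framework build in this image; with a matching CANN version the import should succeed, while a mismatched CANN raises the undefined-symbol ImportError.

```python
# Minimal sketch: verify that torch_dipu loads against the installed CANN.
# Assumes the container built from this Dockerfile, where torch_dipu comes
# from deeplink.framework.
try:
    import torch_dipu  # fails with "undefined symbol: aclprofSetStampCallStack" on a mismatched CANN
    print("torch_dipu imported successfully")
except ImportError as err:
    print(f"torch_dipu import failed, check the CANN version: {err}")
```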
After building the docker image, I ran: lmdeploy serve api_server Qwen2-7B-Instruct --backend pytorch and got an error.
However, the triton library does not provide a prebuilt package for aarch64, and building it from source also failed.
Currently, the models supported on the ascend platform do not include Qwen2-7B-Instruct, and api_server does not yet support passing a device_type argument to select the ascend backend.
@yunfwe The currently supported models are llama2-7b, internlm2-7b, and mixtral-8x7b. You can refer to the following script for static inference; the chat functionality is still under development:
```python
import deeplink_ext  # imported for its side effects, required by the ascend backend
import lmdeploy
from lmdeploy import PytorchEngineConfig

if __name__ == "__main__":
    backend_config = PytorchEngineConfig(tp=1, cache_max_entry_count=0.3,
                                         device_type="ascend")
    pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b",
                             backend_config=backend_config)
    question = ["上海有什么美食?"]
    response = pipe(question, request_output_len=128, do_preprocess=True)
    for idx, r in enumerate(response):
        print(f"Question: {question[idx]}")
        print(f"Answer: {r.text}")
        print()
```
Thanks for the clarification.
@RunningLeon may open another PR to add device_type to the CLI.
OK
Using CANN version 8.0.RC1.alpha003, I can successfully run the container.
However, after I modified the device_type parameter and let lmdeploy run the API server on the ascend backend, I got extremely slow inference speed compared to Ascend MindIE. Is this normal?
| | Prefill performance, token/s (batch_size = 1 / 10 / 50; input = 1000 tokens, output = 1 token) | Decode performance, token/s (batch_size = 1 / 10 / 100 / 200; input = 1 token, output = 100 tokens) |
|---|---|---|
| lmdeploy | 1238 / 1693 / 1837 | 15 / 131 / 454 / 441 |
| MindIE | 11458 / 18956 / 20061 | 68 / 643 / 4435 / 4442 |
The model is Yi-1.5-6B-Chat.
The current version is slower than MindIE. It is based on eager mode and is not fully optimized (if you have a Huawei machine with an Intel CPU, you can get 3x the performance without any changes). MindIE is based on graph mode, so it shows better performance. We are working on graph mode and will release the graph-mode version for 910B in lmdeploy by the end of October.
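For reference, once the graph-mode release lands, selecting between the two modes from Python might look like the sketch below. This is a hedged sketch only: the eager_mode field on PytorchEngineConfig is an assumption here, so verify the field name against your installed lmdeploy version before relying on it.

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Hedged sketch: eager_mode is assumed to exist on PytorchEngineConfig;
# check the signature in your installed lmdeploy version.
graph_config = PytorchEngineConfig(device_type="ascend", eager_mode=False)  # graph mode
eager_config = PytorchEngineConfig(device_type="ascend", eager_mode=True)   # eager mode

pipe = pipeline("internlm/internlm2-chat-7b", backend_config=graph_config)
print(pipe(["Hello"], request_output_len=16))
```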
Has anyone encountered the error ValueError: xpu is not available, you should use device="cpu" instead? I am using RC1 with a 910B2C.
Could you attach your test script test_deploy.py?
Any image on Docker Hub?
Will LMDeploy become a competitor to MindIE? As a user of Ascend 910B, which inference and serving engine should I choose? Related issue: https://github.com/vllm-project/vllm/pull/8054#issuecomment-2454022186
Any image on Docker Hub?
No, please use the Dockerfile (for some compliance reasons).
Will LMDeploy become a competitor to MindIE?
Yes, we have graph mode, and we capture the graph via torch.dynamo.
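As background, torch.dynamo graph capture is the mechanism behind torch.compile. The toy sketch below is unrelated to lmdeploy's actual backend registration; it only illustrates the general idea of capturing a function once and then reusing the compiled artifact.

```python
import torch

def decode_step(x, w):
    # Toy stand-in for a per-token decode computation.
    return torch.nn.functional.silu(x @ w)

# torch.compile drives torch._dynamo to capture the Python function into a
# graph and hand it to a compiler backend. lmdeploy's ascend graph mode plugs
# in its own backend; "inductor" is used here only for illustration.
compiled_step = torch.compile(decode_step, backend="inductor")

x = torch.randn(4, 64)
w = torch.randn(64, 64)
print(compiled_step(x, w).shape)  # the first call triggers capture/compilation (warmup)
```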
I tested the performance of the graph mode:
For a single request, the graph mode's speed is much closer to MindIE's than the eager mode's. However, for batched requests, the graph mode is still far slower than MindIE.
Also, for the prefill stage with batched requests, the graph mode is even slower than the eager mode.
Thanks for your testing. Assuming you have a Kunpeng CPU, here is my response (an Intel CPU would be a totally different story): for the 1-NPU test, the results meet our expectations. Large batch sizes hurt performance because the Kunpeng CPU performs poorly on detokenization; we are working on this issue. For the 4-NPU test with small batch sizes, we will analyze the gap to MindIE. Actually, I read my MindIE code (I am not sure the code on my side is the same code you have); it has simpler post-processing, no dynamic memory allocation, and no streaming output. (Our latest release is lmdeploy 0.6.3 with dlinfer-ascend 0.1.2.)
Thanks for the reply.
During testing, I also noticed a strange behavior: in graph mode the server sometimes gets stuck for a while, with a single CPU core at 100% but almost no NPU usage.
Specifically, after the inference server started:
- I tested the decode phase first. I started with batch size 1, and the server got stuck for a while; I had to re-run the test to get the expected result.
- Then I tested batch size 10, and the server got stuck again; I also had to re-run the test.
- After that, the tests with batch sizes 100 and 200 were normal, and the server did not get stuck anymore.
- When the decode-phase tests finished, I began testing the prefill phase. The situation was similar: I first tested batch size 1 and the server got stuck for a while.
- After that, the tests with batch sizes 10 and 50 were normal.
This also happens in normal usage. After the server started:
- First I opened a terminal and sent a streaming request using curl, and the server got stuck for a while.
- After that request finished, I re-sent the request and it was processed normally.
- Then I opened two terminals and sent a streaming request with a large output length; while the first stream was running but not yet finished, I sent another streaming request from the other terminal. The first stream got stuck, and after a while it resumed and the second stream started.
Eager mode does not show this behavior.
Is this a bug or expected behavior?
Sorry for my late response. The "stuck for a while" behavior is called warmup, and it only occurs in graph mode. During the warmup phase, the PyTorch code is compiled to call the Ascend toolkit; after warmup, we call the compiled function to boost performance.
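One practical mitigation is to send a short throwaway request right after the server starts, so that warmup compilation finishes before real traffic arrives. A minimal sketch, assuming the default OpenAI-compatible endpoint and port of api_server (adjust the URL and model name to your deployment):

```python
import requests

# Hedged sketch: trigger graph-mode warmup with one tiny request.
# URL, port, and model name are assumptions; match them to your api_server.
resp = requests.post(
    "http://localhost:23333/v1/chat/completions",
    json={
        "model": "internlm/internlm2-chat-7b",
        "messages": [{"role": "user", "content": "warmup"}],
        "max_tokens": 8,
    },
    timeout=600,  # the first request can take a while during graph compilation
)
print(resp.status_code)
```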