build(ascend): add Dockerfile for ascend aarch64 910B
Motivation
Provide a Dockerfile for running the ascend backend with the pytorch engine. Currently only the Dockerfile for the aarch64 platform is prepared.
Modification
Add Dockerfile for ascend aarch64 910B
Checklist
- Pre-commit or other linting tools are used to fix the potential lint issues.
- The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
- The documentation has been modified accordingly, like docstring or example tutorials.
I got this error when trying to import torch_dipu inside the container:
ImportError: /deeplink/deeplink.framework/dipu/torch_dipu/libtorch_dipu.so: undefined symbol: aclprofSetStampCallStack
The CANN version used in the container is 8.0.RC3.alpha001.
deeplink.framework supports 8.0.RC1.alpha003; other versions have not been tested for now.
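As a quick sanity check inside the container, the snippet below reproduces the failure mode described above. It is a minimal sketch and assumes torch_dipu is provided by the deeplink.framework build in this image; with a matching CANN version the import should succeed, while a mismatched CANN raises the undefined-symbol ImportError.

```python
# Minimal sketch: verify that torch_dipu loads against the installed CANN.
# Assumes the container built from this Dockerfile, where torch_dipu comes
# from deeplink.framework.
try:
    import torch_dipu  # fails with "undefined symbol: aclprofSetStampCallStack" on a mismatched CANN
    print("torch_dipu imported successfully")
except ImportError as err:
    print(f"torch_dipu import failed, check the CANN version: {err}")
```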
After building the docker image, I ran: lmdeploy serve api_server Qwen2-7B-Instruct --backend pytorch and got an error.
However, the triton library does not provide a prebuilt package for aarch64, and building it from source also failed.
Currently, the models supported on the ascend platform do not include Qwen2-7B-Instruct, and api_server does not yet support passing a device_type argument to select the ascend backend.
@yunfwe The currently supported models are llama2-7b, internlm2-7b, and mixtral-8x7b. You can refer to the following script for static inference; the chat functionality is still under development:
```python
import deeplink_ext  # imported for its side effects, required by the ascend backend
import lmdeploy
from lmdeploy import PytorchEngineConfig

if __name__ == "__main__":
    backend_config = PytorchEngineConfig(tp=1, cache_max_entry_count=0.3,
                                         device_type="ascend")
    pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b",
                             backend_config=backend_config)
    question = ["上海有什么美食?"]
    response = pipe(question, request_output_len=128, do_preprocess=True)
    for idx, r in enumerate(response):
        print(f"Question: {question[idx]}")
        print(f"Answer: {r.text}")
        print()
```
Thanks for the clarification.
@RunningLeon may open another PR to add device_type to the CLI.
OK
Using CANN version 8.0.RC1.alpha003, I can successfully run the container.
However, after I modified the device_type parameter and let lmdeploy run the API server on the ascend backend, I got extremely slow inference speed compared to Ascend MindIE. Is this normal?
| | Prefill performance, token/s (batch_size = 1 / 10 / 50; input = 1000 tokens, output = 1 token) | Decode performance, token/s (batch_size = 1 / 10 / 100 / 200; input = 1 token, output = 100 tokens) |
|---|---|---|
| lmdeploy | 1238 / 1693 / 1837 | 15 / 131 / 454 / 441 |
| MindIE | 11458 / 18956 / 20061 | 68 / 643 / 4435 / 4442 |
The model is Yi-1.5-6B-Chat.
The current version is slower than MindIE. It is based on eager mode and is not fully optimized (if you have a Huawei machine with an Intel CPU, you can get 3x the performance without any changes). MindIE is based on graph mode, so it shows better performance. We are working on graph mode and will release the graph-mode version for 910B in lmdeploy by the end of October.
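For reference, once the graph-mode release lands, selecting between the two modes from Python might look like the sketch below. This is a hedged sketch only: the eager_mode field on PytorchEngineConfig is an assumption here, so verify the field name against your installed lmdeploy version before relying on it.

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Hedged sketch: eager_mode is assumed to exist on PytorchEngineConfig;
# check the signature in your installed lmdeploy version.
graph_config = PytorchEngineConfig(device_type="ascend", eager_mode=False)  # graph mode
eager_config = PytorchEngineConfig(device_type="ascend", eager_mode=True)   # eager mode

pipe = pipeline("internlm/internlm2-chat-7b", backend_config=graph_config)
print(pipe(["Hello"], request_output_len=16))
```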
Has anyone encountered the error ValueError: xpu is not available, you should use device="cpu" instead? I am using RC1 with a 910B2C.
Could you attach your test script test_deploy.py?
Any image on Docker Hub?
Will LMDeploy become a competitor to MindIE? As a user of Ascend 910B, which inference and serving engine should I choose? Related issue: https://github.com/vllm-project/vllm/pull/8054#issuecomment-2454022186
Any image on Docker Hub?
No, please use the Dockerfile (for some compliance reasons).
Will LMDeploy become a competitor to MindIE?
Yes, we have graph mode, and we capture the graph via torch.dynamo.
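As background, torch.dynamo graph capture is the mechanism behind torch.compile. The toy sketch below is unrelated to lmdeploy's actual backend registration; it only illustrates the general idea of capturing a function once and then reusing the compiled artifact.

```python
import torch

def decode_step(x, w):
    # Toy stand-in for a per-token decode computation.
    return torch.nn.functional.silu(x @ w)

# torch.compile drives torch._dynamo to capture the Python function into a
# graph and hand it to a compiler backend. lmdeploy's ascend graph mode plugs
# in its own backend; "inductor" is used here only for illustration.
compiled_step = torch.compile(decode_step, backend="inductor")

x = torch.randn(4, 64)
w = torch.randn(64, 64)
print(compiled_step(x, w).shape)  # the first call triggers capture/compilation (warmup)
```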
I tested the performance of the graph mode:
For a single request, the graph mode's speed is much closer to MindIE's than the eager mode's. However, for batched requests, the graph mode is still far slower than MindIE.
Also, for the prefill stage with batched requests, the graph mode is even slower than the eager mode.
Thanks for your testing. Assuming you have a Kunpeng CPU, here is my response (an Intel CPU would be a totally different story): for the 1-NPU test, the results meet our expectations. Large batch sizes hurt performance because the Kunpeng CPU performs poorly on detokenization; we are working on this issue. For the 4-NPU test with small batch sizes, we will analyze the gap to MindIE. Actually, I read my MindIE code (I am not sure the code on my side is the same code you have); it has simpler post-processing, no dynamic memory allocation, and no streaming output. (Our latest release is lmdeploy 0.6.3 with dlinfer-ascend 0.1.2.)
Thanks for the reply.
During testing, I also noticed a strange behavior: in graph mode the server sometimes gets stuck for a while, with a single CPU core at 100% but almost no NPU usage.
Specifically, after the inference server started:
- I tested the decode phase first. I started with batch size 1, and the server got stuck for a while; I had to re-run the test to get the expected result.
- Then I tested batch size 10, and the server got stuck again; I also had to re-run the test.
- After that, the tests with batch sizes 100 and 200 were normal, and the server did not get stuck anymore.
- When the decode-phase tests finished, I began testing the prefill phase. The situation was similar: I first tested batch size 1 and the server got stuck for a while.
- After that, the tests with batch sizes 10 and 50 were normal.
This also happens in normal usage. After the server started:
- First I opened a terminal and sent a streaming request using curl, and the server got stuck for a while.
- After that request finished, I re-sent the request and it was processed normally.
- Then I opened two terminals and sent a streaming request with a large output length; while the first stream was running but not yet finished, I sent another streaming request from the other terminal. The first stream got stuck, and after a while it resumed and the second stream started.
Eager mode does not show this behavior.
Is this a bug or expected behavior?
Sorry for my late response. The "stuck for a while" behavior is called warmup, and it only occurs in graph mode. During the warmup phase, the PyTorch code is compiled to call the Ascend toolkit; after warmup, we call the compiled function to boost performance.
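One practical mitigation is to send a short throwaway request right after the server starts, so that warmup compilation finishes before real traffic arrives. A minimal sketch, assuming the default OpenAI-compatible endpoint and port of api_server (adjust the URL and model name to your deployment):

```python
import requests

# Hedged sketch: trigger graph-mode warmup with one tiny request.
# URL, port, and model name are assumptions; match them to your api_server.
resp = requests.post(
    "http://localhost:23333/v1/chat/completions",
    json={
        "model": "internlm/internlm2-chat-7b",
        "messages": [{"role": "user", "content": "warmup"}],
        "max_tokens": 8,
    },
    timeout=600,  # the first request can take a while during graph compilation
)
print(resp.status_code)
```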