[Ascend] add Ascend NPU support
This is a draft of Ascend NPU support. It can retrieve GPU info for the NPU, and still needs optimization.
fix: https://github.com/ollama/ollama/issues/5315
Pre-built ollama binaries supporting the Huawei Atlas 800 A2 series and the Atlas 300I Duo as the backend can be obtained from the links below. Pre-built ollama environment:
- Arch: linux/arm64
- CANN: 8.1.RC1
- ollama and llama.cpp base code: 2025.5.26
Docker Image:
Atlas 800 A2:
docker pull leopony/ollama-cann-atlas-a2:latest
docker running command example:
docker run \
--name ollama \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-p 11434:11434 \
-it leopony/ollama-cann-atlas-a2:latest /bin/bash
Atlas 300I Duo:
docker pull leopony/ollama-cann-300i-duo
docker running command example:
docker run \
--name ollama \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-p 11434:11434 \
-it leopony/ollama-cann-300i-duo /bin/bash
Binary tars (linux-aarch64):
Atlas 800 A2: https://github.com/leo-pony/ollama/blob/ollama_bin/ollama-linux-arm64-cann-atlas-a2.tgz
Atlas 300I Duo: https://github.com/leo-pony/ollama/blob/ollama_bin/ollama-linux-arm64-cann-300i-duo.tgz
Manual build guide: the build steps are the same as for ollama, details as follows:
1) Binary build: cd into the ollama project directory. For Atlas 800 A2:
cmake --preset 'CANN Atlas 800 A2'
cmake --build --parallel --preset 'CANN Atlas 800 A2'
cmake --install build --component CANN
export GOFLAGS="'-ldflags=-w -s'"
export CGO_ENABLED=1
go build -trimpath -buildmode=pie -o /bin/ollama .
For Atlas 300I Duo:
cmake --preset 'CANN Atlas 300I Duo'
cmake --build --parallel --preset 'CANN Atlas 300I Duo'
cmake --install build --component CANN
export GOFLAGS="'-ldflags=-w -s'"
export CGO_ENABLED=1
go build -trimpath -buildmode=pie -o /bin/ollama .
2) Docker build: if you need a network proxy, configure the Docker proxy following this guide: https://docs.docker.com/engine/daemon/proxy/ Then cd into the ollama project directory and run: sh -x ./scripts/build_docker.sh
3) Ollama CANN release package build: if you need a network proxy, configure the Docker proxy following this guide: https://docs.docker.com/engine/daemon/proxy/ Then cd into the ollama project directory and run: sh -x ./scripts/build_linux.sh
@zhongTao99 Do you have Huawei hardware you can help ship to us for testing?
@mchiang0610
I can try my best to apply for a test environment for ollama, but this will take a lot of time.
I have verified this submission in an environment with Ascend hardware, and will post the verification process later.
@mchiang0610 I have a physical machine with an Ascend NPU that I can share with you, but the resource is very limited and several developers are working on it together. I can provide a clean Ubuntu Docker container with two NPU cards. Does that satisfy your request? If it's OK for you, please send me your public key and IP address. [email protected]
Can you package a beta version? I'd like to test it out
You can check out this PR and follow the development guide to build it yourself.
I got this error when executing ollama run, and it seems there is no ascend dir under runners.
uname -a
Linux hua-docker 4.19.90-vhulk2211.3.0.h1804.eulerosv2r10.aarch64 #1 SMP Mon Jun 3 18:15:36 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
echo $ASCEND_HOME_PATH
/home/hua/Ascend/ascend-toolkit/latest
Using a non-root user, inside a Docker container.
@zhongTao99 I am using Huawei's NPU, and I'm waiting for this PR to merge. Please try again. Thank you very much.
I used the following commands to compile ollama.
# git clone https://github.com/zhongTao99/ollama.git
# cd ollama/llm/generate
ollama/llm/generate# bash gen_linux.sh
Then I got this.
Then I started the ollama service using the generated binary file.
ollama/llm/generate# cd ../..
ollama# go build
ollama# ollama serve
Then I ran the qwen:0.5b model, but it's slower than running on the CPU.
And the utilization rate of the AICORE in the NPU is zero.
The driver version is 24.1.rc1 and the CANN version is 8.0.rc2.
What should I do?
@AspartameJ I saw your log. The model has been split into more than 300 graphs, which means data has to be copied between these graphs. For a 0.5b model, you can use one NPU instead of all available NPUs.
Please try llama-cli from llama.cpp to confirm the inference engine:
llama-cli -m path/to/model -ngl 32 -sm none -p "some questions."
I tried qwen2 1.5b (the mul_mat operator does not support 0.5b currently); the speed is about 20 tokens/s.
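The single-NPU suggestion above can be sketched by limiting device visibility before launching the engine. This is a hedged sketch: it assumes the CANN runtime honors the ASCEND_RT_VISIBLE_DEVICES variable (analogous to CUDA_VISIBLE_DEVICES); verify against your CANN version.

```shell
# Hedged sketch: expose only NPU 0 to the process, so the model is not
# split across every available NPU. Assumes the CANN runtime reads
# ASCEND_RT_VISIBLE_DEVICES (analogous to CUDA_VISIBLE_DEVICES).
export ASCEND_RT_VISIBLE_DEVICES=0
echo "$ASCEND_RT_VISIBLE_DEVICES"
# then launch, e.g.: llama-cli -m path/to/model -ngl 32 -sm none -p "..."
```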
Thanks, it worked. I tested the qwen2-7b-instruct-fp16.gguf model on a single NPU, achieving a speed of 12.78 tokens per second. Previously, I was testing a q4_0 quantized model, and it seems that currently, only models quantized to fp16 are supported.
For q8 and q4, mul_mat uses a different operator, which has some limits; for some models such as qwen2, the mul_mat shape does not satisfy the operator's requirements. This will be fixed later.
Could you please send me a usable packaged file? I'm unable to access the external network on the Huawei server, and I don't have a build environment here. Thanks! My email: [email protected] Thank you very much!
I can't send email to you; it seems my address is blocked by 163. I'm using a Docker container and there's still something wrong with the binary. If you are a Huawei employee, you can ask zhongtao from the 2012 dept for help; perhaps she can send you a package via WeLink.
My question is: do I have to build on a Huawei NPU server? Will the files I build on a normal Linux machine work on the Huawei NPU server? I don't know how to build this project; I just want to use the service!
You can build this project on any Linux server; make sure the Ascend toolkit is in your PATH. Please follow the development guide to build it yourself.
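For example, a minimal environment check before building might look like the following. The default install prefix below is an assumption (a common CANN layout); adjust it to your actual toolkit location.

```shell
# Hypothetical sketch: make sure the Ascend toolkit is visible before
# running cmake / go build. The default prefix below is an assumption;
# adjust it to your actual install location.
ASCEND_HOME="${ASCEND_HOME_PATH:-/usr/local/Ascend/ascend-toolkit/latest}"
export PATH="$ASCEND_HOME/bin:$PATH"
echo "$ASCEND_HOME"
```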
I tried to install Ollama on a 310P3. The installation, service startup, and model loading all worked fine, but I encountered an error during model inference. The model I'm using is qwen2-7b-instruct-fp16.gguf. Can you tell what might be the reason?
The following are the error logs from both the client and the server side.
@AspartameJ The 310P is not currently supported by Ollama's inference engine (llama.cpp).
I tried downloading this PR and compiled it on my server. After it compiled successfully, I ran the "ollama serve" command in my folder, but it gave me the error "ollama: command not found." Then I added the folder to the system PATH, and when I ran "ollama serve" again, I encountered the following error. Could you please help me check what the issue might be?
You can use npu-smi info to check the card version. Currently, llama.cpp supports Ascend only on the 910B3 card, so ollama in this PR supports the NPU only on the 910B3.
There are symbolic links in the ascend-toolkit path, which cause ambiguity in relative paths. The symbolic links in libPath need to be resolved before performing a Join operation.
Please add libPath, _ = filepath.EvalSymlinks(libPath) before tmp = filepath.Join(filepath.Dir(libPath), tmp)
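A small illustration of why resolving the symlink matters, using throwaway paths (the directory names are made up for this demo): "latest" is a symlink, as in a typical ascend-toolkit install, and resolving it first (readlink -f here, filepath.EvalSymlinks in the Go fix) makes subsequent relative joins unambiguous.

```shell
# Illustration with made-up paths under /tmp. "latest" is a symlink to
# the real toolkit directory; resolving it before joining relative paths
# yields the real location instead of the symlinked one.
mkdir -p /tmp/ascend-demo/real-toolkit/lib64
ln -sfn /tmp/ascend-demo/real-toolkit /tmp/ascend-demo/latest
readlink -f /tmp/ascend-demo/latest/lib64
```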
fixed
@jmorganca @dhiltgen Could you please review this PR and give some suggestions for us? Thank you!
@zhongTao99 I pulled your branch and tried to compile it, but it failed. It initially reported "ERROR Unexpected distro", which seems to be caused by the rh_linux_deps.sh script not matching the openEuler system. I modified this part and installed GCC for openEuler (https://www.hikunpeng.ru/zh/developer/devkit/compiler/gcc).
However, when I tried to install again, another error occurred:
Could you give me some advice?
https://www.hikunpeng.ru/zh/developer/devkit/compiler/gcc
I think it's due to the GCC version. Please use GCC 11.4 or a higher version and try again.
@zhongTao99 Do you have Huawei hardware you can help ship to us for testing?
We have machines with Ascend NPU for testing, and feel free to contact us by email ([email protected]) for community use. In addition, we are considering providing machines to the community for Ascend-related CI in the future.
Hi @mchiang0610 !
Since the Ascend NPU environment is available for testing, would you please test this PR using the provided resources and leave your suggestions?
As mentioned above, we can provide some Ascend machines to support Ascend NPU CI for this project (i.e., ollama).
@hipudding I upgraded to gcc14, but the issue still persists.
The issue was caused by my not updating the version of GCC that CMake was bound to.
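For reference, binding the build to a specific compiler explicitly avoids this class of problem. The GCC install prefix below is an example path, not the actual location on any particular system; substitute your GCC 11.4+ install.

```shell
# Hedged sketch: bind the build to a specific GCC explicitly instead of
# relying on the default `cc` found on PATH. The install prefix below is
# an example; substitute your GCC 11.4+ location.
export CC=/usr/local/gcc-11.4/bin/gcc
export CXX=/usr/local/gcc-11.4/bin/g++
echo "$CC"
# then configure with the compiler pinned, e.g.:
#   cmake --preset 'CANN Atlas 800 A2' \
#     -DCMAKE_C_COMPILER="$CC" -DCMAKE_CXX_COMPILER="$CXX"
```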
I am using Huawei Cloud ModelArts Notebook with Ascend: 1*ascend-snt9b1|ARM: 24-core 192GB.
- When using qwen2:1.5b, the output is faster, but NPU usage is not as high as CPU usage.
- When using qwen2:7b, the output is much slower, with high CPU usage and low NPU usage.
- When using qwen2.5:1.5b / qwen2.5:7b, the NPU is not utilized at all, resulting in extremely slow generation.
qwen2
14 characters/s
qwen2.5
0.08 characters/s!
This difference is caused by differences in how llama.cpp processes the models.
@MeiK2333 It's due to the llama.cpp inference engine. First, please confirm how many graphs are created when running inference on a model. If the graph count is very large (say, more than 100), it means some operators are not supported by the NPU, and those fall back to the CPU.
Also, in our tests NPU usage is usually below 40%; we are working on the performance issue now.