[Ascend] add Ascend NPU support
This is a draft of Ascend NPU support. It can retrieve GPU info for the NPU, and still needs optimization.
fix: https://github.com/ollama/ollama/issues/5315
Pre-built ollama binaries supporting the Huawei Atlas 800 A2 series and the Atlas 300I Duo as the backend can be obtained from the links below. Pre-built ollama environment:
- Arch: linux/arm64
- CANN: 8.1.RC1
- ollama and llama.cpp base code: 2025.5.26
Docker Image:
Atlas 800 A2:
docker pull leopony/ollama-cann-atlas-a2:latest
docker running command example:
docker run \
--name ollama \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-p 11434:11434 \
-it leopony/ollama-cann-atlas-a2:latest /bin/bash
Atlas 300I Duo:
docker pull leopony/ollama-cann-300i-duo
docker running command example:
docker run \
--name ollama \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-p 11434:11434 \
-it leopony/ollama-cann-300i-duo /bin/bash
Binary tars (linux-aarch64):
Atlas 800 A2: https://github.com/leo-pony/ollama/blob/ollama_bin/ollama-linux-arm64-cann-atlas-a2.tgz
Atlas 300I Duo: https://github.com/leo-pony/ollama/blob/ollama_bin/ollama-linux-arm64-cann-300i-duo.tgz
Manual build guide: the build steps are the same as for ollama, details as follows:
1) Binary build: cd into the ollama project directory. For Atlas 800 A2:
cmake --preset 'CANN Atlas 800 A2'
cmake --build --parallel --preset 'CANN Atlas 800 A2'
cmake --install build --component CANN
export GOFLAGS="'-ldflags=-w -s'"
export CGO_ENABLED=1
go build -trimpath -buildmode=pie -o /bin/ollama .
For Atlas 300I Duo:
cmake --preset 'CANN Atlas 300I Duo'
cmake --build --parallel --preset 'CANN Atlas 300I Duo'
cmake --install build --component CANN
export GOFLAGS="'-ldflags=-w -s'"
export CGO_ENABLED=1
go build -trimpath -buildmode=pie -o /bin/ollama .
2) Docker build: if you need a network proxy, configure the Docker proxy following this guide: https://docs.docker.com/engine/daemon/proxy/ Then cd into the ollama project directory and run: sh -x ./scripts/build_docker.sh
3) Ollama CANN release package build: if you need a network proxy, configure the Docker proxy following this guide: https://docs.docker.com/engine/daemon/proxy/ Then cd into the ollama project directory and run: sh -x ./scripts/build_linux.sh
@zhongTao99 Do you have Huawei hardware you can help ship to us for testing?
@mchiang0610
I can try my best to apply for a test environment for ollama, but this will take a lot of time.
I have verified this submission in an environment with Ascend hardware, and will post the verification process later.
@mchiang0610 I have a physical machine with an Ascend NPU that I can share with you, but the resource is very limited and several developers are working on it together. I can provide a clean Ubuntu Docker container with two NPU cards. Does that satisfy your request? If it's OK for you, please send me your public key and IP address. [email protected]
Can you package a beta version? I'd like to test it out
You can check out this PR and follow the development guide to build it yourself.
I got this error when executing ollama run, and it seems there is no ascend dir under runners.
uname -a
Linux hua-docker 4.19.90-vhulk2211.3.0.h1804.eulerosv2r10.aarch64 #1 SMP Mon Jun 3 18:15:36 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
echo $ASCEND_HOME_PATH
/home/hua/Ascend/ascend-toolkit/latest
Using a non-root user, inside a Docker container.
@zhongTao99 I am using Huawei's NPU, and I'm waiting for this PR to merge. Please try again. Thank you very much.
I used the following commands to compile ollama.
# git clone https://github.com/zhongTao99/ollama.git
# cd ollama/llm/generate
ollama/llm/generate# bash gen_linux.sh
Then I got this.
Then I started the ollama service using the generated binary file.
ollama/llm/generate# cd ../..
ollama# go build
ollama# ollama serve
Then I ran the qwen:0.5b model, but it's slower than running on the CPU.
And the utilization rate of the AICORE in the NPU is zero.
The driver version is 24.1.rc1 and the CANN version is 8.0.rc2.
What should I do?
@AspartameJ I saw your log. The model has been split into more than 300 graphs, which means data has to be copied between these graphs. For a 0.5b model, you can use one NPU instead of all available NPUs.
Please try llama-cli from llama.cpp to confirm the inference engine:
llama-cli -m path/to/model -ngl 32 -sm none -p "some questions."
I tried qwen2 1.5b (the mul_mat operator does not support 0.5b currently); the speed is about 20 tokens/s.
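The single-NPU suggestion above can be sketched by limiting device visibility before launching the engine. This is a hedged sketch: it assumes the CANN runtime honors the ASCEND_RT_VISIBLE_DEVICES variable (analogous to CUDA_VISIBLE_DEVICES); verify against your CANN version.

```shell
# Hedged sketch: expose only NPU 0 to the process, so the model is not
# split across every available NPU. Assumes the CANN runtime reads
# ASCEND_RT_VISIBLE_DEVICES (analogous to CUDA_VISIBLE_DEVICES).
export ASCEND_RT_VISIBLE_DEVICES=0
echo "$ASCEND_RT_VISIBLE_DEVICES"
# then launch, e.g.: llama-cli -m path/to/model -ngl 32 -sm none -p "..."
```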
Thanks, it worked. I tested the qwen2-7b-instruct-fp16.gguf model on a single NPU, achieving a speed of 12.78 tokens per second. Previously, I was testing a q4_0 quantized model, and it seems that currently, only models quantized to fp16 are supported.
For q8 and q4, mul_mat uses a different operator, which has some limits; for some models such as qwen2, the mul_mat shape does not satisfy the operator's requirements. This will be fixed later.
Could you please send me a usable packaged file? I'm unable to access the external network on the Huawei server, and I don't have a build environment here. Thanks! My email: [email protected] Thank you very much!
I can't send email to you; it seems my address is blocked by 163. I'm using a Docker container and there's still something wrong with the binary. If you are a Huawei employee, you can ask zhongtao from the 2012 dept for help; perhaps she can send you a package via WeLink.
My question is: do I have to build on a Huawei NPU server? Will the files I build on a normal Linux machine work on the Huawei NPU server? I don't know how to build this project; I just want to use the service!
You can build this project on any Linux server; make sure the Ascend toolkit is in your PATH. Please follow the development guide to build it yourself.
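For example, a minimal environment check before building might look like the following. The default install prefix below is an assumption (a common CANN layout); adjust it to your actual toolkit location.

```shell
# Hypothetical sketch: make sure the Ascend toolkit is visible before
# running cmake / go build. The default prefix below is an assumption;
# adjust it to your actual install location.
ASCEND_HOME="${ASCEND_HOME_PATH:-/usr/local/Ascend/ascend-toolkit/latest}"
export PATH="$ASCEND_HOME/bin:$PATH"
echo "$ASCEND_HOME"
```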
I tried to install Ollama on a 310P3. The installation, service startup, and model loading all worked fine, but I encountered an error during model inference. The model I'm using is qwen2-7b-instruct-fp16.gguf. Can you tell what might be the reason?
The following are the error logs from both the client and the server side.
@AspartameJ The 310P is not currently supported by Ollama's inference engine (llama.cpp).
I tried downloading this PR and compiled it on my server. After it compiled successfully, I ran the "ollama serve" command in my folder, but it gave me the error "ollama: command not found." Then I added the folder to the system PATH, and when I ran "ollama serve" again, I encountered the following error. Could you please help me check what the issue might be?
You can use npu-smi info to check the card version. Currently, llama.cpp supports Ascend only on the 910B3 card, so ollama in this PR supports the NPU only on the 910B3.
There are symbolic links in the ascend-toolkit path, which cause ambiguity in relative paths. The symbolic links in libPath need to be resolved before performing a Join operation.
Please add libPath, _ = filepath.EvalSymlinks(libPath) before tmp = filepath.Join(filepath.Dir(libPath), tmp)
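A small illustration of why resolving the symlink matters, using throwaway paths (the directory names are made up for this demo): "latest" is a symlink, as in a typical ascend-toolkit install, and resolving it first (readlink -f here, filepath.EvalSymlinks in the Go fix) makes subsequent relative joins unambiguous.

```shell
# Illustration with made-up paths under /tmp. "latest" is a symlink to
# the real toolkit directory; resolving it before joining relative paths
# yields the real location instead of the symlinked one.
mkdir -p /tmp/ascend-demo/real-toolkit/lib64
ln -sfn /tmp/ascend-demo/real-toolkit /tmp/ascend-demo/latest
readlink -f /tmp/ascend-demo/latest/lib64
```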
fixed
@jmorganca @dhiltgen Could you please review this PR and give some suggestions for us? Thank you!
@zhongTao99 I pulled your branch and tried to compile it, but it failed. It initially reported "ERROR Unexpected distro", which seems to be caused by the rh_linux_deps.sh script not matching the openEuler system. I modified this part and installed GCC for openEuler (https://www.hikunpeng.ru/zh/developer/devkit/compiler/gcc).
However, when I tried to install again, another error occurred:
Could you give me some advice?
https://www.hikunpeng.ru/zh/developer/devkit/compiler/gcc
I think it's due to the GCC version. Please use GCC 11.4 or a higher version and try again.
@zhongTao99 Do you have Huawei hardware you can help ship to us for testing?
We have machines with Ascend NPU for testing, and feel free to contact us by email ([email protected]) for community use. In addition, we are considering providing machines to the community for Ascend-related CI in the future.
Hi @mchiang0610 !
Since the Ascend NPU environment is available for testing, would you please test this PR using the provided resources and leave your suggestions?
As mentioned above, we can provide some Ascend machines to support Ascend NPU CI for this project (i.e., ollama).
@hipudding I upgraded to gcc14, but the issue still persists.
The issue was caused by my not updating the version of GCC that CMake was bound to.
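For reference, binding the build to a specific compiler explicitly avoids this class of problem. The GCC install prefix below is an example path, not the actual location on any particular system; substitute your GCC 11.4+ install.

```shell
# Hedged sketch: bind the build to a specific GCC explicitly instead of
# relying on the default `cc` found on PATH. The install prefix below is
# an example; substitute your GCC 11.4+ location.
export CC=/usr/local/gcc-11.4/bin/gcc
export CXX=/usr/local/gcc-11.4/bin/g++
echo "$CC"
# then configure with the compiler pinned, e.g.:
#   cmake --preset 'CANN Atlas 800 A2' \
#     -DCMAKE_C_COMPILER="$CC" -DCMAKE_CXX_COMPILER="$CXX"
```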
I am using Huawei Cloud ModelArts Notebook with Ascend: 1*ascend-snt9b1|ARM: 24-core 192GB.
- When using qwen2:1.5b, the output is faster, but NPU usage is not as high as CPU usage.
- When using qwen2:7b, the output is much slower, with high CPU usage and low NPU usage.
- When using qwen2.5:1.5b / qwen2.5:7b, the NPU is not utilized at all, resulting in extremely slow generation.
qwen2
14 characters/s
qwen2.5
0.08 characters/s!
This difference is caused by differences in how llama.cpp processes the models.
@MeiK2333 It's due to the llama.cpp inference engine. First, please confirm how many graphs are created when running inference on a model. If the graph count is very large (say, more than 100), it means some operators are not supported by the NPU, and those fall back to the CPU.
Also, in our tests NPU usage is usually below 40%; we are working on the performance issue now.