llama.cpp
[CANN] Add Ascend NPU backend
Ascend is a full-stack AI computing infrastructure for industry applications and services based on Huawei Ascend processors and software.
CANN (Compute Architecture of Neural Networks), developed by Huawei, is a heterogeneous computing architecture for AI.
This commit adds Ascend NPU as a new backend, implementing the following features:
- Ascend NPU registration;
- Ascend NPU runtime (device memory, streams, events; see the AscendCL sketch below);
- part of the GGML ops, through the aclnn library;
- a new test file named test-backend-runtime, for testing runtime functionality.
See also #6034.
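The "runtime" layer wraps Huawei's AscendCL (ACL) primitives. As a rough, self-contained illustration of those primitives (not code taken from this PR), allocating device memory and driving a stream and an event looks roughly like this:

```cpp
// Illustrative AscendCL usage only: init, device memory, a stream and an event.
// Error handling is reduced to a simple check; the real backend is more involved.
#include <acl/acl.h>
#include <cstdio>
#include <cstdlib>

#define ACL_CHECK(x)                                                                     \
    do {                                                                                 \
        aclError err_ = (x);                                                             \
        if (err_ != ACL_SUCCESS) {                                                       \
            fprintf(stderr, "ACL error %d at %s:%d\n", (int) err_, __FILE__, __LINE__);  \
            exit(1);                                                                     \
        }                                                                                \
    } while (0)

int main() {
    ACL_CHECK(aclInit(nullptr));      // initialize ACL once per process, before any other acl* call
    ACL_CHECK(aclrtSetDevice(0));     // select NPU device 0

    void * dev_buf = nullptr;         // device memory
    ACL_CHECK(aclrtMalloc(&dev_buf, 1 << 20, ACL_MEM_MALLOC_HUGE_FIRST));

    aclrtStream stream = nullptr;     // stream + event
    aclrtEvent  event  = nullptr;
    ACL_CHECK(aclrtCreateStream(&stream));
    ACL_CHECK(aclrtCreateEvent(&event));

    // ... enqueue aclnn operators on `stream` here ...

    ACL_CHECK(aclrtRecordEvent(event, stream));   // mark a point in the stream
    ACL_CHECK(aclrtSynchronizeStream(stream));    // wait for all queued work

    ACL_CHECK(aclrtDestroyEvent(event));
    ACL_CHECK(aclrtDestroyStream(stream));
    ACL_CHECK(aclrtFree(dev_buf));
    ACL_CHECK(aclrtResetDevice(0));
    ACL_CHECK(aclFinalize());
    return 0;
}
```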
For those struggling to find out what CANN is:
https://support.huaweicloud.com/intl/en-us/usermanual-cce/cce_10_0239.html
Great!
Good news! @ggerganov @slaren @phymbert, the most basic functions of this new backend are ready for review now. As I described in the issue (https://github.com/ggerganov/llama.cpp/issues/6034), this backend implementation may be a lot of work, and I'd like to do it in steps.
Using the CUDA implementation as a reference, the basic functions of this backend are working now. I added some GGML ops (which are built into the CANN package) and they pass the test (test-backend-ops).
More features will be submitted in independent PRs later, including:
- more GGML_OPs.
- quantization.
- split tensor.
- ...
Considering that an Ascend NPU is not so easy to obtain, here are my screenshots of compilation and testing (I have two NPUs at hand):
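For anyone who wants to reproduce this, the build and operator-test commands used later in this thread boil down to the following (the set_env.sh path depends on where the CANN toolkit is installed):

```sh
# make the CANN toolkit visible (path may differ on your system)
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# build llama.cpp with the CANN backend enabled
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=debug -DLLAMA_CANN=on && make -j

# run the operator tests against the first NPU
# (replace {OP_NAME} with an implemented op, or drop -o to run everything)
./bin/test-backend-ops test -b CANN0 -o {OP_NAME}
```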
I cannot comment on the CANN code, but the changes to the common files look good. However, I am not sure that there is any reason to merge a non-functional backend, especially considering that it is for hardware that does not seem to be publicly available. Currently, this backend does not seem to implement matrix multiplication.
Thank you very much for your review. Yes, this PR has not implemented all the features yet. Currently, only device access and some operators to verify these basic functionalities have been implemented. More operators are still under development; mat-mul is also in progress, and since it relies on quantization, it will be implemented after quantization.
Ascend NPU is publicly available hardware that can be purchased or used in virtual machines on Huawei Cloud. In China, Ascend NPU already has a considerable user base, especially among Chinese internet companies, many of which have used Ascend NPUs to build AI training or inference platforms. Due to high demand and limited production capacity, it may not be as convenient for individual developers to purchase an Ascend NPU. However, I am very willing to donate an Ascend NPU machine to the llama.cpp community for running CI and other validation work.
Currently, many popular AI projects support Ascend NPU as a hardware backend, such as PyTorch (through the PrivateUse1 mechanism), DeepSpeed, OpenCV, stable-diffusion-webui, and diffusers, and many other projects are in development. We believe that llama.cpp is an excellent large language model inference engine, so we hope to prioritize its adaptation and attract more Ascend developers and users.
I agree not to merge this non-functional backend now, but to wait until all the main features have been implemented.
Thanks.
However, I am very willing to donate an Ascend NPU machine to the llama.cpp community for running CI and other validation work.
If there is a dedicated node with the necessary hardware, adding it to ggml-ci is a relatively simple task. It will run a collection of unit and integration tests on each commit and it will make integration much smoother.
I can either send configuration instructions, or if I can get SSH access I can login directly and set it up. Let me know
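For reference, a dedicated ggml-ci node essentially runs the repository's ci/run.sh on each commit; a minimal manual invocation looks roughly like this (directories are placeholders, and a CANN-specific build toggle would still have to be added, analogous to the GG_BUILD_CUDA=1 toggle used for CUDA nodes):

```sh
# run the llama.cpp CI suite locally; results and downloaded models/data
# go into the two given directories (see ci/README.md in the repository)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir -p tmp/results tmp/mnt
bash ./ci/run.sh ./tmp/results ./tmp/mnt
```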
Sure. I will.
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 208 iterations 🚀
- Concurrent users: 8, duration: 10m
- HTTP request: avg=23200.83ms p(95)=41014.79ms fails=, finish reason: stop=91 truncated=117
- Prompt processing (pp): avg=270.32tk/s p(95)=815.7tk/s
- Token generation (tg): avg=23.87tk/s p(95)=26.75tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=npu_support commit=c28ca5d94974584703bb3b41fbe68af7dbde1be8
[Benchmark charts omitted: mermaid xychart plots from bench-server-baseline on Standard_NC4as_T4_v3 (duration=10m, 208 iterations) for llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing.]
I failed to run models with this branch, with CANN version 8.0.RC2.alpha001:
Log start
main: build = 2749 (f1bde5d)
main: built with cc (GCC) 7.3.0 for aarch64-linux-gnu
main: seed = 1714027412
llama_model_loader: loaded meta data with 19 key-value pairs and 387 tensors from /data/Qwen1.5-7B-Chat-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Qwen1.5-7B-Chat
llama_model_loader: - kv 2: qwen2.block_count u32 = 32
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 4096
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 32
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 13: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 14: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 15: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 16: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type f16: 226 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 7.72 B
llm_load_print_meta: model size = 14.38 GiB (16.00 BPW)
llm_load_print_meta: general.name = Qwen1.5-7B-Chat
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 1187.00 MiB
llm_load_tensors: CANN0 buffer size = 13541.52 MiB
......................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CANN0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CANN0 compute buffer size = 304.75 MiB
llama_new_context_with_model: CPU compute buffer size = 9.01 MiB
llama_new_context_with_model: graph nodes = 1126
llama_new_context_with_model: graph splits = 2
CANN error: EZ9903: 2024-04-25-14:43:36.365.943 OP tiling_funcs NULL
Solution: In this scenario, collect the plog when the fault occurs and locate the fault based on the plog.
TraceBack (most recent call last):
InitTilingParseCtx failed
Kernel Run failed. opType: 10, Add
launch failed for Add, errno:361001.
current device: 0, in function aclnn_ones at /home/abc/llama.cpp/ggml-cann/aclnn_ops.cpp:852
aclnnInplaceAdds(workspaceAddr, workspaceSize, executor, ctx.stream())
GGML_ASSERT: /home/abc/llama.cpp/ggml-cann.cpp:24: !"CANN error"
[1] 4088322 abort (core dumped) ASCEND_RT_VISIBLE_DEVICES=1 ./main -m /data/Qwen1.5-7B-Chat-f16.gguf -ngl 100
This bug is due to not initializing CANN before using it. The latest version has fixed this. But it still can't be used right now; not all ops are implemented.
@hipudding Great work.
I have a server with 8 *910b, can I test this PR on the 910b?
Yes, you can test operators on the 910B, but it can't run LLM inference yet.
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=debug -DLLAMA_CANN=on && make -j
./bin/test-backend-ops test -b CANN0 -o {OP_NAME}
I got this:
./test-backend-ops test -b CANN1 -o ARGSORT
ggml_backend_register: registered backend CPU
ggml_backend_register: registered backend CANN0
ggml_backend_register: registered backend CANN1
ggml_backend_register: registered backend CANN2
ggml_backend_register: registered backend CANN3
ggml_backend_register: registered backend CANN4
ggml_backend_register: registered backend CANN5
ggml_backend_register: registered backend CANN6
ggml_backend_register: registered backend CANN7
Testing 9 backends
Backend 1/9 (CPU)
Skipping
Backend 2/9 (CANN0)
Skipping
Backend 3/9 (CANN1)
Backend name: CANN1
ARGSORT(type=f32,ne=[8,1,1,1],order=0): OK
ARGSORT(type=f32,ne=[16,10,10,10],order=0): GGML_ASSERT: /home/abc/llama.cpp/ggml-cann.cpp:328: size == ggml_nbytes(tensor)
[1] 3372786 abort (core dumped) ./test-backend-ops test -b CANN1 -o ARGSORT
Thank you for your reply. When will it be possible for me to test LLM inference? Could you please provide a date?
Yes, there are still many bugs, because it is under development and not stable yet. Not all commits pass the test cases, but that will be resolved once all the basic operators are ready.
Maybe after June, maybe even later, and that does not include all data types, performance optimizations, or multi-card inference.
@hipudding I tried this:
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=debug -DLLAMA_CANN=on && make -j
got:
/home/ma-user/llama.cpp/ggml-impl.h:283:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
I used OpenI (https://openi.pcl.ac.cn/) 910a / aarch64 environment.
Try upgrading your gcc version.
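If upgrading the system compiler is not an option, pointing CMake at a newer toolchain also works; a sketch with placeholder compiler paths:

```sh
# vld1q_s16_x2 and related NEON intrinsics are missing from older aarch64 GCC
# releases; build with a newer toolchain instead (paths/versions are examples)
cmake .. -DCMAKE_BUILD_TYPE=debug -DLLAMA_CANN=on \
      -DCMAKE_C_COMPILER=/usr/bin/gcc-9 \
      -DCMAKE_CXX_COMPILER=/usr/bin/g++-9
make -j
```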
Sorry, this PR is still in progress and can't be used yet. Please wait for its release. Thanks.
@huyz-git @hipudding Thanks for your reply. I also used Huawei ModelArts with:
cmake .. -DCMAKE_BUILD_TYPE=debug -DLLAMA_CANN=on
Thanks for the reply. I just want to participate in the development of this NPU PR, so I would like to know what environment can be used for compilation at the moment. I tried Huawei Cloud ModelArts with gcc 7.3, but I still encountered errors.
You're welcome to join this PR. I'm using gcc 9.4.0 and a 910B.
Good news! The basic functionality of llama.cpp using Ascend NPU as the backend is now operational. However, the following issues remain:
1. The inference speed is very slow; the following video shows the effect at 10x speed.
2. Exceptions may be thrown during inference.
3. Memory usage is quite high.
Nevertheless, the inference results indicate that using Ascend NPU is feasible, and these issues will be resolved soon.
Known issues:
- [ ] Some recently updated ops are no longer supported by the aclnn interface (UPSCALE, CONCAT).
- [ ] ARGSORT does not use a stable sort algorithm in aclnn, which makes the test case fail.
- [ ] ROPE does not handle all conditions yet; it is sufficient for llama.
- [ ] DUP does not handle all conditions yet; it is sufficient for llama.
- [x] Output is not reasonable after the kv_cache is updated (reproducible with: ./bin/llama-cli -m Llama3-8B-Chinese-Chat-f16-v2_1.gguf -c 512 -b 1024 --keep 48 -ngl 32 --split-mode none --repeat_penalty 1.0 --color -i -r "User:" -f ../prompts/chat-with-bob.txt).
- [ ] Only f16 is supported for now.
- [ ] Only llama is supported.
- [x] Slow!!
Good day! @slaren, I'd like to know what standards should be met before merging this PR. Could it be merged first, with the above problems fixed in follow-ups? I can print a warning that this backend is not stable enough when the Ascend backend is used.
And for issue 4, can you give me some tips? I've been analyzing it for a long time but couldn't find the cause.
Issue 6 (Slow!!) has been initially fixed by memory reuse, with an almost 10x speedup; there is still more room for optimization. Current inference speed:
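For context, one common way to implement this kind of reuse (a hypothetical sketch, not necessarily what this branch does) is to cache freed device buffers by size instead of returning them to the driver on every free:

```cpp
// Hypothetical device-buffer reuse: freed buffers are kept in a per-size free
// list and handed back on the next allocation of the same size, avoiding
// repeated aclrtMalloc/aclrtFree round trips on the hot path.
#include <acl/acl.h>
#include <map>

struct cann_buffer_pool {
    std::multimap<size_t, void *> free_bufs; // size -> cached device pointers

    void * alloc(size_t size) {
        auto it = free_bufs.find(size);
        if (it != free_bufs.end()) {         // reuse a cached buffer
            void * ptr = it->second;
            free_bufs.erase(it);
            return ptr;
        }
        void * ptr = nullptr;
        aclrtMalloc(&ptr, size, ACL_MEM_MALLOC_HUGE_FIRST);
        return ptr;
    }

    void release(size_t size, void * ptr) {  // return to the pool instead of freeing
        free_bufs.emplace(size, ptr);
    }

    ~cann_buffer_pool() {                    // actually free everything at shutdown
        for (auto & kv : free_bufs) {
            aclrtFree(kv.second);
        }
    }
};
```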
I'd like to know what standards should be met before merging this PR. Could it be merged first, with the above problems fixed in follow-ups? I can print a warning that this backend is not stable enough when the Ascend backend is used.
My opinion is that the backend should be at least somewhat functional in a practical scenario before it should be merged. It does not need to support every model and every operation, or work with every hardware model, but work in progress with no practical applications should continue in the PR.
And for issue 4, can you give me some tips? I've been analyzing it for a long time but couldn't find the cause.
In test-backend-ops, you can enable the test_llama test to compare the evaluation of an entire layer of a llama model with the CPU backend. This test is normally disabled, so you have to uncomment it in the test_backend function to use it. You could also use the eval-callback example to look at the outputs of the backend and compare them to the CPU backend, to try to find which operation causes the problem.
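Concretely, that debugging workflow might look something like the following (binary names and flags are assumed from the standard llama.cpp examples and may differ in a given build):

```sh
# 1. uncomment the test_llama cases in test_backend() inside
#    tests/test-backend-ops.cpp, rebuild, then run the full comparison:
./bin/test-backend-ops test -b CANN0

# 2. dump intermediate tensors with the eval-callback example, once with no
#    offload and once fully offloaded, and diff to find the first op that diverges
./bin/llama-eval-callback -m model-f16.gguf -p "hello" -ngl 0  > cpu.log 2>&1
./bin/llama-eval-callback -m model-f16.gguf -p "hello" -ngl 99 > npu.log 2>&1
diff cpu.log npu.log | head
```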
Got it, Thanks.
- CANN version 7.0.1.1
- gcc version 10.3.1
- I have 8 * 910b
I used test-backend-ops, but got this:
@AllernChen you can follow the ggml-cann development, and we recommend using the latest CANN version. I'm using 8.0.RC2.alpha001 for development.
Disabled the ROPE operator, and it works! Please NOTE that it only supports fp16 and Llama.
You can also make an operation run on the CPU by returning false from supports_op.
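For reference, a minimal sketch of what that looks like in a backend's supports_op callback (the function name here is illustrative rather than copied from this PR):

```cpp
// Sketch: report ROPE as unsupported so ggml schedules it on the CPU backend
// instead. Based on the ggml backend interface; names are illustrative only.
#include "ggml.h"
#include "ggml-backend.h"

static bool ggml_backend_cann_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    (void) backend; // unused in this sketch
    switch (op->op) {
        case GGML_OP_ROPE:
            return false; // fall back to the CPU until the CANN ROPE kernel is fixed
        default:
            return true;  // everything else stays on the NPU (for illustration only)
    }
}
```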
Yes, I have tried that approach, but it has two problems:
- I need to add the -nkvo parameter, because it seems the CPU backend computes a GPU tensor when updating the kv_cache.
- The graph is split into many small graphs, which is slower than the current approach.
But this is only a temporary fix until the ROPE operator bug is fixed; it will be reverted after the fix.
Thanks.
The ROPE operator has been fixed.