llama.cpp
Fix flash-attn for AMD
This PR attempts to make the llama.cpp FlashAttention code run on AMD via HIP. I do not have an AMD GPU with tensor cores so I cannot test the code myself.
Edit: this PR needs rocWMMA https://github.com/ROCm/rocWMMA to be installed.
This PR also changes the order of some definitions in common.cuh, because otherwise NO_DEVICE_CODE is not functional on AMD.
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 554 iterations 🚀
- Concurrent users: 8, duration: 10m
- HTTP request : avg=8432.7ms p(95)=20146.68ms fails=, finish reason: stop=479 truncated=75
- Prompt processing (pp): avg=104.89tk/s p(95)=505.33tk/s
- Token generation (tg): avg=34.48tk/s p(95)=47.0tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=jg/flash-attn-amd commit=ac6ae5daca029e554af08281a3fd839169725c8c
[Charts omitted: time series of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing over the 10-minute, 554-iteration run on Standard_NC4as_T4_v3.]
[3/69] Building CXX object CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.obj
FAILED: CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.obj
ccache C:\PROGRA~1\AMD\ROCm\5.7\bin\CLANG_~1.EXE -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_CUDA -DGGML_USE_HIPBLAS -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -IW:/git/test/Johannes/llama.cpp/. -isystem "C:/Program Files/AMD/ROCm/5.7/include" -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -std=gnu++14 -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -march=native -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -x hip --offload-arch=gfx1100 -MD -MT CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.obj -MF CMakeFiles\ggml.dir\ggml-cuda\fattn.cu.obj.d -o CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.obj -c W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu
In file included from W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:1:
In file included from W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:15:
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:154:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:175:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:196:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:218:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:263:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:290:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
In file included from W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:1:
W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:347:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:6:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
#if FP16_MMA_AVAILABLE
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:402:28: note: expanded from macro 'FP16_MMA_AVAILABLE'
#define FP16_MMA_AVAILABLE defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:6:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:402:57: note: expanded from macro 'FP16_MMA_AVAILABLE'
#define FP16_MMA_AVAILABLE defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:6:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:403:41: note: expanded from macro 'FP16_MMA_AVAILABLE'
defined(RDNA3) : __CUDA_ARCH__ >= CC_VOLTA
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:8:10: fatal error: 'rocwmma/rocwmma.hpp' file not found
#include <rocwmma/rocwmma.hpp>
^~~~~~~~~~~~~~~~~~~~~
10 warnings and 1 error generated when compiling for gfx1100.
ninja: build stopped: subcommand failed.
I don't think it works for gfx1100, at least not yet @JohannesGaessler
Sorry, I forgot: you need to install rocWMMA https://github.com/ROCm/rocWMMA .
I just tried this PR (ac6ae5daca029e554af08281a3fd839169725c8c), but could not compile. I installed rocwmma from source and I'm using ROCm 6.0.2. I tried to compile with:
make -j32 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx1100
ggml-cuda/common.cuh:403:41: note: expanded from macro 'FP16_MMA_AVAILABLE'
defined(RDNA3) : __CUDA_ARCH__ >= CC_VOLTA
^
In file included from ggml-cuda/fattn.cu:8:
In file included from /opt/rocm/include/rocwmma/rocwmma.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/io_config.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/broadcast.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/types.hpp:155:
/opt/rocm/include/rocwmma/internal/vector.hpp:111:18: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
inline constexpr auto next_pow2(uint32_t x)
^
/opt/rocm/include/rocwmma/internal/vector.hpp:157:28: error: functions that differ only in their return type cannot be overloaded
constexpr inline T operator[](unsigned int idx) const noexcept;
~ ^
/opt/rocm/include/rocwmma/internal/vector.hpp:154:29: note: previous declaration is here
constexpr inline T& operator[](unsigned int idx) noexcept;
~~ ^
In file included from ggml-cuda/fattn.cu:8:
In file included from /opt/rocm/include/rocwmma/rocwmma.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/io_config.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/broadcast.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/types.hpp:155:
In file included from /opt/rocm/include/rocwmma/internal/vector.hpp:225:
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:41:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:49:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:57:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:65:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:74:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:83:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:97:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:107:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:117:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:127:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:137:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:147:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:161:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:171:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:181:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:397:42: error: return type of out-of-line definition of 'rocwmma::non_native_vector_base::operator[]' differs from that in the declaration
non_native_vector_base<T, Rank>::operator[](unsigned int idx) const noexcept
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:390:42: note: previous definition is here
non_native_vector_base<T, Rank>::operator[](unsigned int idx) noexcept
^
In file included from ggml-cuda/fattn.cu:8:
In file included from /opt/rocm/include/rocwmma/rocwmma.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/io_config.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/broadcast.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/types.hpp:155:
/opt/rocm/include/rocwmma/internal/vector.hpp:321:1: error: 'aligned' attribute requires integer constant
ROCWMMA_REGISTER_HIP_NON_NATIVE_VECTOR_TYPE(rocwmma::hfloat16_t, 1);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:704:50: note: expanded from macro 'ROCWMMA_REGISTER_HIP_NON_NATIVE_VECTOR_TYPE'
ROCWMMA_REGISTER_HIP_VECTOR_BASE(TYPE, RANK, ROCWMMA_HIP_NON_NATIVE_VECTOR_STORAGE_IMPL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:651:9: note: expanded from macro 'ROCWMMA_REGISTER_HIP_VECTOR_BASE'
STORAGE_IMPL(TYPE, RANK); \
^~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:623:19: note: expanded from macro 'ROCWMMA_HIP_NON_NATIVE_VECTOR_STORAGE_IMPL'
union alignas(next_pow2(RANK * sizeof(TYPE))) \
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fatal error: too many errors emitted, stopping now [-ferror-limit=]
11 warnings and 20 errors generated when compiling for gfx1100.
Well, this looks like it would be non-trivial to fix. I was hoping it would be possible to just use rocWMMA as a drop-in replacement. But as I said, I don't have an AMD GPU with tensor cores to debug this with. And quite honestly, I don't want to invest a lot of time into this either way, because as far as I am concerned none of the current AMD GPUs are worth buying anyways. So unless another dev wants to take over, the current llama.cpp FlashAttention implementation using tensor cores will be NVIDIA-only.
I just tried to compile with C++17 (MK_CXXFLAGS="-std=c++17 -fPIC"); now at least I don't see any ROCm compilation issues. However, I have these issues now:
ggml-cuda/fattn.cu:425:34: error: use of undeclared identifier '__hmax2'; did you mean '__hmax'?
KQ_max_new = __hmax2(KQ_max_new, KQ2_tmp[k0/WARP_SIZE]);
^~~~~~~
__hmax
...
ggml-cuda/fattn.cu:425:42: error: no viable conversion from 'half2' (aka '__half2') to '__half'
KQ_max_new = __hmax2(KQ_max_new, KQ2_tmp[k0/WARP_SIZE]);
^~~~~~~~~~
...
ggml-cuda/fattn.cu:473:17: error: no matching function for call to 'fill_fragment'
wmma::fill_fragment(VKQ_c[i_VKQ_0/VKQ_stride][j], 0.0f);
^~~~~~~~~~~~~~~~~~~
It could be something with my setup, as there should be a fill_fragment function in rocwmma.
nvm, there are still rocwmma compilation issues with C++17:
/opt/rocm/include/rocwmma/rocwmma_impl.hpp:189:9: note: candidate template ignored: deduced conflicting types for parameter 'DataT' ('__half' vs. 'float')
fill_fragment(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout>& frag,
^
Something is not right with the types. @JohannesGaessler, indeed, this does not look like an easy fix.
I managed to compile the branch.
> I just tried to compile with C++17 (MK_CXXFLAGS="-std=c++17 -fPIC"); now at least I don't see any ROCm compilation issues. However, I have these issues now: [__hmax2 and fill_fragment errors quoted above] It could be something with my setup, as there should be a fill_fragment function in rocwmma.
According to https://rocm.docs.amd.com/projects/HIPIFY/en/amd-staging/tables/CUDA_Device_API_supported_by_HIP.html, __hmax2 is not available in HIP yet. I made my own implementation in the meantime.
// Element-wise max over the two packed halves, built from the scalar __hmax.
inline __device__ __half2 __hmax2(__half2 x, __half2 y) {
    return __half2_raw{
        {{__hmax(__half2_raw(x).x, __half2_raw(y).x),
          __hmax(__half2_raw(x).y, __half2_raw(y).y)}}
    };
}
I know next to nothing about HIP/Cuda, so this is probably wrong.
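A shorter equivalent can be written with the half2 packing helpers that exist in both CUDA's and HIP's fp16 headers. This is a sketch only, and the ggml_hmax2 name is hypothetical, chosen to avoid redefining the reserved __hmax2 identifier:

static __device__ __forceinline__ __half2 ggml_hmax2(const __half2 x, const __half2 y) {
    // Take the element-wise max of the low and high halves, then repack into a __half2.
    return __halves2half2(__hmax(__low2half(x),  __low2half(y)),
                          __hmax(__high2half(x), __high2half(y)));
}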
> nvm, there are still rocwmma compilation issues with C++17: [fill_fragment "deduced conflicting types" error quoted above] Something is not right with the types. @JohannesGaessler, indeed, this does not look like an easy fix.
For this error, here is my workaround:
--- a/ggml-cuda/fattn.cu (revision ac6ae5daca029e554af08281a3fd839169725c8c)
+++ b/ggml-cuda/fattn.cu (revision 21b0bf477a56122d8302b218579956b034deaa36)
@@ -339,7 +351,7 @@
frag_c_KQ KQ_c[ncols/frag_n];
#pragma unroll
for (int j = 0; j < ncols/frag_n; ++j) {
- wmma::fill_fragment(KQ_c[j], 0.0f);
+ wmma::fill_fragment(KQ_c[j], KQ_acc_t{0.0f});
}
#pragma unroll
for (int k_KQ_0 = 0; k_KQ_0 < D; k_KQ_0 += 16) {
@@ -470,7 +482,7 @@
for (int i_VKQ_0 = 0; i_VKQ_0 < D; i_VKQ_0 += VKQ_stride) {
#pragma unroll
for (int j = 0; j < ncols/frag_n; ++j) {
- wmma::fill_fragment(VKQ_c[i_VKQ_0/VKQ_stride][j], 0.0f);
+ wmma::fill_fragment(VKQ_c[i_VKQ_0/VKQ_stride][j], __half{0.0f});
}
#pragma unroll
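The cast works because fill_fragment deduces its element type from both the fragment and the fill value: with a __half fragment, a bare 0.0f makes the deduction conflict (__half vs. float), while __half{0.0f} pins it down. Here is a tiny standalone analogy of that error, using hypothetical stand-in types rather than rocWMMA's actual declarations:

#include <cstdio>

// Hypothetical stand-ins for wmma::fragment / wmma::fill_fragment.
template <typename DataT>
struct fragment { DataT x[4]; };

template <typename DataT>
void fill_fragment(fragment<DataT> & frag, DataT value) {
    for (auto & v : frag.x) { v = value; }
}

int main() {
    fragment<double> f;              // stand-in for a __half fragment
    // fill_fragment(f, 0.0f);       // error: DataT deduced as both double and float
    fill_fragment(f, double{0.0f});  // explicit element type, like __half{0.0f} above
    std::printf("%f\n", f.x[0]);
    return 0;
}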
However, the resulting binary crashes, and it doesn't seem directly related to flash-attn:
/home/jerome/Prog/online/llama.cpp/build-test/bin/main --color --chatml -ngl 75 -t 12 -tb 1 -c 2048 -m /home/jerome/Prog/online/oobabooga_linux/text-generation-webui/models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf -p "What is the meaning of life?"
Log start
main: build = 2814 (21b0bf47)
main: built with clang version 17.0.0 for x86_64-pc-linux-gnu
main: seed = 1713626673
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /home/jerome/Prog/online/oobabooga_linux/text-generation-webui/models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = nurtureai_openhermes-2.5-mistral-7b-16k
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name = nurtureai_openhermes-2.5-mistral-7b-16k
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4807.06 MiB
llm_load_tensors: CPU buffer size = 85.94 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
ggml_gallocr_reserve_n: reallocating ROCm0 buffer from size 0.00 MiB to 166.00 MiB
ggml_gallocr_reserve_n: reallocating ROCm_Host buffer from size 0.00 MiB to 12.01 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 166.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1031
llama_new_context_with_model: graph splits = 2
Memory access fault by GPU node-1 (Agent handle: 0x5b2e536c7dc0) on address 0x72b1a87fc000. Reason: Page not present or supervisor privilege.
[1] 54586 IOT instruction /home/jerome/Prog/online/llama.cpp/build-test/bin/main --color --chatml -ngl
(lldb) bt
thread #1, name = 'main'
frame #0: 0x000078af48b2104f libc.so.6`ioctl + 63
frame #1: 0x000078af48d67d11 libhsakmt.so.1`___lldb_unnamed_symbol205 + 49
frame #2: 0x000078af48d61b47 libhsakmt.so.1`___lldb_unnamed_symbol164 + 199
frame #3: 0x000078af48d6403b libhsakmt.so.1`___lldb_unnamed_symbol185 + 1163
frame #4: 0x000078af48d6808f libhsakmt.so.1`hsaKmtAllocMemory + 335
frame #5: 0x000078aedfe45018 libhsa-runtime64.so.1`___lldb_unnamed_symbol1645 + 232
frame #6: 0x000078aedfe45287 libhsa-runtime64.so.1`___lldb_unnamed_symbol1646 + 55
frame #7: 0x000078aedfe6ed7b libhsa-runtime64.so.1`___lldb_unnamed_symbol2190 + 75
frame #8: 0x000078aedfe429ee libhsa-runtime64.so.1`___lldb_unnamed_symbol1603 + 110
frame #9: 0x000078aedfe427cf libhsa-runtime64.so.1`___lldb_unnamed_symbol1602 + 191
frame #10: 0x000078aedfe8ac52 libhsa-runtime64.so.1`___lldb_unnamed_symbol2397 + 178
frame #11: 0x000078aedfe8e39c libhsa-runtime64.so.1`___lldb_unnamed_symbol2402 + 5916
frame #12: 0x000078aedfe8f007 libhsa-runtime64.so.1`___lldb_unnamed_symbol2403 + 23
frame #13: 0x000078aedfe516db libhsa-runtime64.so.1`___lldb_unnamed_symbol1873 + 203
frame #14: 0x000078af49511749 libamdhip64.so.6`___lldb_unnamed_symbol3402 + 185
frame #15: 0x000078af4955387a libamdhip64.so.6`___lldb_unnamed_symbol3681 + 602
frame #16: 0x000078af492fcc79 libamdhip64.so.6`___lldb_unnamed_symbol2281 + 217
frame #17: 0x000078af492c35b3 libamdhip64.so.6`___lldb_unnamed_symbol2153 + 339
frame #18: 0x000078af4943d39f libamdhip64.so.6`___lldb_unnamed_symbol2689 + 159
frame #19: 0x000078af4940f2de libamdhip64.so.6`___lldb_unnamed_symbol2660 + 142
frame #20: 0x000078af4941071d libamdhip64.so.6`hipLaunchKernel + 749
frame #21: 0x0000568dec65e076 main`void __device_stub__k_bin_bcast<&op_mul(float, float), float, float, float>(float const*, float const*, float*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) + 390
frame #22: 0x0000568dec65b8f6 main`void bin_bcast_cuda<&op_mul(float, float)>::operator()<float, float, float>(this=0x00007fffb4b0fa5f, src0=0x0000568e109fead0, src1=0x0000568e0fde5540, dst=0x0000568e109fec60, src0_dd=0x000078ad44601000, src1_dd=0x000078ab0968fa80, dst_dd=0x000078ad44601000, stream=0x0000568e10505f20) at binbcast.cu:234:17
frame #23: 0x0000568dec652172 main`void ggml_cuda_op_bin_bcast<bin_bcast_cuda<&op_mul(float, float)>>(src0=0x0000568e109fead0, src1=0x0000568e0fde5540, dst=0x0000568e109fec60, src0_dd=0x000078ad44601000, src1_dd=0x000078ab0968fa80, dst_dd=0x000078ad44601000, stream=0x0000568e10505f20) at binbcast.cu:254:9
frame #24: 0x0000568dec65209b main`ggml_cuda_op_mul(ctx=0x0000568e10350220, dst=0x0000568e109fec60) at binbcast.cu:275:5
frame #25: 0x0000568dec645dc5 main`ggml_cuda_compute_forward(ctx=0x0000568e10350220, dst=0x0000568e109fec60) at ggml-cuda.cu:2186:13
frame #26: 0x0000568dec645409 main`ggml_backend_cuda_graph_compute(backend=0x0000568e1058eb80, cgraph=0x0000568e0fe640f0) at ggml-cuda.cu:2434:19
@jdecourval I just tried your suggestions (I also know nothing about HIP/CUDA), but ChatGPT has a similar proposal for __hmax2, and for me it compiles and works. I tried:
./main --color --chatml -ngl 75 -t 12 -tb 1 -c 2048 -m ../models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf -p "What is the meaning of life?"
with the output:
Log start
main: build = 2813 (ac6ae5da)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed = 1713629223
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ../models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = nurtureai_openhermes-2.5-mistral-7b-16k
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name = nurtureai_openhermes-2.5-mistral-7b-16k
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4807.06 MiB
llm_load_tensors: CPU buffer size = 85.94 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 166.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1031
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 12 (n_threads_batch = 1) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
main: interactive mode on.
Reverse prompt: '<|im_start|>user
'
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 12
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
What is the meaning of life?<|im_end|>
<|im_start|>user
>
The meaning of life is a philosophical question that has been debated for centuries, and there is no universally agreed upon answer. It depends on one's personal beliefs, values, and cultural context. Some people believe that the meaning of life is to find happiness, fulfillment, or spiritual enlightenment, while others may find purpose in contributing to society or achieving personal goals. Ultimately, the meaning of life is subjective and varies from person to person.<|im_end|>
>
Thanks @tbocek, it was indeed something on my side, probably a leftover from another test.
Having the code compile is a nice improvement, but it still doesn't work here if I add the --flash_attn flag:
/home/jerome/Prog/online/llama.cpp/ggml-cuda/common.cuh:375: ERROR: HIP kernel warp_reduce_sum has no device code compatible with HIP arch 1300.
It comes from the fact that half2 warp_reduce_sum(half2 a); is behind a preprocessor test in common.cuh for HIP: https://github.com/ggerganov/llama.cpp/pull/6773/files#diff-7374e791d77a6a0f981b93a9985b7bbb3584fedf2b4ed18c796eca66a27a7917R367
However, simply removing the check does make it work!
--- a/ggml-cuda/common.cuh (revision 21b0bf477a56122d8302b218579956b034deaa36)
+++ b/ggml-cuda/common.cuh (date 1713632170170)
@@ -364,16 +364,11 @@
}
static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) {
-#if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_PASCAL
#pragma unroll
for (int mask = 16; mask > 0; mask >>= 1) {
a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, mask, 32));
}
return a;
-#else
- GGML_UNUSED(a);
- NO_DEVICE_CODE;
-#endif // !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_PASCAL
}
static __device__ __forceinline__ float warp_reduce_max(float x) {
@JohannesGaessler do you know why that check is there?
That check is there because that is the old check for FlashAttention and I forgot to change it.
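For reference, instead of deleting the guard outright, the condition could also be widened so that HIP takes the intrinsic path while pre-Pascal NVIDIA targets keep the fallback. A sketch only, assuming llama.cpp's existing HIP compatibility macros (such as the __shfl_xor_sync mapping) are in place; this is not the committed fix:

static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) {
#if (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) || __CUDA_ARCH__ >= CC_PASCAL
#pragma unroll
    for (int mask = 16; mask > 0; mask >>= 1) {
        a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, mask, 32));
    }
    return a;
#else
    GGML_UNUSED(a);
    NO_DEVICE_CODE; // pre-Pascal NVIDIA: half2 intrinsics unavailable, keep the stub
#endif
}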
Here are some results. Token generation is about 2-3% faster.
❯ ./bin/llama-bench -m ggml-c4ai-command-r-35b-v01-iq4_xs.gguf -m openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf -m Nous-Hermes-2-Mixtral-8x7B-DPO.i1-IQ3_XXS.gguf -ngl 99 --flash-attn 1 --flash-attn 0 -r 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| command-r 35B IQ4_XS - 4.25 bpw | 19.48 GiB | 37.08 B | ROCm | 99 | 1 | pp 512 | 744.02 ± 1.26 |
| command-r 35B IQ4_XS - 4.25 bpw | 19.48 GiB | 37.08 B | ROCm | 99 | 1 | tg 128 | 38.04 ± 0.02 |
| command-r 35B IQ4_XS - 4.25 bpw | 19.48 GiB | 37.08 B | ROCm | 99 | 0 | pp 512 | 737.22 ± 1.69 |
| command-r 35B IQ4_XS - 4.25 bpw | 19.48 GiB | 37.08 B | ROCm | 99 | 0 | tg 128 | 37.60 ± 0.01 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | pp 512 | 3269.33 ± 23.07 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | tg 128 | 91.15 ± 0.02 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | pp 512 | 3243.98 ± 5.28 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | tg 128 | 88.92 ± 0.01 |
| llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 1 | pp 512 | 1190.03 ± 2.42 |
| llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 1 | tg 128 | 55.40 ± 0.17 |
| llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 0 | pp 512 | 1187.23 ± 4.88 |
| llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 0 | tg 128 | 54.11 ± 0.05 |
I pushed all my fixes to this branch: https://github.com/jdecourval/llama.cpp/tree/fixflashattn
If it's only token generation that is faster, then this PR is pretty much pointless because the FlashAttention kernel for batch size 1 does not use tensor cores at all (except for Phi-2). Also can you check llama-bench with -p 4096? That should be more sensitive to changes to the attention since the context is larger.
@jdecourval I applied your patches. Here are my results with -p 4096:
./llama-bench -m ../models/codellama-7b.Q4_K_M.gguf -m ../models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf -ngl 99 --flash-attn 1 --flash-attn 0 -r 10 -p 4096
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
warning: debug build, performance may be affected
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | ROCm | 99 | 1 | pp 4096 | 2512.43 ± 7.36 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | ROCm | 99 | 1 | tg 128 | 90.73 ± 0.20 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | ROCm | 99 | 0 | pp 4096 | 2424.13 ± 2.14 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | ROCm | 99 | 0 | tg 128 | 89.74 ± 0.05 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | pp 4096 | 2330.66 ± 3.34 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | tg 128 | 84.18 ± 0.04 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | pp 4096 | 2285.69 ± 2.18 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | tg 128 | 82.02 ± 0.07 |
build: ac6ae5da (2813)
> If it's only token generation that is faster, then this PR is pretty much pointless because the FlashAttention kernel for batch size 1 does not use tensor cores at all (except for Phi-2). Also can you check llama-bench with -p 4096? That should be more sensitive to changes to the attention since the context is larger.
And here are mine
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
command-r 35B IQ3_XS - 3.3 bpw | 15.65 GiB | 37.08 B | ROCm | 99 | 1 | pp 4096 | 659.82 ± 0.86 |
command-r 35B IQ3_XS - 3.3 bpw | 15.65 GiB | 37.08 B | ROCm | 99 | 1 | tg 128 | 28.86 ± 0.00 |
command-r 35B IQ3_XS - 3.3 bpw | 15.65 GiB | 37.08 B | ROCm | 99 | 0 | pp 4096 | 640.81 ± 0.28 |
command-r 35B IQ3_XS - 3.3 bpw | 15.65 GiB | 37.08 B | ROCm | 99 | 0 | tg 128 | 28.58 ± 0.02 |
llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | pp 4096 | 2671.34 ± 2.17 |
llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | tg 128 | 91.06 ± 0.03 |
llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | pp 4096 | 2615.97 ± 1.84 |
llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | tg 128 | 89.34 ± 0.04 |
llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 1 | pp 4096 | 1097.23 ± 1.43 |
llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 1 | tg 128 | 55.27 ± 0.05 |
llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 0 | pp 4096 | 1091.82 ± 2.33 |
llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 0 | tg 128 | 54.30 ± 0.02 |
llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | 1 | pp 4096 | 2698.18 ± 1.79 |
llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | 1 | tg 128 | 89.60 ± 0.08 |
llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | 0 | pp 4096 | 2650.79 ± 2.95 |
llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | 0 | tg 128 | 87.52 ± 0.02 |
llama ?B Q4_K - Small | 17.59 GiB | 33.34 B | ROCm | 99 | 1 | pp 4096 | 680.71 ± 0.58 |
llama ?B Q4_K - Small | 17.59 GiB | 33.34 B | ROCm | 99 | 1 | tg 128 | 28.17 ± 0.02 |
llama ?B Q4_K - Small | 17.59 GiB | 33.34 B | ROCm | 99 | 0 | pp 4096 | 649.49 ± 0.40 |
llama ?B Q4_K - Small | 17.59 GiB | 33.34 B | ROCm | 99 | 0 | tg 128 | 27.55 ± 0.01 |
llama 70B IQ1_M - 1.75 bpw | 15.59 GiB | 70.55 B | ROCm | 99 | 1 | pp 4096 | 335.77 ± 0.31 |
llama 70B IQ1_M - 1.75 bpw | 15.59 GiB | 70.55 B | ROCm | 99 | 1 | tg 128 | 26.80 ± 0.01 |
llama 70B IQ1_M - 1.75 bpw | 15.59 GiB | 70.55 B | ROCm | 99 | 0 | pp 4096 | 327.34 ± 0.14 |
llama 70B IQ1_M - 1.75 bpw | 15.59 GiB | 70.55 B | ROCm | 99 | 0 | tg 128 | 26.25 ± 0.01 |
Updated with more models
My current stance is that I don't think that a speedup of 2% is large enough to justify adding a dependency, especially when there is no dev with the ability to test and support the implementation themselves. I think right now it makes more sense to just disable FlashAttention for AMD. One of my next goals is to write kernels that don't use tensor cores at all. It may turn out that those are faster on AMD than rocWMMA anyways.
Alright. Thank you very much for the help.
I will update the target branch to disable flash attention when HIP is enabled, for now.
> My current stance is that I don't think that a speedup of 2% is large enough to justify adding a dependency, especially when there is no dev with the ability to test and support the implementation themselves. I think right now it makes more sense to just disable FlashAttention for AMD. One of my next goals is to write kernels that don't use tensor cores at all. It may turn out that those are faster on AMD than rocWMMA anyways.
I agree that a 2% performance improvement is not much, but the 500-600MB VRAM reduction may be significant: https://github.com/ggerganov/llama.cpp/pull/5021#issuecomment-2068039849
On Arch Linux at least, I had to manually install the rocWMMA dependency from github, as there is no current library (also not in AUR).
> On Arch Linux at least, I had to manually install the rocWMMA dependency from github, as there is no current library (also not in AUR).
There's this one: https://aur.archlinux.org/packages/rocwmma
But I manually updated pkgver inside the PKGBUILD to the current Arch Linux ROCm version, 6.0.2.
Obsolete now that https://github.com/ggerganov/llama.cpp/pull/5021 has been merged.