llama.cpp
Fix flash-attn for AMD
This PR attempts to make the llama.cpp FlashAttention code run on AMD via HIP. I do not have an AMD GPU with tensor cores so I cannot test the code myself.
Edit: this PR needs rocWMMA https://github.com/ROCm/rocWMMA to be installed.
This PR also changes the order of some definitions in common.cuh, because otherwise NO_DEVICE_CODE is not functional on AMD.
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 554 iterations 🚀
- Concurrent users: 8, duration: 10m
- HTTP request : avg=8432.7ms p(95)=20146.68ms fails=, finish reason: stop=479 truncated=75
- Prompt processing (pp): avg=104.89tk/s p(95)=505.33tk/s
- Token generation (tg): avg=34.48tk/s p(95)=47.0tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=jg/flash-attn-amd commit=ac6ae5daca029e554af08281a3fd839169725c8c
[Charts omitted: time series of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing over the 10-minute, 554-iteration run on Standard_NC4as_T4_v3.]
[3/69] Building CXX object CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.obj
FAILED: CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.obj
ccache C:\PROGRA~1\AMD\ROCm\5.7\bin\CLANG_~1.EXE -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_CUDA -DGGML_USE_HIPBLAS -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -IW:/git/test/Johannes/llama.cpp/. -isystem "C:/Program Files/AMD/ROCm/5.7/include" -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -std=gnu++14 -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -march=native -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -x hip --offload-arch=gfx1100 -MD -MT CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.obj -MF CMakeFiles\ggml.dir\ggml-cuda\fattn.cu.obj.d -o CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.obj -c W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu
In file included from W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:1:
In file included from W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:15:
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:154:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:175:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:196:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:218:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:263:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
W:/git/test/Johannes/llama.cpp/.\ggml-common.h:290:9: warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
struct {
^
In file included from W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:1:
W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:347:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:6:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
#if FP16_MMA_AVAILABLE
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:402:28: note: expanded from macro 'FP16_MMA_AVAILABLE'
#define FP16_MMA_AVAILABLE defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:6:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:402:57: note: expanded from macro 'FP16_MMA_AVAILABLE'
#define FP16_MMA_AVAILABLE defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:6:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
W:/git/test/Johannes/llama.cpp/ggml-cuda/common.cuh:403:41: note: expanded from macro 'FP16_MMA_AVAILABLE'
defined(RDNA3) : __CUDA_ARCH__ >= CC_VOLTA
^
W:/git/test/Johannes/llama.cpp/ggml-cuda/fattn.cu:8:10: fatal error: 'rocwmma/rocwmma.hpp' file not found
#include <rocwmma/rocwmma.hpp>
^~~~~~~~~~~~~~~~~~~~~
10 warnings and 1 error generated when compiling for gfx1100.
ninja: build stopped: subcommand failed.
I don't think it works for gfx1100, at least not yet @JohannesGaessler
Sorry, I forgot: you need to install rocWMMA https://github.com/ROCm/rocWMMA .
I just tried this PR (ac6ae5daca029e554af08281a3fd839169725c8c), but could not compile. I installed rocwmma from source and I'm using ROCm 6.0.2. I tried to compile with:
make -j32 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx1100
ggml-cuda/common.cuh:403:41: note: expanded from macro 'FP16_MMA_AVAILABLE'
defined(RDNA3) : __CUDA_ARCH__ >= CC_VOLTA
^
In file included from ggml-cuda/fattn.cu:8:
In file included from /opt/rocm/include/rocwmma/rocwmma.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/io_config.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/broadcast.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/types.hpp:155:
/opt/rocm/include/rocwmma/internal/vector.hpp:111:18: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
inline constexpr auto next_pow2(uint32_t x)
^
/opt/rocm/include/rocwmma/internal/vector.hpp:157:28: error: functions that differ only in their return type cannot be overloaded
constexpr inline T operator[](unsigned int idx) const noexcept;
~ ^
/opt/rocm/include/rocwmma/internal/vector.hpp:154:29: note: previous declaration is here
constexpr inline T& operator[](unsigned int idx) noexcept;
~~ ^
In file included from ggml-cuda/fattn.cu:8:
In file included from /opt/rocm/include/rocwmma/rocwmma.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/io_config.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/broadcast.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/types.hpp:155:
In file included from /opt/rocm/include/rocwmma/internal/vector.hpp:225:
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:41:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:49:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:57:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:65:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:74:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:83:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:97:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:107:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:117:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:127:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:137:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:147:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:161:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:171:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs, TT rhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:181:61: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
ROCWMMA_HOST_DEVICE constexpr static inline auto exec(TT lhs)
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:397:42: error: return type of out-of-line definition of 'rocwmma::non_native_vector_base::operator[]' differs from that in the declaration
non_native_vector_base<T, Rank>::operator[](unsigned int idx) const noexcept
^
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:390:42: note: previous definition is here
non_native_vector_base<T, Rank>::operator[](unsigned int idx) noexcept
^
In file included from ggml-cuda/fattn.cu:8:
In file included from /opt/rocm/include/rocwmma/rocwmma.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/io_config.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/broadcast.hpp:29:
In file included from /opt/rocm/include/rocwmma/internal/types.hpp:155:
/opt/rocm/include/rocwmma/internal/vector.hpp:321:1: error: 'aligned' attribute requires integer constant
ROCWMMA_REGISTER_HIP_NON_NATIVE_VECTOR_TYPE(rocwmma::hfloat16_t, 1);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:704:50: note: expanded from macro 'ROCWMMA_REGISTER_HIP_NON_NATIVE_VECTOR_TYPE'
ROCWMMA_REGISTER_HIP_VECTOR_BASE(TYPE, RANK, ROCWMMA_HIP_NON_NATIVE_VECTOR_STORAGE_IMPL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:651:9: note: expanded from macro 'ROCWMMA_REGISTER_HIP_VECTOR_BASE'
STORAGE_IMPL(TYPE, RANK); \
^~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/include/rocwmma/internal/vector_impl.hpp:623:19: note: expanded from macro 'ROCWMMA_HIP_NON_NATIVE_VECTOR_STORAGE_IMPL'
union alignas(next_pow2(RANK * sizeof(TYPE))) \
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fatal error: too many errors emitted, stopping now [-ferror-limit=]
11 warnings and 20 errors generated when compiling for gfx1100.
Well, this looks like it would be non-trivial to fix. I was hoping it would be possible to just use rocWMMA as a drop-in replacement. But as I said, I don't have an AMD GPU with tensor cores to debug this with. And quite honestly, I don't want to invest a lot of time into this either way, because as far as I am concerned none of the current AMD GPUs are worth buying anyways. So unless another dev wants to take over, the current llama.cpp FlashAttention implementation using tensor cores will be NVIDIA-only.
I just tried to compile with C++17 (MK_CXXFLAGS="-std=c++17 -fPIC"); now at least I don't see any ROCm compilation issues. However, I have these issues now:
ggml-cuda/fattn.cu:425:34: error: use of undeclared identifier '__hmax2'; did you mean '__hmax'?
KQ_max_new = __hmax2(KQ_max_new, KQ2_tmp[k0/WARP_SIZE]);
^~~~~~~
__hmax
...
ggml-cuda/fattn.cu:425:42: error: no viable conversion from 'half2' (aka '__half2') to '__half'
KQ_max_new = __hmax2(KQ_max_new, KQ2_tmp[k0/WARP_SIZE]);
^~~~~~~~~~
...
ggml-cuda/fattn.cu:473:17: error: no matching function for call to 'fill_fragment'
wmma::fill_fragment(VKQ_c[i_VKQ_0/VKQ_stride][j], 0.0f);
^~~~~~~~~~~~~~~~~~~
It could be something with my setup, as there should be a fill_fragment function in rocwmma.
nvm, there are still rocwmma compilation issues with C++17:
/opt/rocm/include/rocwmma/rocwmma_impl.hpp:189:9: note: candidate template ignored: deduced conflicting types for parameter 'DataT' ('__half' vs. 'float')
fill_fragment(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout>& frag,
^
Something is not right with the types. @JohannesGaessler, indeed, this does not look like an easy fix.
I managed to compile the branch.
> I just tried to compile with C++17 (MK_CXXFLAGS="-std=c++17 -fPIC"); now at least I don't see any ROCm compilation issues. However, I have these issues now: [__hmax2 and fill_fragment errors quoted above] It could be something with my setup, as there should be a fill_fragment function in rocwmma.
According to https://rocm.docs.amd.com/projects/HIPIFY/en/amd-staging/tables/CUDA_Device_API_supported_by_HIP.html, __hmax2 is not available in HIP yet. I made my own implementation in the meantime.
// Element-wise max over the two packed halves, built from the scalar __hmax.
inline __device__ __half2 __hmax2(__half2 x, __half2 y) {
    return __half2_raw{
        {{__hmax(__half2_raw(x).x, __half2_raw(y).x),
          __hmax(__half2_raw(x).y, __half2_raw(y).y)}}
    };
}
I know next to nothing about HIP/Cuda, so this is probably wrong.
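A shorter equivalent can be written with the half2 packing helpers that exist in both CUDA's and HIP's fp16 headers. This is a sketch only, and the ggml_hmax2 name is hypothetical, chosen to avoid redefining the reserved __hmax2 identifier:

static __device__ __forceinline__ __half2 ggml_hmax2(const __half2 x, const __half2 y) {
    // Take the element-wise max of the low and high halves, then repack into a __half2.
    return __halves2half2(__hmax(__low2half(x),  __low2half(y)),
                          __hmax(__high2half(x), __high2half(y)));
}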
> nvm, there are still rocwmma compilation issues with C++17: [fill_fragment "deduced conflicting types" error quoted above] Something is not right with the types. @JohannesGaessler, indeed, this does not look like an easy fix.
For this error, here is my workaround:
--- a/ggml-cuda/fattn.cu (revision ac6ae5daca029e554af08281a3fd839169725c8c)
+++ b/ggml-cuda/fattn.cu (revision 21b0bf477a56122d8302b218579956b034deaa36)
@@ -339,7 +351,7 @@
frag_c_KQ KQ_c[ncols/frag_n];
#pragma unroll
for (int j = 0; j < ncols/frag_n; ++j) {
- wmma::fill_fragment(KQ_c[j], 0.0f);
+ wmma::fill_fragment(KQ_c[j], KQ_acc_t{0.0f});
}
#pragma unroll
for (int k_KQ_0 = 0; k_KQ_0 < D; k_KQ_0 += 16) {
@@ -470,7 +482,7 @@
for (int i_VKQ_0 = 0; i_VKQ_0 < D; i_VKQ_0 += VKQ_stride) {
#pragma unroll
for (int j = 0; j < ncols/frag_n; ++j) {
- wmma::fill_fragment(VKQ_c[i_VKQ_0/VKQ_stride][j], 0.0f);
+ wmma::fill_fragment(VKQ_c[i_VKQ_0/VKQ_stride][j], __half{0.0f});
}
#pragma unroll
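The cast works because fill_fragment deduces its element type from both the fragment and the fill value: with a __half fragment, a bare 0.0f makes the deduction conflict (__half vs. float), while __half{0.0f} pins it down. Here is a tiny standalone analogy of that error, using hypothetical stand-in types rather than rocWMMA's actual declarations:

#include <cstdio>

// Hypothetical stand-ins for wmma::fragment / wmma::fill_fragment.
template <typename DataT>
struct fragment { DataT x[4]; };

template <typename DataT>
void fill_fragment(fragment<DataT> & frag, DataT value) {
    for (auto & v : frag.x) { v = value; }
}

int main() {
    fragment<double> f;              // stand-in for a __half fragment
    // fill_fragment(f, 0.0f);       // error: DataT deduced as both double and float
    fill_fragment(f, double{0.0f});  // explicit element type, like __half{0.0f} above
    std::printf("%f\n", f.x[0]);
    return 0;
}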
However, the resulting binary crashes, and it doesn't seem directly related to flash-attn:
/home/jerome/Prog/online/llama.cpp/build-test/bin/main --color --chatml -ngl 75 -t 12 -tb 1 -c 2048 -m /home/jerome/Prog/online/oobabooga_linux/text-generation-webui/models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf -p "What is the meaning of life?"
Log start
main: build = 2814 (21b0bf47)
main: built with clang version 17.0.0 for x86_64-pc-linux-gnu
main: seed = 1713626673
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /home/jerome/Prog/online/oobabooga_linux/text-generation-webui/models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = nurtureai_openhermes-2.5-mistral-7b-16k
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name = nurtureai_openhermes-2.5-mistral-7b-16k
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4807.06 MiB
llm_load_tensors: CPU buffer size = 85.94 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
ggml_gallocr_reserve_n: reallocating ROCm0 buffer from size 0.00 MiB to 166.00 MiB
ggml_gallocr_reserve_n: reallocating ROCm_Host buffer from size 0.00 MiB to 12.01 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 166.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1031
llama_new_context_with_model: graph splits = 2
Memory access fault by GPU node-1 (Agent handle: 0x5b2e536c7dc0) on address 0x72b1a87fc000. Reason: Page not present or supervisor privilege.
[1] 54586 IOT instruction /home/jerome/Prog/online/llama.cpp/build-test/bin/main --color --chatml -ngl
(lldb) bt
thread #1, name = 'main'
frame #0: 0x000078af48b2104f libc.so.6`ioctl + 63
frame #1: 0x000078af48d67d11 libhsakmt.so.1`___lldb_unnamed_symbol205 + 49
frame #2: 0x000078af48d61b47 libhsakmt.so.1`___lldb_unnamed_symbol164 + 199
frame #3: 0x000078af48d6403b libhsakmt.so.1`___lldb_unnamed_symbol185 + 1163
frame #4: 0x000078af48d6808f libhsakmt.so.1`hsaKmtAllocMemory + 335
frame #5: 0x000078aedfe45018 libhsa-runtime64.so.1`___lldb_unnamed_symbol1645 + 232
frame #6: 0x000078aedfe45287 libhsa-runtime64.so.1`___lldb_unnamed_symbol1646 + 55
frame #7: 0x000078aedfe6ed7b libhsa-runtime64.so.1`___lldb_unnamed_symbol2190 + 75
frame #8: 0x000078aedfe429ee libhsa-runtime64.so.1`___lldb_unnamed_symbol1603 + 110
frame #9: 0x000078aedfe427cf libhsa-runtime64.so.1`___lldb_unnamed_symbol1602 + 191
frame #10: 0x000078aedfe8ac52 libhsa-runtime64.so.1`___lldb_unnamed_symbol2397 + 178
frame #11: 0x000078aedfe8e39c libhsa-runtime64.so.1`___lldb_unnamed_symbol2402 + 5916
frame #12: 0x000078aedfe8f007 libhsa-runtime64.so.1`___lldb_unnamed_symbol2403 + 23
frame #13: 0x000078aedfe516db libhsa-runtime64.so.1`___lldb_unnamed_symbol1873 + 203
frame #14: 0x000078af49511749 libamdhip64.so.6`___lldb_unnamed_symbol3402 + 185
frame #15: 0x000078af4955387a libamdhip64.so.6`___lldb_unnamed_symbol3681 + 602
frame #16: 0x000078af492fcc79 libamdhip64.so.6`___lldb_unnamed_symbol2281 + 217
frame #17: 0x000078af492c35b3 libamdhip64.so.6`___lldb_unnamed_symbol2153 + 339
frame #18: 0x000078af4943d39f libamdhip64.so.6`___lldb_unnamed_symbol2689 + 159
frame #19: 0x000078af4940f2de libamdhip64.so.6`___lldb_unnamed_symbol2660 + 142
frame #20: 0x000078af4941071d libamdhip64.so.6`hipLaunchKernel + 749
frame #21: 0x0000568dec65e076 main`void __device_stub__k_bin_bcast<&op_mul(float, float), float, float, float>(float const*, float const*, float*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) + 390
frame #22: 0x0000568dec65b8f6 main`void bin_bcast_cuda<&op_mul(float, float)>::operator()<float, float, float>(this=0x00007fffb4b0fa5f, src0=0x0000568e109fead0, src1=0x0000568e0fde5540, dst=0x0000568e109fec60, src0_dd=0x000078ad44601000, src1_dd=0x000078ab0968fa80, dst_dd=0x000078ad44601000, stream=0x0000568e10505f20) at binbcast.cu:234:17
frame #23: 0x0000568dec652172 main`void ggml_cuda_op_bin_bcast<bin_bcast_cuda<&op_mul(float, float)>>(src0=0x0000568e109fead0, src1=0x0000568e0fde5540, dst=0x0000568e109fec60, src0_dd=0x000078ad44601000, src1_dd=0x000078ab0968fa80, dst_dd=0x000078ad44601000, stream=0x0000568e10505f20) at binbcast.cu:254:9
frame #24: 0x0000568dec65209b main`ggml_cuda_op_mul(ctx=0x0000568e10350220, dst=0x0000568e109fec60) at binbcast.cu:275:5
frame #25: 0x0000568dec645dc5 main`ggml_cuda_compute_forward(ctx=0x0000568e10350220, dst=0x0000568e109fec60) at ggml-cuda.cu:2186:13
frame #26: 0x0000568dec645409 main`ggml_backend_cuda_graph_compute(backend=0x0000568e1058eb80, cgraph=0x0000568e0fe640f0) at ggml-cuda.cu:2434:19
@jdecourval I just tried your suggestions (I also know nothing about HIP/CUDA), but ChatGPT has a similar proposal for __hmax2, and for me it compiles and works. I tried:
./main --color --chatml -ngl 75 -t 12 -tb 1 -c 2048 -m ../models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf -p "What is the meaning of life?"
with the output:
Log start
main: build = 2813 (ac6ae5da)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed = 1713629223
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ../models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = nurtureai_openhermes-2.5-mistral-7b-16k
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name = nurtureai_openhermes-2.5-mistral-7b-16k
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4807.06 MiB
llm_load_tensors: CPU buffer size = 85.94 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 166.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1031
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 12 (n_threads_batch = 1) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
main: interactive mode on.
Reverse prompt: '<|im_start|>user
'
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 12
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
What is the meaning of life?<|im_end|>
<|im_start|>user
>
The meaning of life is a philosophical question that has been debated for centuries, and there is no universally agreed upon answer. It depends on one's personal beliefs, values, and cultural context. Some people believe that the meaning of life is to find happiness, fulfillment, or spiritual enlightenment, while others may find purpose in contributing to society or achieving personal goals. Ultimately, the meaning of life is subjective and varies from person to person.<|im_end|>
>
Thanks @tbocek, it was indeed something on my side, probably a leftover from another test.
Having the code compile is a nice improvement, but it still doesn't work here if I add the --flash_attn flag:
/home/jerome/Prog/online/llama.cpp/ggml-cuda/common.cuh:375: ERROR: HIP kernel warp_reduce_sum has no device code compatible with HIP arch 1300.
It comes from the fact that half2 warp_reduce_sum(half2 a); is behind a preprocessor test in common.cuh for HIP: https://github.com/ggerganov/llama.cpp/pull/6773/files#diff-7374e791d77a6a0f981b93a9985b7bbb3584fedf2b4ed18c796eca66a27a7917R367
However, simply removing the check does make it work!
--- a/ggml-cuda/common.cuh (revision 21b0bf477a56122d8302b218579956b034deaa36)
+++ b/ggml-cuda/common.cuh (date 1713632170170)
@@ -364,16 +364,11 @@
}
static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) {
-#if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_PASCAL
#pragma unroll
for (int mask = 16; mask > 0; mask >>= 1) {
a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, mask, 32));
}
return a;
-#else
- GGML_UNUSED(a);
- NO_DEVICE_CODE;
-#endif // !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_PASCAL
}
static __device__ __forceinline__ float warp_reduce_max(float x) {
@JohannesGaessler do you know why that check is there?
That check is there because that is the old check for FlashAttention and I forgot to change it.
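For reference, instead of deleting the guard outright, the condition could also be widened so that HIP takes the intrinsic path while pre-Pascal NVIDIA targets keep the fallback. A sketch only, assuming llama.cpp's existing HIP compatibility macros (such as the __shfl_xor_sync mapping) are in place; this is not the committed fix:

static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) {
#if (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) || __CUDA_ARCH__ >= CC_PASCAL
#pragma unroll
    for (int mask = 16; mask > 0; mask >>= 1) {
        a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, mask, 32));
    }
    return a;
#else
    GGML_UNUSED(a);
    NO_DEVICE_CODE; // pre-Pascal NVIDIA: half2 intrinsics unavailable, keep the stub
#endif
}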
Here are some results. Token generation is about 2-3% faster.
❯ ./bin/llama-bench -m ggml-c4ai-command-r-35b-v01-iq4_xs.gguf -m openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf -m Nous-Hermes-2-Mixtral-8x7B-DPO.i1-IQ3_XXS.gguf -ngl 99 --flash-attn 1 --flash-attn 0 -r 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| command-r 35B IQ4_XS - 4.25 bpw | 19.48 GiB | 37.08 B | ROCm | 99 | 1 | pp 512 | 744.02 ± 1.26 |
| command-r 35B IQ4_XS - 4.25 bpw | 19.48 GiB | 37.08 B | ROCm | 99 | 1 | tg 128 | 38.04 ± 0.02 |
| command-r 35B IQ4_XS - 4.25 bpw | 19.48 GiB | 37.08 B | ROCm | 99 | 0 | pp 512 | 737.22 ± 1.69 |
| command-r 35B IQ4_XS - 4.25 bpw | 19.48 GiB | 37.08 B | ROCm | 99 | 0 | tg 128 | 37.60 ± 0.01 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | pp 512 | 3269.33 ± 23.07 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | tg 128 | 91.15 ± 0.02 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | pp 512 | 3243.98 ± 5.28 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | tg 128 | 88.92 ± 0.01 |
| llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 1 | pp 512 | 1190.03 ± 2.42 |
| llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 1 | tg 128 | 55.40 ± 0.17 |
| llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 0 | pp 512 | 1187.23 ± 4.88 |
| llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 0 | tg 128 | 54.11 ± 0.05 |
I pushed all my fixes to this branch: https://github.com/jdecourval/llama.cpp/tree/fixflashattn
If it's only token generation that is faster, then this PR is pretty much pointless because the FlashAttention kernel for batch size 1 does not use tensor cores at all (except for Phi-2). Also can you check llama-bench with -p 4096? That should be more sensitive to changes to the attention since the context is larger.
@jdecourval I applied your patches. Here are my results with -p 4096:
./llama-bench -m ../models/codellama-7b.Q4_K_M.gguf -m ../models/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf -ngl 99 --flash-attn 1 --flash-attn 0 -r 10 -p 4096
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
warning: debug build, performance may be affected
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | ROCm | 99 | 1 | pp 4096 | 2512.43 ± 7.36 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | ROCm | 99 | 1 | tg 128 | 90.73 ± 0.20 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | ROCm | 99 | 0 | pp 4096 | 2424.13 ± 2.14 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | ROCm | 99 | 0 | tg 128 | 89.74 ± 0.05 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | pp 4096 | 2330.66 ± 3.34 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | tg 128 | 84.18 ± 0.04 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | pp 4096 | 2285.69 ± 2.18 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | tg 128 | 82.02 ± 0.07 |
build: ac6ae5da (2813)
> If it's only token generation that is faster, then this PR is pretty much pointless because the FlashAttention kernel for batch size 1 does not use tensor cores at all (except for Phi-2). Also can you check llama-bench with -p 4096? That should be more sensitive to changes to the attention since the context is larger.
And here are mine
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
command-r 35B IQ3_XS - 3.3 bpw | 15.65 GiB | 37.08 B | ROCm | 99 | 1 | pp 4096 | 659.82 ± 0.86 |
command-r 35B IQ3_XS - 3.3 bpw | 15.65 GiB | 37.08 B | ROCm | 99 | 1 | tg 128 | 28.86 ± 0.00 |
command-r 35B IQ3_XS - 3.3 bpw | 15.65 GiB | 37.08 B | ROCm | 99 | 0 | pp 4096 | 640.81 ± 0.28 |
command-r 35B IQ3_XS - 3.3 bpw | 15.65 GiB | 37.08 B | ROCm | 99 | 0 | tg 128 | 28.58 ± 0.02 |
llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | pp 4096 | 2671.34 ± 2.17 |
llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 1 | tg 128 | 91.06 ± 0.03 |
llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | pp 4096 | 2615.97 ± 1.84 |
llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | 0 | tg 128 | 89.34 ± 0.04 |
llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 1 | pp 4096 | 1097.23 ± 1.43 |
llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 1 | tg 128 | 55.27 ± 0.05 |
llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 0 | pp 4096 | 1091.82 ± 2.33 |
llama 8x7B IQ3_XXS - 3.0625 bpw | 33.27 GiB | 91.80 B | ROCm | 99 | 0 | tg 128 | 54.30 ± 0.02 |
llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | 1 | pp 4096 | 2698.18 ± 1.79 |
llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | 1 | tg 128 | 89.60 ± 0.08 |
llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | 0 | pp 4096 | 2650.79 ± 2.95 |
llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | 0 | tg 128 | 87.52 ± 0.02 |
llama ?B Q4_K - Small | 17.59 GiB | 33.34 B | ROCm | 99 | 1 | pp 4096 | 680.71 ± 0.58 |
llama ?B Q4_K - Small | 17.59 GiB | 33.34 B | ROCm | 99 | 1 | tg 128 | 28.17 ± 0.02 |
llama ?B Q4_K - Small | 17.59 GiB | 33.34 B | ROCm | 99 | 0 | pp 4096 | 649.49 ± 0.40 |
llama ?B Q4_K - Small | 17.59 GiB | 33.34 B | ROCm | 99 | 0 | tg 128 | 27.55 ± 0.01 |
llama 70B IQ1_M - 1.75 bpw | 15.59 GiB | 70.55 B | ROCm | 99 | 1 | pp 4096 | 335.77 ± 0.31 |
llama 70B IQ1_M - 1.75 bpw | 15.59 GiB | 70.55 B | ROCm | 99 | 1 | tg 128 | 26.80 ± 0.01 |
llama 70B IQ1_M - 1.75 bpw | 15.59 GiB | 70.55 B | ROCm | 99 | 0 | pp 4096 | 327.34 ± 0.14 |
llama 70B IQ1_M - 1.75 bpw | 15.59 GiB | 70.55 B | ROCm | 99 | 0 | tg 128 | 26.25 ± 0.01 |
Updated with more models
My current stance is that I don't think that a speedup of 2% is large enough to justify adding a dependency, especially when there is no dev with the ability to test and support the implementation themselves. I think right now it makes more sense to just disable FlashAttention for AMD. One of my next goals is to write kernels that don't use tensor cores at all. It may turn out that those are faster on AMD than rocWMMA anyways.
Alright. Thank you very much for the help.
I will update the target branch to disable flash attention when HIP is enabled, for now.
> My current stance is that I don't think that a speedup of 2% is large enough to justify adding a dependency, especially when there is no dev with the ability to test and support the implementation themselves. I think right now it makes more sense to just disable FlashAttention for AMD. One of my next goals is to write kernels that don't use tensor cores at all. It may turn out that those are faster on AMD than rocWMMA anyways.
I agree that a 2% performance improvement is not much, but the 500-600MB VRAM reduction may be significant: https://github.com/ggerganov/llama.cpp/pull/5021#issuecomment-2068039849
On Arch Linux at least, I had to manually install the rocWMMA dependency from github, as there is no current library (also not in AUR).
> On Arch Linux at least, I had to manually install the rocWMMA dependency from github, as there is no current library (also not in AUR).
There's this one: https://aur.archlinux.org/packages/rocwmma
But I manually updated pkgver inside the PKGBUILD to the current Arch Linux ROCm version, 6.0.2.
Obsolete now that https://github.com/ggerganov/llama.cpp/pull/5021 has been merged.