llama.cpp
AMD MI300 GPU (gfx942) performance is lower than expected
Question
The MI300 (gfx942) is supposed to be faster, but I am only getting 11.93 tokens per second.
Here is my inference command:
./main -m ./models/llama-2-7b-chat.Q2_K.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I am running the Ubuntu Docker image rocm/pytorch:latest.
OS:
Linux version 6.2.0-35-generic (buildd@bos03-amd64-016) (x86_64-linux-gnu-gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct 6 10:23:26 UTC 2
llama_print_timings: load time = 3050.38 ms
llama_print_timings: sample time = 3.54 ms / 120 runs ( 0.03 ms per token, 33869.60 tokens per second)
llama_print_timings: prompt eval time = 176.38 ms / 19 tokens ( 9.28 ms per token, 107.73 tokens per second)
llama_print_timings: eval time = 9971.03 ms / 119 runs ( 83.79 ms per token, 11.93 tokens per second)
llama_print_timings: total time = 10174.64 ms / 138 tokens
Here is my build command:
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx942
You forgot "-ngl" to offload layers to the GPU.
Thank you, @timschwartz! But when I test with ./main -m ./models/llama-2-7b-chat.Q2_K.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33
there seems to be no difference.
Can you paste the whole output of main?
There are a few things I can think of that could be slowing you down.
First is LLAMA_HIP_UMA=1: it is meant for integrated graphics on the CPU and will slow down actual GPUs because it uses RAM instead of VRAM, so try compiling without it.
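For example, the same build command as before minus the UMA flag (assuming nothing else in your setup needs to change):
make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx942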
Next, gfx942 isn't included in the polyfill for __dp4a(), so the compiler might be making slow choices like it did for gfx11 GPUs: https://github.com/ggerganov/llama.cpp/issues/3701
https://github.com/ggerganov/llama.cpp/blob/0d56246f4b9764158525d894b96606f6163c53a8/ggml-cuda/common.cuh#L366-L367
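To make the kind of change I mean concrete, here is a rough, self-contained sketch; the guard shape and the availability of __builtin_amdgcn_sdot4 on gfx942 are assumptions on my part, not copied from common.cuh or verified against the MI300 ISA:

// Hypothetical sketch: let gfx942 (MI300) take the hardware dot-product path
// in a __dp4a-style polyfill instead of the slow byte-wise fallback.
#include <hip/hip_runtime.h>
#include <cstdint>

static __device__ __forceinline__ int dp4a_polyfill(const int a, const int b, int c) {
#if defined(__gfx906__) || defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx942__)
    // Packed 4x int8 dot product with accumulate (sdot4 support on gfx942 is assumed).
    c = __builtin_amdgcn_sdot4(a, b, c, false);
#else
    // Fallback: unpack the four int8 lanes and accumulate manually.
    const int8_t * va = reinterpret_cast<const int8_t *>(&a);
    const int8_t * vb = reinterpret_cast<const int8_t *>(&b);
    for (int i = 0; i < 4; ++i) {
        c += va[i] * vb[i];
    }
#endif
    return c;
}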
Finally, llama.cpp can either dequantize weights and use hipBLAS, or use its own mul_mat_q kernels for quantized matrix multiplication. gfx942 is going down the same code path as Vega, so I think it would be worth changing the checks in ggml_cuda_mul_mat() in ggml-cuda.cu to see the speed difference with gfx942 treated the same way as RDNA3, i.e. using hipBLAS.
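Just to illustrate the dispatch experiment (these identifiers are hypothetical, not the actual ggml-cuda.cu code):

// Hypothetical illustration: route gfx942 to the dequantize + hipBLAS branch
// that RDNA3 already uses, instead of the Vega/mul_mat_q branch.
enum class AmdArch { Vega, RDNA3, Gfx942 };

static bool prefer_hipblas_for_quantized(AmdArch arch) {
    switch (arch) {
        case AmdArch::RDNA3:  return true;   // existing behaviour for RDNA3
        case AmdArch::Gfx942: return true;   // the experiment: same path as RDNA3
        default:              return false;  // Vega keeps the mul_mat_q kernels
    }
}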
n_threads = 112 / 224
Test --threads N. I don't know what's optimal for your system; usually it's best to start at 1, then increase and see whether token generation speed improves.
Here's how to verify the CPU is not oversaturated: https://github.com/ggerganov/llama.cpp/blob/f87f7b898651339fe173ddf016ca826163e899d8/docs/token_generation_performance_tips.md#verifying-that-the-cpu-is-not-oversaturated
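For example, your command from above pinned to a single thread as a starting point (then raise N step by step):
./main -m ./models/llama-2-7b-chat.Q2_K.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 --threads 1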
Hey, @xinyuli1204
However fast or slow the MI300X is, you seem to be a lucky user of a live instance! Right now, I'm trying to figure out how it will behave with Llama 3 and some other models from Hugging Face. If you're using a VM, would you mind sharing the name of the service you managed to rent it from? Also, there seems to be a general lack of information about the performance of this type of GPU. Did you happen to run any other experiments on it?