llama.cpp
AMD MI300 GPU (gfx942) performance is lower than expected
Question
The MI300 (gfx942) is supposed to be faster, but I am only getting 11.93 tokens per second.
Here is my inference command:
./main -m ./models/llama-2-7b-chat.Q2_K.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I am running the Ubuntu Docker image rocm/pytorch:latest.
OS:
Linux version 6.2.0-35-generic (buildd@bos03-amd64-016) (x86_64-linux-gnu-gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct 6 10:23:26 UTC 2
llama_print_timings: load time = 3050.38 ms
llama_print_timings: sample time = 3.54 ms / 120 runs ( 0.03 ms per token, 33869.60 tokens per second)
llama_print_timings: prompt eval time = 176.38 ms / 19 tokens ( 9.28 ms per token, 107.73 tokens per second)
llama_print_timings: eval time = 9971.03 ms / 119 runs ( 83.79 ms per token, 11.93 tokens per second)
llama_print_timings: total time = 10174.64 ms / 138 tokens
Here is my build command:
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx942
You forgot "-ngl" to offload layers to the GPU.
Thank you, @timschwartz! But when I test with ./main -m ./models/llama-2-7b-chat.Q2_K.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33
there seems to be no difference.
Can you paste the whole output of main?
There are a few things I can think of that could be slowing you down.
First is LLAMA_HIP_UMA=1: it is meant for integrated graphics on the CPU and will slow down actual GPUs because it uses RAM instead of VRAM, so try compiling without it.
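For example, the same build command as before minus the UMA flag (assuming nothing else in your setup needs to change):
make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx942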
Next, gfx942 isn't included in the polyfill for __dp4a(), so the compiler might be making slow choices like it did for gfx11 GPUs: https://github.com/ggerganov/llama.cpp/issues/3701
https://github.com/ggerganov/llama.cpp/blob/0d56246f4b9764158525d894b96606f6163c53a8/ggml-cuda/common.cuh#L366-L367
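To make the kind of change I mean concrete, here is a rough, self-contained sketch; the guard shape and the availability of __builtin_amdgcn_sdot4 on gfx942 are assumptions on my part, not copied from common.cuh or verified against the MI300 ISA:

// Hypothetical sketch: let gfx942 (MI300) take the hardware dot-product path
// in a __dp4a-style polyfill instead of the slow byte-wise fallback.
#include <hip/hip_runtime.h>
#include <cstdint>

static __device__ __forceinline__ int dp4a_polyfill(const int a, const int b, int c) {
#if defined(__gfx906__) || defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx942__)
    // Packed 4x int8 dot product with accumulate (sdot4 support on gfx942 is assumed).
    c = __builtin_amdgcn_sdot4(a, b, c, false);
#else
    // Fallback: unpack the four int8 lanes and accumulate manually.
    const int8_t * va = reinterpret_cast<const int8_t *>(&a);
    const int8_t * vb = reinterpret_cast<const int8_t *>(&b);
    for (int i = 0; i < 4; ++i) {
        c += va[i] * vb[i];
    }
#endif
    return c;
}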
Finally, llama.cpp can either dequantize weights and use hipBLAS, or use its own mul_mat_q kernels for quantized matrix multiplication. gfx942 is going down the same code path as Vega, so I think it would be worth changing the checks in ggml_cuda_mul_mat() in ggml-cuda.cu to see the speed difference with gfx942 treated the same way as RDNA3, i.e. using hipBLAS.
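Just to illustrate the dispatch experiment (these identifiers are hypothetical, not the actual ggml-cuda.cu code):

// Hypothetical illustration: route gfx942 to the dequantize + hipBLAS branch
// that RDNA3 already uses, instead of the Vega/mul_mat_q branch.
enum class AmdArch { Vega, RDNA3, Gfx942 };

static bool prefer_hipblas_for_quantized(AmdArch arch) {
    switch (arch) {
        case AmdArch::RDNA3:  return true;   // existing behaviour for RDNA3
        case AmdArch::Gfx942: return true;   // the experiment: same path as RDNA3
        default:              return false;  // Vega keeps the mul_mat_q kernels
    }
}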
n_threads = 112 / 224
Test --threads N. I don't know what's optimal for your system; usually it's best to start at 1, then increase and see whether token generation speed improves.
Here's how to verify the CPU is not oversaturated: https://github.com/ggerganov/llama.cpp/blob/f87f7b898651339fe173ddf016ca826163e899d8/docs/token_generation_performance_tips.md#verifying-that-the-cpu-is-not-oversaturated
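For example, your command from above pinned to a single thread as a starting point (then raise N step by step):
./main -m ./models/llama-2-7b-chat.Q2_K.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 --threads 1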
Hey, @xinyuli1204
However fast or slow the MI300X is, you seem to be a lucky user of a live instance! Right now, I'm trying to figure out how it will behave with Llama 3 and some other models from Hugging Face. If you're using a VM, would you mind sharing the name of the service you managed to rent it from? Also, there seems to be a general lack of information about the performance of this type of GPU. Did you happen to run any other experiments on it?