
CUDA: add option to compile without FlashAttention

Open JohannesGaessler opened this pull request 9 hours ago • 0 comments

Fixes https://github.com/ggml-org/llama.cpp/issues/11946 .

I added an option GGML_CUDA_NO_FA that applies to the CUDA, HIP, and MUSA backends. Two more general questions about compile options:

  • Do we have guidelines on whether ON and OFF should describe the feature itself or the deviation from the default? In other words, instead of GGML_CUDA_NO_FA defaulting to OFF, I could have made GGML_CUDA_FA default to ON.
  • Do we have guidelines on whether CUDA compile options should also be used for HIP and MUSA? I noticed that GGML_CUDA_FORCE_CUBLAS, for example, is used for all three backends, but instead of GGML_CUDA_NO_VMM there is a separate GGML_HIP_NO_VMM.
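For context, the two polarities from the first question could be declared in CMake roughly as follows. This is only a sketch: GGML_CUDA_NO_FA is the name from this PR, GGML_CUDA_FA is the hypothetical alternative, and the help strings are illustrative rather than quoted from the patch.

```cmake
# Negative-polarity option (the approach taken in this PR):
# FlashAttention is compiled by default and the user opts out.
option(GGML_CUDA_NO_FA "ggml: do not compile FlashAttention CUDA kernels" OFF)

# Positive-polarity alternative (hypothetical): same default behavior,
# but ON/OFF now maps directly to the feature being enabled.
option(GGML_CUDA_FA "ggml: compile FlashAttention CUDA kernels" ON)
```

With either spelling the default build keeps FlashAttention; a user would disable it with something like `cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NO_FA=ON` (or `-DGGML_CUDA_FA=OFF` in the alternative).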

JohannesGaessler · Feb 22 '25 13:02