CUDA: add option to compile without FlashAttention
Fixes https://github.com/ggml-org/llama.cpp/issues/11946.

I added an option `GGML_CUDA_NO_FA` that is used for CUDA, HIP, and MUSA.
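As a rough sketch of what this looks like on the build-system side (the option name matches this PR, but the placement in ggml's CMake files and the `add_compile_definitions` wiring here are assumptions, not the actual implementation):

```cmake
# Hypothetical sketch: an off-by-default "disable" option that maps to a
# preprocessor define shared by the CUDA, HIP, and MUSA backends.
option(GGML_CUDA_NO_FA "ggml: do not compile FlashAttention" OFF)

if (GGML_CUDA_NO_FA)
    # Backend sources can then guard the FlashAttention kernels with
    # #ifndef GGML_CUDA_NO_FA.
    add_compile_definitions(GGML_CUDA_NO_FA)
endif()
```

A user would opt out at configure time, e.g. `cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NO_FA=ON`.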
Two more general questions regarding compile options:

- Do we have guidelines regarding whether `ON` and `OFF` are relative to a feature being enabled or relative to the default compilation option? Basically, instead of `GGML_CUDA_NO_FA=OFF` I could have made `GGML_CUDA_FA=ON` the default (see the sketch after this list).
- Do we have guidelines regarding whether `CUDA` compilation options should be used for HIP and MUSA? I noticed that e.g. `GGML_CUDA_FORCE_CUBLAS` is used for all three, but instead of `GGML_CUDA_NO_VMM` there is e.g. `GGML_HIP_NO_VMM`.
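To make the first question concrete, here is a hypothetical illustration of the two polarities; neither line is taken from the actual build scripts:

```cmake
# Variant 1 (what this PR does): the option name describes disabling the
# feature, so the default is OFF and FlashAttention is compiled by default.
option(GGML_CUDA_NO_FA "ggml: do not compile FlashAttention" OFF)

# Variant 2 (the alternative): the option name describes the feature itself,
# so the default is ON and FlashAttention is still compiled by default.
option(GGML_CUDA_FA "ggml: compile FlashAttention" ON)
```

Both variants have the same default behavior; they differ only in how a user disables the feature, `-DGGML_CUDA_NO_FA=ON` versus `-DGGML_CUDA_FA=OFF`.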