flash-attention
llama_new_context_with_model: flash_attn is not compatible with attn_soft_cap - forcing off
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
CUDA_VISIBLE_DEVICES=0 ./llama-server --host 0.0.0.0 --port 8008 -m /home/kemove/model/gemma-2-27b-it-Q5_K_S.gguf -ngl 99 -t 4 -np 4 -ns 4 -c 512 -fa
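Gemma 2 models use attention logit soft-capping, which the llama.cpp flash-attention path does not support, so even with -fa on the command line the context setup disables flash attention and falls back to the regular attention kernel. Below is a minimal standalone sketch of the kind of guard that emits this warning; the struct and field names (flash_attn, attn_soft_cap) are assumptions for illustration, not the exact llama.cpp source.

#include <cstdio>

// Sketch only: hypothetical stand-ins for llama.cpp's context/model params.
struct cparams_sketch { bool flash_attn; };     // set by the -fa flag
struct hparams_sketch { bool attn_soft_cap; };  // set for Gemma 2 style GGUFs

// Guard that mirrors the logged behaviour: if the model uses soft-capping,
// force flash attention off and warn, instead of failing at runtime.
static void check_flash_attn(cparams_sketch & cparams, const hparams_sketch & hparams) {
    if (cparams.flash_attn && hparams.attn_soft_cap) {
        std::fprintf(stderr, "%s: flash_attn is not compatible with attn_soft_cap - forcing off\n", __func__);
        cparams.flash_attn = false;  // fall back to the non-flash attention kernel
    }
}

int main() {
    cparams_sketch cparams = { /*flash_attn=*/true };      // user passed -fa
    hparams_sketch hparams = { /*attn_soft_cap=*/true };   // Gemma 2 sets soft-capping
    check_flash_attn(cparams, hparams);
    return cparams.flash_attn ? 0 : 1;                     // here: flash attention ends up off
}

In practice this means the server still starts and serves requests; it simply runs without the flash-attention speedup for this model.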