flash-attention
llama_new_context_with_model: flash_attn is not compatible with attn_soft_cap - forcing off
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
CUDA_VISIBLE_DEVICES=0 ./llama-server --host 0.0.0.0 --port 8008 -m /home/kemove/model/gemma-2-27b-it-Q5_K_S.gguf -ngl 99 -t 4 -np 4 -ns 4 -c 512 -fa
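Gemma 2 models use attention logit soft-capping, which the llama.cpp flash-attention path does not support, so even with -fa on the command line the context setup disables flash attention and falls back to the regular attention kernel. Below is a minimal standalone sketch of the kind of guard that emits this warning; the struct and field names (flash_attn, attn_soft_cap) are assumptions for illustration, not the exact llama.cpp source.

#include <cstdio>

// Sketch only: hypothetical stand-ins for llama.cpp's context/model params.
struct cparams_sketch { bool flash_attn; };     // set by the -fa flag
struct hparams_sketch { bool attn_soft_cap; };  // set for Gemma 2 style GGUFs

// Guard that mirrors the logged behaviour: if the model uses soft-capping,
// force flash attention off and warn, instead of failing at runtime.
static void check_flash_attn(cparams_sketch & cparams, const hparams_sketch & hparams) {
    if (cparams.flash_attn && hparams.attn_soft_cap) {
        std::fprintf(stderr, "%s: flash_attn is not compatible with attn_soft_cap - forcing off\n", __func__);
        cparams.flash_attn = false;  // fall back to the non-flash attention kernel
    }
}

int main() {
    cparams_sketch cparams = { /*flash_attn=*/true };      // user passed -fa
    hparams_sketch hparams = { /*attn_soft_cap=*/true };   // Gemma 2 sets soft-capping
    check_flash_attn(cparams, hparams);
    return cparams.flash_attn ? 0 : 1;                     // here: flash attention ends up off
}

In practice this means the server still starts and serves requests; it simply runs without the flash-attention speedup for this model.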