Forkoz
FYI, ROCm also fails if you don't have PCIe 3.0 atomics. So I am screwed on Linux for my old box. If you want bitsandbytes for AMD.. it got set up...
Basically I tested this and found that there is no difference beyond just changing the KV_CACHE to fp16. Using autocasting and the like gives no benefit that I...
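For what it's worth, here's roughly what "just changing the KV cache to fp16" means in my test, versus wrapping the forward in autocast. This is a minimal illustrative sketch assuming the usual HF-style `past_key_values` tuple layout, not the exact code being discussed:

```python
# Illustrative sketch only: cast an HF-style past_key_values cache to fp16.
import torch

def cast_kv_cache_fp16(past_key_values):
    # past_key_values: tuple of (key, value) tensor pairs, one per layer
    return tuple(
        (k.to(torch.float16), v.to(torch.float16))
        for k, v in past_key_values
    )

# The alternative I saw no extra benefit from: autocasting the whole forward.
# with torch.autocast(device_type="cuda", dtype=torch.float16):
#     out = model(input_ids, past_key_values=past, use_cache=True)
```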
There's a vllm one that will work for all tensor-core cards: https://github.com/vllm-project/vllm/blob/main/vllm/attention/ops/triton_flash_attention.py. The current one only supports Ampere+. Not sure how to wrap it around your forwards.
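If it helps, the usual pattern for wrapping a kernel like that around an existing forward is to monkey-patch the attention call and fall back if it doesn't fit. The `triton_attention` import and its signature below are my assumptions about that file's entry point, so check the real code before relying on this:

```python
# Sketch: swap the attention math in a model's forward for an external kernel.
# NOTE: the import and call signature below are assumptions about
# triton_flash_attention.py -- verify the real entry point first.
import torch.nn.functional as F

def patched_attention(q, k, v, causal=True):
    try:
        # hypothetical import; the actual function name/arguments may differ
        from vllm.attention.ops.triton_flash_attention import triton_attention
        return triton_attention(q, k, v, causal=causal)
    except (ImportError, TypeError):
        # fall back to the stock PyTorch kernel if the import/signature is wrong
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

# Inside the model's attention forward, replace the matmul/softmax/matmul block
# (or the existing flash-attn call) with patched_attention(q, k, v).
```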
Curl says it failed to download it. It doesn't have write permission for the location it's downloading to.
Try running it elevated?
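Before going elevated, it might be worth confirming the download directory is actually writable; something like this (the path here is a placeholder for wherever the script writes to):

```python
# Quick sanity check that the download target is writable (placeholder path).
import os

target_dir = os.path.expanduser("~/Downloads")
print("exists:", os.path.isdir(target_dir))
print("writable:", os.access(target_dir, os.W_OK))
```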
Try doing the steps manually. Install a different conda.
Don't forget to recompile the CUDA kernel: `python setup_cuda.py install`
The model he is loading won't work with the current GPTQ afaik.
I tried loading old 4-bit models with -1 and it would error. I think everything has to be re-quantized for it to work with the new GPTQ.
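For what it's worth, one way to spot whether a checkpoint is in the newer format is to look for the `g_idx` tensors the new GPTQ code expects; older 4-bit checkpoints don't have them, which would explain the load errors. The key name here is my assumption about the new format, so treat this as a rough check:

```python
# Rough check: does a GPTQ checkpoint look like the newer format?
# Assumption: newer GPTQ-for-LLaMa checkpoints store a g_idx tensor per
# quantized layer, which old 4-bit checkpoints lack.
import torch

def looks_like_new_gptq(ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    if isinstance(state, dict) and "model" in state:
        state = state["model"]
    return any(name.endswith("g_idx") for name in state.keys())

# e.g. looks_like_new_gptq("llama-7b-4bit.pt") returning False would mean
# the model needs re-quantizing for the new GPTQ.
```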