Forkoz
FYI, ROCm also fails if you don't have PCIe 3.0 atomics. So I am screwed on Linux for my old box. If you want bitsandbytes for AMD.. it got set up...
Basically I tested this and found that there is no difference beyond just changing the KV_CACHE to fp16. Using autocasting and the like gives no benefit that I...
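For what it's worth, here's roughly what "just changing the KV cache to fp16" means in my test, versus wrapping the forward in autocast. This is a minimal illustrative sketch assuming the usual HF-style `past_key_values` tuple layout, not the exact code being discussed:

```python
# Illustrative sketch only: cast an HF-style past_key_values cache to fp16.
import torch

def cast_kv_cache_fp16(past_key_values):
    # past_key_values: tuple of (key, value) tensor pairs, one per layer
    return tuple(
        (k.to(torch.float16), v.to(torch.float16))
        for k, v in past_key_values
    )

# The alternative I saw no extra benefit from: autocasting the whole forward.
# with torch.autocast(device_type="cuda", dtype=torch.float16):
#     out = model(input_ids, past_key_values=past, use_cache=True)
```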
There's a vllm one that will work for all tensor-core cards: https://github.com/vllm-project/vllm/blob/main/vllm/attention/ops/triton_flash_attention.py. The current one only supports Ampere+. Not sure how to wrap it around your forwards.
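If it helps, the usual pattern for wrapping a kernel like that around an existing forward is to monkey-patch the attention call and fall back if it doesn't fit. The `triton_attention` import and its signature below are my assumptions about that file's entry point, so check the real code before relying on this:

```python
# Sketch: swap the attention math in a model's forward for an external kernel.
# NOTE: the import and call signature below are assumptions about
# triton_flash_attention.py -- verify the real entry point first.
import torch.nn.functional as F

def patched_attention(q, k, v, causal=True):
    try:
        # hypothetical import; the actual function name/arguments may differ
        from vllm.attention.ops.triton_flash_attention import triton_attention
        return triton_attention(q, k, v, causal=causal)
    except (ImportError, TypeError):
        # fall back to the stock PyTorch kernel if the import/signature is wrong
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

# Inside the model's attention forward, replace the matmul/softmax/matmul block
# (or the existing flash-attn call) with patched_attention(q, k, v).
```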
Curl says it failed to download it. It doesn't have write permission for the location it's downloading to.
Try running it elevated?
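Before going elevated, it might be worth confirming the download directory is actually writable; something like this (the path here is a placeholder for wherever the script writes to):

```python
# Quick sanity check that the download target is writable (placeholder path).
import os

target_dir = os.path.expanduser("~/Downloads")
print("exists:", os.path.isdir(target_dir))
print("writable:", os.access(target_dir, os.W_OK))
```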
Try doing the steps manually. Install a different conda.
Don't forget to recompile the CUDA kernel: `python setup_cuda.py install`
The model he is loading won't work with the current GPTQ afaik.
I tried loading old 4-bit models with -1 and it would error. I think everything has to be re-quantized for it to work with the new GPTQ.
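For what it's worth, one way to spot whether a checkpoint is in the newer format is to look for the `g_idx` tensors the new GPTQ code expects; older 4-bit checkpoints don't have them, which would explain the load errors. The key name here is my assumption about the new format, so treat this as a rough check:

```python
# Rough check: does a GPTQ checkpoint look like the newer format?
# Assumption: newer GPTQ-for-LLaMa checkpoints store a g_idx tensor per
# quantized layer, which old 4-bit checkpoints lack.
import torch

def looks_like_new_gptq(ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    if isinstance(state, dict) and "model" in state:
        state = state["model"]
    return any(name.endswith("g_idx") for name in state.keys())

# e.g. looks_like_new_gptq("llama-7b-4bit.pt") returning False would mean
# the model needs re-quantizing for the new GPTQ.
```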