Forkoz
Keeping the base model around becomes unmanageable at 70B+, that's part of the issue. They're 160GB+
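Rough size arithmetic for context, assuming fp16/bf16 weights at 2 bytes per parameter (actual repos vary with extra shards and formats):

```python
# Rough full-precision checkpoint size: parameters * bytes per parameter.
def checkpoint_size_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Size in GB assuming fp16/bf16 weights (2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(checkpoint_size_gb(70))  # ~140 GB for a 70B base model
print(checkpoint_size_gb(80))  # ~160 GB; larger models and extra shards push well past this
```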
I have flash attention installed and compiled it from source to support the new torch, but it still says it isn't found. Will double-check it. I recompiled it again after...
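A quick sanity check for the "not found" case: just probe whether the flash_attn package is importable from the same environment the engine runs in, and which torch/CUDA it sees (the version attribute is standard, but exact wheel layouts differ by build):

```python
# Check whether flash-attn is actually importable from this environment,
# and which torch/CUDA the current install reports.
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as e:
    # A mismatch between the torch used to build flash-attn and the torch
    # installed now usually surfaces here as an undefined-symbol error.
    print("flash_attn import failed:", e)
```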
I thought vLLM supported a Triton-based FA for all (tensor) cards; I was hoping to try it here, but instead it used the normal FA package.
I got further when I loaded a GPTQ model. It turns out you have to specify the quantization or else you will get an OOM. This isn't very intuitive. Unfortunately I'm...
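For reference, the bit that gets past the OOM, as a minimal sketch assuming the vLLM-style Python API that Aphrodite mirrors (the model name and parallel size are placeholders); without the quantization argument the weights get sized as if they were fp16:

```python
# Minimal sketch: load a GPTQ checkpoint with quantization set explicitly,
# assuming the vLLM-style API. The model repo is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Some-70B-GPTQ",   # placeholder GPTQ repo
    quantization="gptq",              # without this the load can OOM as if fp16
    dtype="float16",
    tensor_parallel_size=2,
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```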
It's only 2GB less than a 3090. Compute-wise, yeah, it's a bit slower. When used with pure exllama or other engines the hit isn't that bad. When I try...
So to compile I still need to do it on CUDA 11.8? I am using a 12.x conda environment and had trouble. It wasn't able to find ninja despite it being installed...
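On the ninja issue, a quick way to check what torch's extension builder actually sees from inside the environment, since a pip-installed ninja sometimes isn't on the PATH the build subprocess uses:

```python
# Probe the build environment the way torch's extension builder does:
# ninja visibility, the CUDA torch was built with, and CUDA_HOME.
import shutil
import torch
from torch.utils import cpp_extension

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)        # e.g. 11.8 vs 12.x
print("CUDA_HOME:", cpp_extension.CUDA_HOME)                # which nvcc the build will use
print("ninja on PATH:", shutil.which("ninja"))
print("ninja available to torch:", cpp_extension.is_ninja_available())
```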
Classic exllamav2 lets you fit that context; for some reason Aphrodite uses more.
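Rough KV-cache arithmetic that makes the gap easier to see; the shape numbers are an assumption (a Llama-2-70B-like config: 80 layers, 8 KV heads, head dim 128) and each engine adds its own overhead, with some preallocating the whole cache up front:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token / 1e9

print(kv_cache_gb(4096))    # ~1.3 GB at fp16
print(kv_cache_gb(32768))   # ~10.7 GB; engines that preallocate need this free up front
```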
Experiment with max memory and set it lower until it fills more of the other GPU first. It will come back around to the first GPU once it reaches your 2nd GPU's limit.
Keep trying lower limits until you get less on the first GPU. It can be as low as 15GB even.
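What that per-GPU cap experiment looks like, as a sketch assuming a HuggingFace-style loader with a max_memory map (the model path and numbers are placeholders to tune); capping GPU 0 low pushes layers onto GPU 1 first, and allocation only wraps back to GPU 0 once GPU 1's limit is reached:

```python
# Sketch: cap GPU 0 so the loader fills GPU 1 first, assuming a HF-style
# device_map="auto" loader (needs accelerate installed). Values are placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some/70b-gptq-model",                # placeholder
    device_map="auto",
    max_memory={0: "15GiB", 1: "22GiB"},  # keep lowering the GPU-0 cap until the split looks right
)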
That's true, but all the issues I searched had other errors. Any idea which commits did it? Maybe there's a fix that can be made.