Forkoz
Keeping the base model around becomes unmanageable at 70B+, that's part of the issue. They're 160GB+
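Rough size arithmetic for context, assuming fp16/bf16 weights at 2 bytes per parameter (actual repos vary with extra shards and formats):

```python
# Rough full-precision checkpoint size: parameters * bytes per parameter.
def checkpoint_size_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Size in GB assuming fp16/bf16 weights (2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(checkpoint_size_gb(70))  # ~140 GB for a 70B base model
print(checkpoint_size_gb(80))  # ~160 GB; larger models and extra shards push well past this
```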
I have flash attention installed and compiled it from source to support the new torch, but it still says it isn't found. Will double-check it. I recompiled it again after...
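A quick sanity check for the "not found" case: just probe whether the flash_attn package is importable from the same environment the engine runs in, and which torch/CUDA it sees (the version attribute is standard, but exact wheel layouts differ by build):

```python
# Check whether flash-attn is actually importable from this environment,
# and which torch/CUDA the current install reports.
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as e:
    # A mismatch between the torch used to build flash-attn and the torch
    # installed now usually surfaces here as an undefined-symbol error.
    print("flash_attn import failed:", e)
```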
I thought vLLM supported a Triton-based FA for all (tensor) cards; I was hoping to try it here, but instead it used the normal FA package.
I got further when I loaded a GPTQ model. It turns out you have to specify the quantization or else you will get an OOM. This isn't very intuitive. Unfortunately I'm...
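For reference, the bit that gets past the OOM, as a minimal sketch assuming the vLLM-style Python API that Aphrodite mirrors (the model name and parallel size are placeholders); without the quantization argument the weights get sized as if they were fp16:

```python
# Minimal sketch: load a GPTQ checkpoint with quantization set explicitly,
# assuming the vLLM-style API. The model repo is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Some-70B-GPTQ",   # placeholder GPTQ repo
    quantization="gptq",              # without this the load can OOM as if fp16
    dtype="float16",
    tensor_parallel_size=2,
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```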
It's only 2GB less than a 3090. Compute-wise, yeah, it's a bit slower. When used with pure exllama or other engines the hit isn't that bad. When I try...
So to compile I still need to do it on CUDA 11.8? I am using a 12.x conda environment and had trouble. It wasn't able to find ninja despite it being installed...
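On the ninja issue, a quick way to check what torch's extension builder actually sees from inside the environment, since a pip-installed ninja sometimes isn't on the PATH the build subprocess uses:

```python
# Probe the build environment the way torch's extension builder does:
# ninja visibility, the CUDA torch was built with, and CUDA_HOME.
import shutil
import torch
from torch.utils import cpp_extension

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)        # e.g. 11.8 vs 12.x
print("CUDA_HOME:", cpp_extension.CUDA_HOME)                # which nvcc the build will use
print("ninja on PATH:", shutil.which("ninja"))
print("ninja available to torch:", cpp_extension.is_ninja_available())
```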
Classic exllamav2 lets you fit that context; for some reason Aphrodite uses more.
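Rough KV-cache arithmetic that makes the gap easier to see; the shape numbers are an assumption (a Llama-2-70B-like config: 80 layers, 8 KV heads, head dim 128) and each engine adds its own overhead, with some preallocating the whole cache up front:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token / 1e9

print(kv_cache_gb(4096))    # ~1.3 GB at fp16
print(kv_cache_gb(32768))   # ~10.7 GB; engines that preallocate need this free up front
```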
Experiment with max memory and set it lower until it fills more of the other GPU first. It will come back around to the first GPU once it reaches your 2nd GPU's limit.
Keep trying lower limits until you get less on the first GPU. It can be as low as 15GB even.
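What that per-GPU cap experiment looks like, as a sketch assuming a HuggingFace-style loader with a max_memory map (the model path and numbers are placeholders to tune); capping GPU 0 low pushes layers onto GPU 1 first, and allocation only wraps back to GPU 0 once GPU 1's limit is reached:

```python
# Sketch: cap GPU 0 so the loader fills GPU 1 first, assuming a HF-style
# device_map="auto" loader (needs accelerate installed). Values are placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some/70b-gptq-model",                # placeholder
    device_map="auto",
    max_memory={0: "15GiB", 1: "22GiB"},  # keep lowering the GPU-0 cap until the split looks right
)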
That's true, but all the issues I searched had other errors. Any idea which commits did it? Maybe there's a fix that can be made.