Forkoz
I just saw... I loaded it afterwards and got the CUDA assert. Then I turned off fused attention and it loads, but generation feels really slow. ok... nvm... I think...
It's choking at 3k context, down to 3-4 it/s even. The merged copy shouldn't have this problem, but it's 128g with no act-order and I wanted to try the original...
QLoRA has mildly better perplexity, and that probably carries over to training. But as you say, the same modules can be targeted and trained faster. You still sort of need...
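In case it helps, this is roughly what I mean by targeting the same modules; a sketch with peft, where the model path is a placeholder and the target list is the usual Llama-style set of linear layers:

```python
# Sketch only: "org/llama-13b" is a placeholder, and the target_modules
# list is the typical Llama-style all-linear set that QLoRA-style runs use.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("org/llama-13b", device_map="auto")
cfg = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, cfg)
model.print_trainable_parameters()  # only the adapter weights train
```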
Uses AWQ. I wonder about perplexity and memory performance of that format vs GPTQ.
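The standard sliding-window perplexity loop would answer the quality half at least; a rough sketch assuming an HF-loadable checkpoint (the model id, stride, and context length are placeholders):

```python
# Sketch: sliding-window perplexity on wikitext-2, per the usual HF recipe.
# model_id is a placeholder; swap in the AWQ and GPTQ checkpoints to compare.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_id = "org/llama-13b-awq"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

ctx, stride = 2048, 512
nlls = []
for begin in range(0, ids.size(1) - ctx, stride):
    chunk = ids[:, begin : begin + ctx].to(model.device)
    labels = chunk.clone()
    labels[:, :-stride] = -100  # only score the last `stride` tokens
    with torch.no_grad():
        nlls.append(model(chunk, labels=labels).loss * stride)
print("ppl:", torch.exp(torch.stack(nlls).sum() / (len(nlls) * stride)).item())
```

Memory is easier: just watch `torch.cuda.max_memory_allocated()` around the same loop.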
The paper probably doesn't compare against optimized exllama at 64g. Remember the SPQR paper doing something similar. I've noticed a lot of authors show very favorable results in their graphs and creatively omit...
ime airoboros doesn't use compress_pos_embed, and I found the best perplexity using alpha 2.7. The default 100k base gives worse results when I ran it as a lora...
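For reference, this is how alpha maps to the RoPE base as I understand it; treat the formula as an assumption (my recollection of the NTK-aware scaling), with head_dim 128 for llama:

```python
# Assumed NTK-aware alpha scaling: the RoPE base gets raised by
# alpha ** (d / (d - 2)), with d = head dimension (128 for llama).
def ntk_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    return base * alpha ** (head_dim / (head_dim - 2))

print(ntk_rope_base(2.7))  # ~27400, vs the stock 10000 base
```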
Your driver is probably fine. It's the venv. I too have cuda 12 on the system and then cuda 11.8 in the venv. I had to download all the cuda...
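Easy way to check what the venv's torch was actually built against, since nvidia-smi only shows the driver-level version:

```python
# The CUDA version torch was compiled with can differ from what
# nvidia-smi reports; this prints what the venv actually has.
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)  # e.g. 11.8 inside the venv
print("available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```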
This is why I like conda. A fresh environment with new cu118 torch and the reqs usually fixes things. Although I've yet to mess up a single conda env or venv...
That `cannot find -lespeak` error means the linker can't find libespeak. You need to install the espeak lib.
ime, triton was never faster for anything. High compute-capability requirements that exclude older cards, and slower speed on top, oh my. The only one that has pulled off merging adapters into quantized models is GGUF...
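Everywhere else you have to merge in full precision and re-quantize afterwards; a minimal sketch with peft (both paths are placeholders):

```python
# Sketch: the standard full-precision merge path. Placeholder paths.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("org/llama-13b", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()
merged.save_pretrained("llama-13b-merged")  # quantize this output afterwards
```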