Forkoz
You all have a correct GPU-accelerated bitsandbytes on Windows, right? And a proper CUDA build of torch, etc.? Because it sounds like it isn't so.
This model already works. It is just pythia-12b fine-tuned.
So this is why I couldn't load the models after I fixed the ) bug. But now we can quantize with different group sizes. Which one is the best for...
6 is your entire GPU; leave some room for the browser and windows/xorg/etc.
Can we use a value like 3.5 for this? I only tried whole numbers, but it sticks when I put --gpu-memory 20 or 22. It used to go over before...
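A minimal sketch of how a fractional limit could be expressed, assuming the --gpu-memory budget ultimately ends up as an Accelerate-style max_memory string (like "20GiB") passed to from_pretrained with device_map="auto" — the helper name and the exact plumbing inside the webui are assumptions here:

```python
def gpu_memory_string(gib: float) -> str:
    """Convert a (possibly fractional) GiB budget to a max_memory string.

    Accelerate accepts sizes as strings with a unit suffix, so a
    fractional GiB value can be stated exactly in MiB instead.
    (Hypothetical helper; the webui flag itself may only parse integers.)
    """
    if float(gib).is_integer():
        return f"{int(gib)}GiB"
    return f"{int(gib * 1024)}MiB"  # 3.5 GiB -> "3584MiB"

# This dict shape (GPU index and "cpu" keys) is what transformers'
# from_pretrained accepts as max_memory when device_map="auto".
max_memory = {0: gpu_memory_string(3.5), "cpu": "30GiB"}
print(max_memory)  # {0: '3584MiB', 'cpu': '30GiB'}
```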
Make sure to clear the context and use the exact same prompts/settings, preferably in a mode where you get the exact same response back, i.e. disable do_sample. Otherwise it gets really...
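A sketch of the kwargs that comparison would use with a Hugging Face model.generate() call — do_sample, num_beams, and max_new_tokens are real generate() parameters; the helper for comparing two runs is just an illustrative assumption:

```python
# With do_sample=False, generate() does greedy decoding (always the
# argmax token), so sampling knobs like temperature/top_p are ignored
# and two runs on the identical context should produce identical text.
deterministic = dict(
    do_sample=False,    # disable sampling -> greedy decoding
    num_beams=1,        # plain greedy, no beam search
    max_new_tokens=200,
)

def same_response(a: str, b: str) -> bool:
    """Hypothetical check: compare two generations, ignoring edge whitespace."""
    return a.strip() == b.strip()

print(deterministic["do_sample"])  # False
```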
Try the matmul kernels, they use less VRAM.
So now there is NeoX and GPT-J and regular GPTQ (OPT, LLaMA, BLOOM)... all separate repos with separate kernels. And I think the kernels may not work across versions and all require...
Transformers/Accelerate does this in CPU mode too? Ouch. edit: Hey, so a shot in the dark, did you try with https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176 ? That transformers build would use multiple cores for me for...
You can't use Hugging Face without generating a login token. You have to download those files manually.
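For the manual route, a small sketch of the Hub's direct-download URL pattern — the /resolve/&lt;revision&gt;/&lt;filename&gt; path is the Hub's standard raw-file endpoint; the repo id below is a placeholder, and for gated repos the request would additionally need an "Authorization: Bearer &lt;token&gt;" header:

```python
def hf_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for a file in a Hugging Face repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

# "some-org/some-model" is a placeholder repo id, not a real model.
print(hf_file_url("some-org/some-model", "config.json"))
# https://huggingface.co/some-org/some-model/resolve/main/config.json
```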