Shouyi
Are you interested in implementing this new algorithm in Llama.cpp? The performance with 3 bits seems amazing
This is a known Metal issue: the GPU cannot be allocated more than a limited share of memory. The current limit is approximately half of the available physical memory.
@CyborgArmy83 A fix may be possible in the future. In practice, CPU inference is not significantly slower, so your M2 Max is not a waste of money. The...
@CyborgArmy83 Yeah, for the M2 Max the GPU (38-core) is almost twice as fast, but on the base M1/M2 and the M1/M2 Pro, GPU and CPU inference speeds are about the same. Many...
@CyborgArmy83 Hey, could you give the latest code in the master branch a try and see if it solves your problem? While you're at it, could you also check the...
@CyborgArmy83 https://developer.apple.com/videos/play/tech-talks/10580/?time=546 Based on the video, it appears that 64GB Macs have 48GB (75%) of usable memory for the GPU. This should solve your problem. We still have issues because...
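In case it helps to verify this on your own machine, here is a minimal Swift sketch (not part of llama.cpp, just an illustration) that reads the GPU working-set budget Metal reports via `MTLDevice.recommendedMaxWorkingSetSize` and compares it to total physical RAM; on a 64GB Mac it should show roughly the 75% figure mentioned in the video:

```swift
import Metal
import Foundation

// Query how much memory Metal recommends for the GPU working set
// and compare it to the machine's total physical RAM.
if let device = MTLCreateSystemDefaultDevice() {
    let gpuBudget = device.recommendedMaxWorkingSetSize      // bytes usable by the GPU
    let physical = ProcessInfo.processInfo.physicalMemory    // total physical RAM in bytes
    let fraction = Double(gpuBudget) / Double(physical)
    print(String(format: "GPU working-set budget: %.1f GB of %.1f GB (%.0f%%)",
                 Double(gpuBudget) / 1e9,
                 Double(physical) / 1e9,
                 fraction * 100))
} else {
    print("No Metal device available")
}
```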
@CyborgArmy83 Can you please try the latest software and tell me the output? That helps a lot. Thank you so much!
Yes, I just tested it. Splitting a 33b model between two GPUs resulted in an additional 1.5GB of VRAM usage.
> Got it, I'll consider it later. I strongly need this feature. I originally didn't expect that much from translation software, but now that new tools like OpenAI Translator have implemented it, it has become indispensable.