Shouyi
Are you interested in implementing this new algorithm in Llama.cpp? The performance with 3 bits seems amazing
This is a known Metal issue: the GPU cannot be allocated more than a limited share of memory. The current limit is approximately half of the available physical memory.
@CyborgArmy83 A fix may be possible in the future. In practice, CPU inference is not significantly slower, so your M2 Max is not a waste of money. The...
@CyborgArmy83 Yeah, for the M2 Max the GPU (38-core) is almost twice as fast, but on the base M1/M2 and the M1/M2 Pro, GPU and CPU inference speeds are about the same. Many...
@CyborgArmy83 Hey, could you give the latest code in the master branch a try and see if it solves your problem? While you're at it, could you also check the...
@CyborgArmy83 https://developer.apple.com/videos/play/tech-talks/10580/?time=546 Based on the video, it appears that 64GB Macs have 48GB (75%) of usable memory for the GPU. This should solve your problem. We still have issues because...
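In case it helps to verify this on your own machine, here is a minimal Swift sketch (not part of llama.cpp, just an illustration) that reads the GPU working-set budget Metal reports via `MTLDevice.recommendedMaxWorkingSetSize` and compares it to total physical RAM; on a 64GB Mac it should show roughly the 75% figure mentioned in the video:

```swift
import Metal
import Foundation

// Query how much memory Metal recommends for the GPU working set
// and compare it to the machine's total physical RAM.
if let device = MTLCreateSystemDefaultDevice() {
    let gpuBudget = device.recommendedMaxWorkingSetSize      // bytes usable by the GPU
    let physical = ProcessInfo.processInfo.physicalMemory    // total physical RAM in bytes
    let fraction = Double(gpuBudget) / Double(physical)
    print(String(format: "GPU working-set budget: %.1f GB of %.1f GB (%.0f%%)",
                 Double(gpuBudget) / 1e9,
                 Double(physical) / 1e9,
                 fraction * 100))
} else {
    print("No Metal device available")
}
```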
@CyborgArmy83 Can you please try the latest software and tell me the output? That helps a lot. Thank you so much!
Yes, I just tested it. Splitting a 33b model between two GPUs resulted in an additional 1.5GB of VRAM usage.
> Got it, I'll consider it later. I strongly need this feature. I originally didn't expect that much from translation software, but now that new tools like OpenAI Translator have implemented it, it has become indispensable.