llama.cpp
[fixed] The last build with the memory fix does not run well on my PC.
It is obviously slower with the Q_1 30B model, and the memory usage has become garbage... (Linux 5.19 x64, Ubuntu base)
The -n flag seems to make a difference: n=100000 no longer works 🤔, but changing n to 4096 does not work either.
Rolling back to my backup...
Some of the last commits changed/touched how memory is handled.
There is also -c, which you can set up to 2048.
Kinda similar case here, although I'm unsure which specific commit caused the performance loss.
After swapping the old executable for the new one, I went from 207 ms per token to 269 on 13B Alpaca. I suspect the impact might be more noticeable on 30B and 65B models.
> Some of the last commits changed/touched how memory is handled. There is also -c, which you can set up to 2048.
Always. -c works fine even with 5000+ on my PC 😅, so I guess the problem is somewhere else.
> Kinda similar case here, although I'm unsure which specific commit caused the performance loss. After swapping the old executable for the new one, I went from 207 ms per token to 269 on 13B Alpaca. I suspect the impact might be more noticeable on 30B and 65B models.
Yes. With 30B I used to be able to chat for more than half an hour; now, after fewer than 20 tokens, ggml reports that there is not enough memory.
@FNsi please try again with latest master.
> @FNsi please try again with latest master.
Disabling BLAS makes it work, though there is still a little performance loss anyway.
A guess from me: is it because BLAS tries to turn the 4-bit prompt back into 16-bit at run time...? 😅😂
It's not the 4-bits, it does not work with F16 either. I am almost sure that this F16 BLAS call is somehow wrong (as well as the rest of them):
https://github.com/ggerganov/llama.cpp/blob/8520fc310eab87f2c4612f2a00d4adbd44a20d0d/ggml.c#L6244-L6250
Which is super strange, since this has been used in whisper.cpp forever and it seems to work.
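For context, the linked region is the F16 branch of ggml's matrix multiply when a BLAS backend is enabled: the F16 weights are converted to F32 into a scratch buffer and a single cblas_sgemm does the multiply. Below is a minimal standalone sketch of that pattern, not the actual ggml.c code; the function name, buffer handling, and shapes are illustrative only.

```c
// Sketch of the "convert F16 to F32, then call sgemm" pattern discussed above.
// Illustrative only; names and shapes are made up for this example.
#include <stdlib.h>
#include <cblas.h>   // cblas_sgemm (OpenBLAS / Accelerate)
#include "ggml.h"    // ggml_fp16_t, ggml_fp16_to_fp32

// src0: [m x k] weights stored as F16, src1: [n x k] activations in F32.
// dst:  [n x m] result in F32, computed as dst = src1 * src0^T.
static void mul_mat_f16_blas(const ggml_fp16_t * src0, const float * src1,
                             float * dst, int m, int n, int k) {
    // BLAS works on F32, so the F16 weights are first expanded into a
    // temporary F32 work buffer at run time.
    float * wdata = malloc((size_t)m * k * sizeof(float));
    for (size_t i = 0; i < (size_t)m * k; i++) {
        wdata[i] = ggml_fp16_to_fp32(src0[i]);
    }

    // One sgemm over the whole matrix: dst(n x m) = src1(n x k) * wdata(m x k)^T
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                n, m, k,
                1.0f, src1,  k,
                      wdata, k,
                0.0f, dst,   m);

    free(wdata);
}
```

The per-call conversion into a scratch buffer is the run-time F16-to-F32 cost hinted at above; it tends to pay off only when the batch (e.g. the prompt) is large enough for sgemm to win back the conversion overhead.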
> It's not the 4-bits, it does not work with F16 either. I am almost sure that this F16 BLAS call is somehow wrong:
> https://github.com/ggerganov/llama.cpp/blob/8520fc310eab87f2c4612f2a00d4adbd44a20d0d/ggml.c#L6244-L6250
> Which is super strange, since this has been used in whisper.cpp forever and it seems to work.
So it seems that a big performance improvement is coming once this gets figured out.