
[fixed] The latest build with the memory fix performs poorly on my PC.

FNsi opened this issue 1 year ago · 10 comments

It's obviously slower with the Q4_1 30B model, and the memory usage has become garbage... (Linux 5.19 x64, Ubuntu base)

FNsi · Mar 24 '23 14:03

Seems the -n flag makes a difference: -n 100000 no longer works 🤔. But after changing it to 4096, it still doesn't work...

FNsi · Mar 24 '23 14:03

> Seems the -n flag makes a difference: -n 100000 no longer works 🤔. But after changing it to 4096, it still doesn't work.

Rolling back to my backup...

FNsi · Mar 24 '23 14:03

Some of the last commits changed/touched how memory is handled. Also, there is -c, which you can set up to 2048.

Green-Sky · Mar 24 '23 14:03
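
(For context: `-n`/`--n_predict` caps how many tokens are generated, and `-c`/`--ctx_size` sets the context window. In builds from around that time a run looked something like `./main -m ./models/30B/ggml-model-q4_1.bin -c 2048 -n 256 -p "your prompt"`; the model path here is illustrative.)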

Kinda similar case here, although I'm unsure which specific commit caused the performance loss.

After swapping out the old exe for the new one, I went from 207 ms per token to 269 on 13B Alpaca. I suspect the impact might be more noticeable on the 30B and 65B models.

x02Sylvie · Mar 24 '23 14:03
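
(For scale: 269/207 ≈ 1.30, so that swap cost roughly a 30% slowdown per token.)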

> Some of the last commits changed/touched how memory is handled.
>
> Also, there is -c, which you can set up to 2048.

As always, -c works fine even with 5000+ on my PC 😅, so I guess the problem is somewhere else.

FNsi · Mar 24 '23 14:03

> Kinda similar case here, although I'm unsure which specific commit caused the performance loss.
>
> After swapping out the old exe for the new one, I went from 207 ms per token to 269 on 13B Alpaca.
>
> I suspect the impact might be more noticeable on the 30B and 65B models.

Yes. With the 30B model I could chat for more than half an hour; now, after fewer than 20 tokens, ggml reports that it is out of memory.

FNsi · Mar 24 '23 14:03

@FNsi, please try again with the latest master.

Green-Sky · Mar 24 '23 21:03

> @FNsi, please try again with the latest master.

Disabling BLAS makes it work, though there's still a little performance loss anyway.

A guess from me: is it because BLAS tries to convert the 4-bit weights back to 16-bit at run time...? 😅😂

FNsi · Mar 25 '23 03:03
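
(Background on that guess: ggml's q4_1 format stores each block of 32 weights as a float scale, a float minimum, and packed 4-bit quants, and any BLAS path does have to expand those back to floats before an sgemm can run. Below is a minimal sketch of that expansion; field and block names follow ggml.c of that era, but treat the details as illustrative rather than the exact code.)

```c
#include <stdint.h>

#define QK 32  /* elements per quantization block (ggml's block size at the time) */

/* q4_1 block: a scale, a minimum, and 32 packed 4-bit quants */
typedef struct {
    float   d;          /* scale   */
    float   m;          /* minimum */
    uint8_t qs[QK / 2]; /* two 4-bit values per byte */
} block_q4_1;

/* Expand one block back to 32 floats: x = q * d + m.
 * A BLAS code path has to do this kind of expansion for the whole
 * weight matrix before it can call sgemm, which is extra runtime work
 * the non-BLAS quantized kernels avoid. */
static void dequantize_block_q4_1(const block_q4_1 *b, float *x) {
    for (int i = 0; i < QK / 2; i++) {
        const uint8_t byte = b->qs[i];
        x[2 * i + 0] = (byte & 0x0F) * b->d + b->m;
        x[2 * i + 1] = (byte >>   4) * b->d + b->m;
    }
}
```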

It's not the 4-bit quantization - it does not work with F16 either. I am almost sure that this F16 BLAS call is somehow wrong (as well as the rest of them):

https://github.com/ggerganov/llama.cpp/blob/8520fc310eab87f2c4612f2a00d4adbd44a20d0d/ggml.c#L6244-L6250

Which is super strange, since this has been used in whisper.cpp forever and it seems to work.

ggerganov · Mar 25 '23 04:03
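
(For readers without the permalink handy: the linked region is one of ggml's BLAS fast paths, where F16 weights are first converted to F32 scratch memory and then a single sgemm produces the output. Below is a minimal paraphrase of its shape; the dimension names ne00/ne01/ne11 follow ggml's convention, but this is a sketch, not the literal code.)

```c
#include <cblas.h>

/* Rough shape of ggml's F16 BLAS matmul path (paraphrased):
 * A is ne01 x ne00 (weights, already converted from F16 to F32),
 * B is ne11 x ne00 (activations), C is ne11 x ne01. */
static void mul_mat_f16_blas(const float *a_f32, /* weights converted to F32 */
                             const float *b,     /* F32 activations */
                             float       *c,     /* F32 output */
                             int ne00, int ne01, int ne11) {
    /* C = 1.0 * B * A^T + 0.0 * C  (row-major; B not transposed, A transposed) */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ne11, ne01, ne00,
                1.0f, b,     ne00,
                      a_f32, ne00,
                0.0f, c,     ne01);
}
```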

> It's not the 4-bit quantization - it does not work with F16 either.
>
> I am almost sure that this F16 BLAS call is somehow wrong:
>
> https://github.com/ggerganov/llama.cpp/blob/8520fc310eab87f2c4612f2a00d4adbd44a20d0d/ggml.c#L6244-L6250
>
> Which is super strange, since this has been used in whisper.cpp forever and it seems to work.

So that seems to mean a big performance improvement is coming once this is figured out.

FNsi · Mar 25 '23 04:03