
Extremely slow response time from Prompts

Open Bigcow11 opened this issue 1 year ago • 19 comments

Has anyone else had issues with the response taking forever on alpaca? I have the 13B version installed and operational; however, when prompted for output, the response is extremely slow.

For example: 5+ minutes to output text in response to the sample question "What are lists in python?". This also happens in the terminal, so it is not isolated to the web GUI.

The hardware is more than sufficient to run it as well.

Bigcow11 avatar Mar 22 '23 09:03 Bigcow11

Also have this issue, but IDK if this is normal or not.

DartPower avatar Mar 22 '23 11:03 DartPower

> Also have this issue, but IDK if this is normal or not.

I don't believe it's normal; the README page includes a .gif of the speed, which matches ChatGPT's speed.

Bigcow11 avatar Mar 22 '23 21:03 Bigcow11

Same here. Running on a 16-core AMD CPU and it is extremely slow.

freezah avatar Mar 22 '23 22:03 freezah

What hardware? That's likely the only issue I can think of. I'm running 7B, 13B, and 30B on 32 GB RAM and a beefy CPU. 7B is snappy, 13B is still fast, and 30B takes a couple of minutes to output full answers.

trevtravtrev avatar Mar 23 '23 03:03 trevtravtrev

@trevtravtrev I'm using an AMD 5950X (32 threads) with 128 GB of RAM. I tried with 7B as well and it is also super slow. Even when I run the "chat" with the -t 32 parameter, I see mostly one thread being used at 100% while the rest is basically idling.

freezah avatar Mar 23 '23 07:03 freezah
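A minimal way to double-check the single-busy-thread observation from the command line, assuming standard Linux procps tools and that the binary is still named chat (both assumptions, not confirmed for this setup):

# Show per-thread CPU usage for the running chat process; if only one thread
# sits near 100% while the rest idle, generation is effectively
# single-threaded despite the -t setting.
top -H -p "$(pgrep -n chat)"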

Same with the basic 7B model

pnadj avatar Mar 23 '23 13:03 pnadj

+1. Debian VM, 16 cores & 64 GB memory. uname: 6.1.0-kali5-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.12-1kali2 (2023-02-23) x86_64 GNU/Linux

alpaca.cpp $ ./chat -t 16 -m ggml-alpaca-7b-q4.bin --interactive-start
main: seed = 1679691725
llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
llama_model_load: ggml ctx size = 6065.34 MB
llama_model_load: memory_size =  2048.00 MB, n_mem = 65536
llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


== Running in chat mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMA.
 - If you want to submit another line, end your input in '\'.
>

ninp0 avatar Mar 23 '23 14:03 ninp0
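One detail worth noting in the dump above: the system_info line reports AVX2 = 0 and FMA = 0. A quick, hedged check (assuming a Linux guest; whether the VM passes these flags through is not confirmed here) of which SIMD extensions the virtual CPU actually exposes, since ggml's quantized kernels are generally much slower without AVX2/FMA:

# List the SIMD extensions visible inside the VM; compare against the
# system_info line printed by chat.
grep -o -w -e avx -e avx2 -e fma -e avx512f /proc/cpuinfo | sort -u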

@freezah I would suggest using 30 threads, not 32. It is better to keep a few available for other processes, depending on your system load.

KaruroChori avatar Mar 23 '23 19:03 KaruroChori
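As a small sketch of that suggestion (the -t flag and model filename are taken from the commands earlier in this thread; the core-count arithmetic is only illustrative):

# Use all logical cores except two, leaving some headroom for the OS.
./chat -m ggml-alpaca-7b-q4.bin -t $(( $(nproc) - 2 ))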

@KaruroChori I totally agree. I used 32 just to verify whether the application would work better than with the default 4 threads.

I just ran it with 30 threads and am observing the same issue. Even though I see 30 forked processes in 'htop', only one of them consumes 100% of a single thread; the rest consume 3.3-4.0% on average.

Model loading is quick, but it has now been 10 minutes and the prompt still hasn't shown up:

llama_model_load: loading model from './ggml-alpaca-7b-q4.bin' - please wait ...
llama_model_load: ggml ctx size = 6065.34 MB
llama_model_load: memory_size =  2048.00 MB, n_mem = 65536
llama_model_load: loading model part 1/1 from './ggml-alpaca-7b-q4.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 30 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

== Running in chat mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMA.
  • If you want to submit another line, end your input in '\'.

freezah avatar Mar 23 '23 19:03 freezah

> What hardware? That's likely the only issue I can think of. I'm running 7B, 13B, and 30B on 32 GB RAM and a beefy CPU. 7B is snappy, 13B is still fast, and 30B takes a couple of minutes to output full answers.

7B runs slow too. I have a 24-thread CPU with 128 GB of RAM and dedicated GPUs, so it's not the hardware bottlenecking. There's an issue with the model.

Bigcow11 avatar Mar 24 '23 08:03 Bigcow11

Have we found any solutions or successful workarounds?

Bigcow11 avatar Mar 24 '23 08:03 Bigcow11

I tested both the 7B and the 30B on a Xeon 2650Lv4 with 128 GB of RAM and on a Ryzen 5950X with 64 GB of RAM. Performance is comparable between the two platforms (the Ryzen is roughly 2-3 times faster) and usable. I am not sure why some people are getting extremely slow generation. When the process is running, all cores should be around 100%, not just one.

KaruroChori avatar Mar 24 '23 11:03 KaruroChori

@KaruroChori

Which host distro, version, and kernel are you using?

freezah avatar Mar 24 '23 12:03 freezah

My uname -a:

Linux [redacted] 5.15.0-40-generic #43-Ubuntu SMP Wed Jun 15 12:54:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Linux [redacted] 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

KaruroChori avatar Mar 24 '23 13:03 KaruroChori

Same here. I have an AMD A10-6700 CPU and 32 GB RAM. I tried running it on Ubuntu 20.04.6 and under WSL2. My guess is that some speed optimizations like AVX or NEON do not work with AMD processors.

Do AMD processors have other, similar speedup flags that should be used instead?

kaiserfr avatar Mar 25 '23 15:03 kaiserfr
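One hedged thing to try is rebuilding so the compiler targets the local CPU and enables whatever SIMD extensions it actually supports. -march=native is a standard GCC/Clang flag, but whether alpaca.cpp's Makefile honors CFLAGS/CXXFLAGS overrides like this is an assumption, not something verified against the repo:

# Rebuild with the compiler tuned to the host CPU so supported extensions
# (AVX, AVX2, FMA, ...) are enabled automatically.
make clean
make CFLAGS="-O3 -march=native" CXXFLAGS="-O3 -march=native"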

Same problem here

Santiagoc57 avatar Mar 26 '23 17:03 Santiagoc57

I am getting this too with 20 cores on twin Xeon X5675s and 12 GB of memory. This btop screen dump suggests it's CPU-bound and that disk & memory aren't the cause of the bottleneck. So, given the claims that it runs happily on a laptop with no GPU, I'm assuming something is amiss.

[btop screenshot]

hunterdrayman avatar Mar 27 '23 11:03 hunterdrayman

Just for a laugh I asked Google Bard how many cores are needed to run a 7B model:

A 7B model like Alpaca would need a cluster of multiple cores to run at a good conversational speed. The exact number of cores would depend on the specific hardware and software configuration, but it would likely be in the range of hundreds or even thousands of cores.

This is because 7B models are very large and complex, and they require a lot of processing power to run. Even with a powerful computer, it can take several minutes for a 7B model to generate a single response.

However, the speed of a 7B model is not the only important factor. The quality of the responses is also important, and 7B models are known to generate very high-quality responses. This makes them a good choice for applications where accuracy and fluency are important, such as customer service or education.

hunterdrayman avatar Mar 28 '23 15:03 hunterdrayman

Same problem here. Honestly, my laptop is 14 years old, but it's surely not as slow as a Raspberry Pi, I guess. Intel Core i7-2630QM 2 GHz quad core, 16 GB RAM, NVIDIA GeForce GT 540M (I don't know if the chat uses the GPU). :(

RiccaDS avatar Mar 29 '23 12:03 RiccaDS