
Why is it so slow?

sducxh opened this issue 2 years ago • 3 comments

My server information:

- CPU: 32 cores
- GPU: 2x Nvidia 3080 10G
- RAM: 64G
- Docker version: 24.0.2
- NVIDIA-SMI: 520.61.05
- Driver Version: 520.61.05
- CUDA Version: 11.8

I deployed it using Docker with `--model code-13b` and `--with-cuda`. I changed the base image to `11.8.0-devel-ubuntu22.04` and set `devices.count=2`. In my testing, each token takes about 1 minute to generate, so I want to know why it's so slow and what adjustments I can make.

sducxh avatar Aug 28 '23 07:08 sducxh

Super slow here too.

Running the 7b model on an 8-core Xeon with 24G of RAM.

I installed it with UmbrelOS in a Debian 12 LXC on Proxmox.

omiinaya avatar Aug 28 '23 15:08 omiinaya

It's slow here too!

EZTTU avatar Sep 04 '23 08:09 EZTTU

@sducxh - two things:

- In `/cuda/run.sh`, the key value to adjust for speed is `n_gpu_layers`. The hard-coded default of 10 is too low for beefy graphics cards; setting it to 40 for my 3080 Ti made a huge improvement. Try incrementing it in steps of 5 and restarting/retesting each time.
- I don't have a second GPU to verify that this splits the load across multiple cards, but in `docker-compose-cuda-gguf.yml`, try setting `count` to 2 and watch your GPU usage.
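The second suggestion can be sketched as a Compose fragment using Docker Compose's standard NVIDIA device-reservation syntax. This is a hedged sketch, not the file's actual contents: the service name `llama-gpt-api-cuda-gguf` is an assumption, and the real `docker-compose-cuda-gguf.yml` in your checkout likely already has most of this structure, so only `count` should need changing:

```yaml
# Hypothetical excerpt; compare against the actual docker-compose-cuda-gguf.yml.
services:
  llama-gpt-api-cuda-gguf:        # service name is an assumption
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2            # expose both 3080s instead of one
              capabilities: [gpu]
```

After restarting, `nvidia-smi` should show memory allocated on both cards if the load is actually being split.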

arch1v1st avatar Sep 23 '23 23:09 arch1v1st