llama-gpt
                        Why is it so slow?
My server information
CPU: 32 cores
GPU: 2x Nvidia RTX 3080 10 GB
RAM: 64 GB
Docker version 24.0.2
NVIDIA-SMI 520.61.05
Driver Version: 520.61.05
CUDA Version: 11.8
I deployed it using Docker with --model code-13b and --with-cuda. I changed the base image to 11.8.0-devel-ubuntu22.04 and set devices.count=2. In my testing, each token takes about 1 minute to generate, so I want to know why it's so slow and what adjustments I can make.
Super slow here too.
Running the 7b model on a 24G ram 8-core Xeon.
I installed it using UmbrelOS on a proxmox debian 12 LXC.
It's slow here too!
@sducxh - two things:
- In /cuda/run.sh, the key value to adjust to speed things up is n_gpu_layers. The hard-coded default of 10 is too low for beefy graphics cards; setting this to 40 for my 3080 Ti made a huge improvement. Try incrementing it in steps of 5 and restart/retest.
- I don't have a second GPU to test whether this helps split the load across multiple cards, but in docker-compose-cuda-gguf.yml, try setting 'count' to 2 and study your GPU usage.
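For reference, the multi-GPU change above would look roughly like this in docker-compose-cuda-gguf.yml. This is a sketch, not the exact file from the repo: the service name is an assumption, and your file layout may differ between llama-gpt versions, but the deploy/reservations block follows the standard Docker Compose GPU syntax:

```yaml
# docker-compose-cuda-gguf.yml (sketch) -- expose both GPUs to the container
services:
  llama-gpt-api:            # service name may differ in your compose file
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2      # was 1; request both 3080s
              capabilities: [gpu]
```

After changing count, run nvidia-smi while generating tokens to confirm both cards actually show memory and compute usage; if only one is active, the layer split did not take effect.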