
CUDA runtime error and slow eval

huichen opened this issue 2 years ago • 3 comments

Using commit a09f919 and compiled with:

make clean && LLAMA_CUBLAS=1 LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=2 make -j

Running with this command on 4x A40 48 GB GPUs:

./main -m ggml-vic13b-q5_1.bin -ngl 1000 -p "the meaning of life is" -t 8 -c 2048

I got this error:

CUDA error 1 at ggml-cuda.cu:2292: invalid argument

Full output:

llama_model_load_internal: format     = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA A40) as main device
llama_model_load_internal: mem required  = 2165.28 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 11314 MB
....................................................................................................
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 8 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0


 the meaning of life is to enjoy it.
Author: Epicurus
62. “Life is a journey, and like all journeys, it must come to an end. But the memories we make along the way will live on forever.”
63. “The biggest adventure you can ever take is to live the life of your dreams.”
Author: Oprah Winfrey
64. “Life is like a camera, focus on the good times, develop from the negatives, and keep shooting.”
Author: Unknown (often attributed to Tommy De Senna)
65. “The purpose of our lives is to be happy.”
Author: Dalai Lama [end of text]

llama_print_timings:        load time =  2831.64 ms
llama_print_timings:      sample time =    70.88 ms /   144 runs   (    0.49 ms per token)
llama_print_timings: prompt eval time =   184.13 ms /     6 tokens (   30.69 ms per token)
llama_print_timings:        eval time =  6122.05 ms /   143 runs   (   42.81 ms per token)
llama_print_timings:       total time =  6421.17 ms
CUDA error 1 at ggml-cuda.cu:2292: invalid argument

Also, prompt eval time with a long prompt became much longer: ~12 ms/token vs. ~3 ms/token a few days ago.

huichen avatar Jun 16 '23 04:06 huichen

Not sure about the error, but does setting threads to 1 improve your performance? When the model is offloaded to the GPU, the CPU is actually blocking more than helping.
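For example, you could rerun the exact command from above with only the thread count changed (a sketch of the suggestion, not a confirmed fix):

./main -m ggml-vic13b-q5_1.bin -ngl 1000 -p "the meaning of life is" -t 1 -c 2048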

Azeirah avatar Jun 16 '23 19:06 Azeirah

> Not sure about the error, but does setting threads to 1 improve your performance? When the model is offloaded to the GPU, the CPU is actually blocking more than helping.

Setting threads to 1 degrades performance; in my case, threads=8 is optimal.

huichen avatar Jun 19 '23 03:06 huichen

The assert error is fixed by #2005.

As for the performance: if the model fits on a single GPU, consider exposing only one GPU to llama.cpp by setting CUDA_VISIBLE_DEVICES=0. The latest code has multi-GPU support, which may be the cause of your slowdown.
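For example, a sketch of that suggestion applied to the original command (the device index 0 is an assumption; use whichever GPU you want llama.cpp to see):

CUDA_VISIBLE_DEVICES=0 ./main -m ggml-vic13b-q5_1.bin -ngl 1000 -p "the meaning of life is" -t 8 -c 2048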

howard0su avatar Jun 26 '23 15:06 howard0su

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 10 '24 01:04 github-actions[bot]