System freeze when compiled with cuBLAS
When running main compiled with cuBLAS in the newest release (305eb5a), everything works fine until right before returning to the command prompt. The timing info pops up, and then my system completely freezes for about 20 seconds.
Release b1ee8f5 is working fine.
I cannot reproduce this on my system without more details.
With a 7B model the freeze lasts about 5 seconds, with a 30B model about 20 seconds. I tried using --no-mmap with the 30B model and the system froze for 5 minutes(!) right before displaying the system_info...
I think the problem is the recent change to use cuBLAS pinned host memory, as the freeze seems to happen when that memory is initialized or freed.
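For reference, here is a minimal sketch of what pinned host allocation and release look like with the CUDA runtime; the size and names are illustrative only, this is not the actual ggml-cuda code. Because pinned pages cannot be swapped out, allocating (and later releasing) a buffer close to the size of physical RAM can force everything else into swap:

// Hypothetical illustration of pinned host memory with the CUDA runtime;
// not the actual llama.cpp/ggml-cuda code, the size is made up.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // roughly the "mem required" reported in the logs below
    const size_t size = 25ull * 1024 * 1024 * 1024;

    void * buf = nullptr;
    // cudaMallocHost returns page-locked (pinned) memory that the kernel may
    // not swap out; if free RAM is tight, the OS has to push everything else
    // to swap first, which can stall the whole machine.
    cudaError_t err = cudaMallocHost(&buf, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "pinned alloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... use the buffer for host<->device transfers ...

    // Releasing a huge pinned region is not instant either; the freeze
    // described above seems to happen around this point or right after exit.
    cudaFreeHost(buf);
    return 0;
}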
The prompt eval time is also about 2.5 times slower:
Release 305eb5a output:
./main -m ../llama-33b-supercot-ggml-q5_1.bin -c 2048 -p "Hiking is" -n 16 -t 6
main: seed = 1682775656
llama.cpp: loading model from ../llama-33b-supercot-ggml-q5_1.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 110.30 KB
llama_model_load_internal: mem required = 25573.12 MB (+ 3124.00 MB per state)
llama_init_from_file: kv self size = 3120.00 MB
(system freeze for 5 min with --no-mmap)
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0
Hiking is one of the best ways to explore and experience a destination. From leisure
llama_print_timings: load time = 7378.95 ms
llama_print_timings: sample time = 10.95 ms / 16 runs ( 0.68 ms per run)
llama_print_timings: prompt eval time = 5005.25 ms / 5 tokens ( 1001.05 ms per token)
llama_print_timings: eval time = 9425.01 ms / 15 runs ( 628.33 ms per run)
llama_print_timings: total time = 16818.56 ms
(system freeze for 20 sec with mmap)
Release b1ee8f5 output:
./main -m ../llama-33b-supercot-ggml-q5_1.bin -c 2048 -p "Hiking is" -n 16 -t 6
main: seed = 1682775744
llama.cpp: loading model from ../llama-33b-supercot-ggml-q5_1.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 110.30 KB
llama_model_load_internal: mem required = 25573.12 MB (+ 3124.00 MB per state)
llama_init_from_file: kv self size = 3120.00 MB
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0
Hiking is one of the best ways to experience nature. There’s nothing quite like tre
llama_print_timings: load time = 3253.56 ms
llama_print_timings: sample time = 9.46 ms / 16 runs ( 0.59 ms per run)
llama_print_timings: prompt eval time = 1972.92 ms / 5 tokens ( 394.58 ms per token)
llama_print_timings: eval time = 9242.09 ms / 15 runs ( 616.14 ms per run)
llama_print_timings: total time = 12505.32 ms
I can't really do much about it if I am not able to reproduce it. Some searching indicates that others were able to solve this by reinstalling everything from scratch.
Thanks. So it seems to be related to Ubuntu and/or AMD CPUs. I'm running Ubuntu 20.04 and have an AMD Ryzen 5 CPU.
I found out what the problem is: the model does not fit into RAM. With the b1ee8f5 release it works even if the model doesn't fit in RAM, but with the new 305eb5a release and cuBLAS pinned host memory my system freezes completely. I suggest implementing a memory check at startup to determine whether this new mode should be enabled or not.
The model must fit into RAM to use pinned memory at all, since this is memory that cannot be swapped. I can see this happening if you have just barely enough memory to fit the model and everything else is forced into swap; but if that were the case, I would expect the slow operation to be the alloc, not the free. Maybe the system is just slowly paging the shell's memory back in from swap after the program exits.
I am not convinced that we should do anything about it either way, or even that we can do anything about it. If you run a program that requires more memory than your system has, it is not unexpected that things will fail or run very slowly.
In any case, I am open to suggestions about how to handle this; just checking whether there is enough memory is not nearly as easy as you are implying here.
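As a rough illustration of why: a naive check might read MemAvailable from /proc/meminfo, as in the hypothetical sketch below (not part of llama.cpp). But that is Linux-only, the value changes constantly, it ignores swap and overcommit, and it says nothing about whether pinning that much memory is actually safe.

// Hypothetical sketch of the kind of startup check being suggested; not llama.cpp code.
#include <cstdio>

static long long mem_available_kb() {
    // Parse the "MemAvailable: N kB" line from /proc/meminfo (Linux only).
    FILE * f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    char line[256];
    long long kb = -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "MemAvailable: %lld kB", &kb) == 1) break;
    }
    fclose(f);
    return kb;
}

int main() {
    const long long model_kb = 25573ll * 1024; // "mem required" from the log above
    const long long avail_kb = mem_available_kb();
    if (avail_kb >= 0 && avail_kb < model_kb) {
        fprintf(stderr, "warning: model may not fit in available RAM; "
                        "pinned host memory could freeze the system\n");
    }
    return 0;
}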
Maybe add a parameter to not use pinned memory, as the previous version did work fine with swapped memory.
I have added an environment variable GGML_CUDA_NO_PINNED in PR #1233 that you can set to disable pinned memory.
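Conceptually, such a switch amounts to something like the sketch below; the function name here is made up, and the actual change is in PR #1233:

// Rough illustration of gating pinned allocation on an environment variable;
// hypothetical code, see PR #1233 for the real implementation.
#include <cuda_runtime.h>
#include <cstdlib>

void * host_buffer_alloc(size_t size) {
    // If GGML_CUDA_NO_PINNED is set, use ordinary pageable memory,
    // which the OS is free to swap out.
    if (getenv("GGML_CUDA_NO_PINNED") != nullptr) {
        return malloc(size);
    }
    void * ptr = nullptr;
    if (cudaMallocHost(&ptr, size) != cudaSuccess) {
        return malloc(size); // pinned allocation failed; fall back rather than abort
    }
    return ptr;
}

With something like that in place, running e.g. GGML_CUDA_NO_PINNED=1 ./main -m ../llama-33b-supercot-ggml-q5_1.bin ... keeps the host buffers pageable, so the rest of the system can still swap.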
Great! :)