llama.cpp
Segfault with 65B model
This is the output with `-fsanitize=address`:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==167666==ERROR: AddressSanitizer: SEGV on unknown address 0x558c0562c438 (pc 0x558a27cc9807 bp 0x000000000000 sp 0x7ffeb2f57310 T0)
==167666==The signal is caused by a READ memory access.
#0 0x558a27cc9807 in ggml_element_size (/home/mattmcal/repos/llama.cpp/main+0x49807)
#1 0x558a27c9c03c in llama_eval(llama_model const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<float, std::allocator<float> >&, unsigned long&) (/home/mattmcal/repos/llama.cpp/main+0x1c03c)
#2 0x558a27c960fb in main (/home/mattmcal/repos/llama.cpp/main+0x160fb)
#3 0x7fe45e046189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#4 0x7fe45e046244 in __libc_start_main_impl ../csu/libc-start.c:381
#5 0x558a27c9b1a0 in _start (/home/mattmcal/repos/llama.cpp/main+0x1b1a0)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/home/mattmcal/repos/llama.cpp/main+0x49807) in ggml_element_size
I had to increase `ctx_size`, otherwise I got this error:
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 33373704448, available 33292002560)
Is GGML trying to use more RAM than it malloc'd?
`ggml` buffers are preallocated with a fixed mem size. If you run out of the buffer during inference, you get this error.
It's very possible that for some parameters, the mem size is not enough. This will be improved over time.
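To illustrate the mechanism: a ggml context is created once with a fixed `mem_size`, and every tensor is carved out of that single buffer. A minimal sketch of the pattern (the sizes here are made up, and the comments paraphrase the behavior of `ggml_new_tensor_impl` rather than quoting it):

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // The whole pool is sized up front; ggml never grows it later.
    struct ggml_init_params params = {
        .mem_size   = 512ull*1024*1024, // made-up size, for illustration only
        .mem_buffer = NULL,             // let ggml malloc the pool itself
    };
    struct ggml_context * ctx = ggml_init(params);

    // Every tensor created against this context advances an offset inside
    // the pool. Once the offset would pass mem_size, ggml_new_tensor_impl
    // reports "not enough space in the context's memory pool" instead of
    // reallocating, so the estimate has to be generous enough up front.
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8192, 1024);
    printf("one f32 tensor of that shape takes %zu bytes\n", ggml_nbytes(t));

    ggml_free(ctx);
    return 0;
}
```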
Can you provide the parameters for which you get this error?
Basically, this fails if I increase `n_ctx` beyond the default 512, which I can tell isn't fully supported. I increased the `mem_size` allocated by ggml by adding to `ctx_size`, but it still uses more memory than was allocated, without printing any warning or error messages.
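Roughly, that change amounts to padding the size estimate that gets handed to `ggml_init`, along these lines (a sketch of the idea only, not the actual diff; the extra 1 GiB is an arbitrary number):

```c
// Sketch only: llama_model_load estimates ctx_size by summing the sizes of
// the model tensors, then passes it to ggml_init as mem_size.
size_t ctx_size = 0;
/* ... per-tensor size accumulation elided ... */

// Arbitrary extra headroom so a larger n_ctx doesn't overrun the pool
// mid-inference; this only hides the problem rather than fixing the estimate.
ctx_size += 1024ull*1024*1024;

struct ggml_init_params params = {
    .mem_size   = ctx_size,
    .mem_buffer = NULL,
};
struct ggml_context * ctx = ggml_init(params);
```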
These parameters
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 1024
llama_model_load: n_embd = 8192
llama_model_load: n_mult = 256
llama_model_load: n_head = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 68613.73 MB
llama_model_load: memory_size = 5120.00 MB, n_mem = 81920
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.150000
actually cause a null dereference partway through inference:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==27991==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x55e82aebc227 bp 0x7ffddaffb650 sp 0x7ffddaffb640 T0)
==27991==The signal is caused by a WRITE memory access.
==27991==Hint: address points to the zero page.
#0 0x55e82aebc227 in quantize_row_q4_0 (/home/mattmcal/repos/llama.cpp/main+0x44227)
#1 0x55e82aebcacb in ggml_compute_forward_mul_mat_q4_0_f32 (/home/mattmcal/repos/llama.cpp/main+0x44acb)
#2 0x55e82aecd36c in ggml_compute_forward (/home/mattmcal/repos/llama.cpp/main+0x5536c)
#3 0x55e82aeda061 in ggml_graph_compute (/home/mattmcal/repos/llama.cpp/main+0x62061)
#4 0x55e82ae94540 in llama_eval(llama_model const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<float, std::allocator<float> >&, unsigned long&) (/home/mattmcal/repos/llama.cpp/main+0x1c540)
#5 0x55e82ae8e5b0 in main (/home/mattmcal/repos/llama.cpp/main+0x165b0)
#6 0x7f9a9c646189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#7 0x7f9a9c646244 in __libc_start_main_impl ../csu/libc-start.c:381
#8 0x55e82ae931a0 in _start (/home/mattmcal/repos/llama.cpp/main+0x1b1a0)
Related discussion: https://github.com/ggerganov/llama.cpp/issues/71
@matthew-mcallister Can you try again with the code from master (which is now using mmap to load the weights)?
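For context, "using mmap" here means the weight file is mapped read-only and tensor data points straight into the mapping, so pages are faulted in on demand instead of being copied into a preallocated buffer. A generic POSIX sketch of that pattern (not the actual llama.cpp loader code):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char * path = "./models/65B/ggml-model-q4_0.bin";

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; nothing is read from disk until the
    // corresponding pages are actually touched during inference.
    void * data = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    printf("mapped %lld bytes at %p\n", (long long) st.st_size, data);

    munmap(data, (size_t) st.st_size);
    close(fd);
    return 0;
}
```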
~It still segfaults after 512 tokens.~
EDIT: Hold on, I might be mistaken. I haven't finished converting all the tensors yet.
OK, this works now. Fantastic, thanks for the update!