
Bloomz 176B inference doesn't work

Open agemagician opened this issue 2 years ago • 9 comments

Hello,

I have converted the bloomz 176B model successfully, but inference doesn't work.

 ./main -m ./models/ggml-model-bloomz-f16.bin -t 8 -n 128
main: seed = 1679167152
bloom_model_load: loading model from './models/ggml-model-bloomz-f16.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 14336
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 112
bloom_model_load: n_layer = 70
bloom_model_load: f16     = 1
bloom_model_load: n_ff    = 57344
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 333257.61 MB
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 349847586752, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 349847931776, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351081229760, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351081459328, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 350670590144, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 349848678784, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351081976768, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351082206336, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351493305664, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351493305664, available 349445931264)
Segmentation fault (core dumped)

I have enough CPU memory (420GB). Any idea what the issue is?

agemagician avatar Mar 18 '23 19:03 agemagician
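(Editorial note on what that error actually means; this is an interpretation, not from the thread. ggml allocates one fixed arena up front, the "ggml ctx size = 333257.61 MB" line, and every ggml_new_tensor carves from it, so "not enough space in the context's memory pool" points at an undersized estimate in the loader rather than exhausted system RAM. The shortfall in the log is only about 0.4GB: 349,847,586,752 bytes needed vs 349,445,931,264 available. A minimal sketch of the sizing pattern, assuming bloomz.cpp follows the llama.cpp-style loader; the shapes and the per-tensor overhead constant are illustrative assumptions, not the project's actual code:)

    // Sketch: size the single ggml arena before loading weights.
    // n_vocab, n_embd, n_layer, n_ff come from the model hyperparameters.
    size_t ctx_size = 0;
    ctx_size += (size_t) n_vocab * n_embd            * ggml_type_sizef(GGML_TYPE_F16); // token embeddings
    ctx_size += (size_t) n_layer * 4 * n_embd*n_embd * ggml_type_sizef(GGML_TYPE_F16); // attention weights
    ctx_size += (size_t) n_layer * 2 * n_embd*n_ff   * ggml_type_sizef(GGML_TYPE_F16); // feed-forward weights
    ctx_size += (size_t)(6 + 12 * n_layer) * 256;    // per-tensor ggml object overhead (assumed constant)

    struct ggml_init_params params = {
        /*.mem_size   =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
    };
    struct ggml_context * ctx = ggml_init(params);
    // Any later ggml_new_tensor_*() call that would exceed ctx_size fails with
    // the "not enough space in the context's memory pool" message seen above.

(If one of those terms undercounts the 70-layer tensor set, e.g. bias or layer-norm tensors, padding ctx_size by an extra 1-2GB is a crude but effective way to get past the allocation failure and reach the actual load.)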

Out of curiosity, and adding a question on top of yours: how much of your 420GB of RAM did you use to convert to ggml? I barely managed to convert bloomz-7b1 using 32GB of RAM, so I wonder how much the 176B model needs.

laurentperez avatar Mar 18 '23 21:03 laurentperez

> Out of curiosity, and adding a question on top of yours: how much of your 420GB of RAM did you use to convert to ggml? I barely managed to convert bloomz-7b1 using 32GB of RAM, so I wonder how much the 176B model needs.

All of it, plus approx. 30GB of virtual memory.

agemagician avatar Mar 18 '23 22:03 agemagician
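(Back-of-envelope scale check, my arithmetic rather than anything from the thread: at fp16 the 176B model needs about

    176e9 params × 2 bytes/param ≈ 352GB

for the weights alone, so a converter that holds both the source checkpoint shards and the converted tensors in memory at the same time will blow well past 420GB unless it streams tensor by tensor.)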

It seems you are running out of memory. Most probably I can help reduce the memory usage to 1/6th (this was successful with the 7b1 model). What is the model size (disk usage) of the 176B model? Please share a link to download the quantized model, because my server does not have the RAM (>400GB) to quantize the 176B model. I will then see if I am able to run it.

bil-ash avatar Mar 20 '23 01:03 bil-ash
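(For reference: bil-ash is presumably describing 4-bit quantization, which in bloomz.cpp is a separate post-conversion step using the bundled quantize tool. The paths below are illustrative, and the trailing 2 selecting the q4_0 type reflects that era's tooling as best I recall, so treat this as a sketch:

    ./quantize ./models/ggml-model-bloomz-f16.bin ./models/ggml-model-bloomz-f16-q4_0.bin 2

Depending on the version, the tool may still need to stage large tensors in memory while it works through the f16 file, hence the RAM concern above.)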

The disk size of the model is approx. 360GB. Unfortunately, quantization doesn't work; please see: https://github.com/huggingface/optimum/issues/901

I don't think it is an out-of-memory problem, as there is 420GB of main memory plus 50GB of swap.

agemagician avatar Mar 20 '23 07:03 agemagician

Same question here, although I have 1000GB of RAM.

ZhangYunchenY avatar Mar 21 '23 12:03 ZhangYunchenY

./main -m models/bloom/ggml-model-bloom-f16-q4_0.bin -t 96 -p "The most beautiful question is" -n 20
main: seed = 1680447842
bloom_model_load: loading model from 'models/bloom/ggml-model-bloom-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 14336
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 112
bloom_model_load: n_layer = 70
bloom_model_load: f16     = 2
bloom_model_load: n_ff    = 57344
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 106877.59 MB
bloom_model_load: memory_size = 3920.00 MB, n_mem = 35840
bloom_model_load: loading model part 1/1 from 'models/bloom/ggml-model-bloom-f16-q4_0.bin'
bloom_model_load: ......................................................................................................... done
bloom_model_load: model size = 107237.48 MB / num tensors = 846

main: prompt: 'The most beautiful question is'
main: number of tokens in prompt = 5
 2175 -> 'The'
 6084 -> ' most'
40704 -> ' beautiful'
 5893 -> ' question'
  632 -> ' is'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

The most beautiful question is the one you ask yourself. What are we doing here? I don't understand this at all! L

main: mem per token = 192093564 bytes
main:     load time = 65292.77 ms
main:   sample time = 498.68 ms
main:  predict time = 407606.25 ms / 16983.59 ms per token
main:    total time = 537545.81 ms

barsuna avatar Apr 02 '23 15:04 barsuna

Above was produced with this commit https://github.com/barsuna/bloomz.cpp/commit/2d0e478c653d078554af0188c90c7081ff0b3059

barsuna avatar Apr 02 '23 15:04 barsuna
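(Some quick arithmetic on barsuna's numbers, mine rather than from the thread: the q4_0 file is 107237.48 MB, roughly a third of the ~333GB f16 model, and per weight that is

    107237.48 MB × 2^20 bytes/MB × 8 bits / 176e9 weights ≈ 5.1 bits/weight

which matches q4_0's layout of one fp32 scale per block of 32 4-bit weights, i.e. 20 bytes per 32 weights = 5 bits/weight. At ~17s per token on 96 threads, generation dominates the ~538s total, so this demonstrates correctness more than usable speed.)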

I have a cluster running Scientific Linux with basically unlimited RAM but 4x15GB of VRAM that I can test things on. If anybody gets a GGML model that is worth testing, tell me.

bozo32 avatar May 28 '23 10:05 bozo32

Getting lost in this thread. I just converted the 176B model into GGML fp16 and am now looking at using bloomz.cpp, but I noticed that @barsuna's README appears to reflect that there are still problems. Could we get a status update? It doesn't look like his code is a pull request, or that this code has been updated to solve the issue, but I am not sure.

linuxmagic-mp avatar Jun 09 '23 00:06 linuxmagic-mp