bloomz.cpp
Quantization doesn't work with Bloomz 176B
Hello,
I have successfully converted the bloomz 176B model to fp16. However, quantization doesn't work and throws an error:
./quantize ./models/ggml-model-bloomz-f16.bin ./models/ggml-model-bloomz-f16-q4_0.bin 2
bloom_model_quantize: loading model from './models/ggml-model-bloomz-f16.bin'
bloom_model_quantize: n_vocab = 250880
bloom_model_quantize: n_ctx = 512
bloom_model_quantize: n_embd = 14336
bloom_model_quantize: n_mult = 1
bloom_model_quantize: n_head = 112
bloom_model_quantize: n_layer = 70
bloom_model_quantize: f16 = 1
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_default_append
Aborted (core dumped)
Any idea how this could be fixed?
same question
Unfortunately I need more details than that :/ Did you try other models like 7B1, and did they work? Do you only get this problem with 176B?
Yes, it works for BLOOMZ-560M and BLOOMZ-7B1. I get the same problem shown in @agemagician's error message.
Oh, seeing https://github.com/NouamaneTazi/bloomz.cpp/issues/15, it seems you have already solved this issue? What was the problem? @agemagician
I cannot run inference with the 176B FP16 model, even though I have 1 TB of RAM. I get the same error message that @agemagician shows in #15. It works for 560M and 7B1.
#15 is about the fp16 model, not the 4-bit model.
Here is where it seems to crash...
$ g++ -I. -I./examples -g -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize
$ gdb --args ./quantize ./models/bloom/ggml-model-bloom-f16.bin ./models/bloom/ggml-model-bloomz-f16-q4_0.bin 2
Reading symbols from ./quantize...
(gdb)
(gdb) list 190
185 if (ftype != 0 && ftype != 1) {
186 fprintf(stderr, "%s: unsupported ftype %d for integer quantization\n", func, ftype);
187 return false;
188 }
189
190 if (ftype == 1) {
191 data_f16.resize(nelements);
192 finp.read(reinterpret_cast<char *>(data_f16.data()), nelements * sizeof(ggml_fp16_t));
193 data_f32.resize(nelements);
194 for (int i = 0; i < nelements; ++i) {
(gdb) break 190
Breakpoint 1 at 0x7796: file quantize.cpp, line 190.
(gdb) run
Starting program: ./quantize ./models/bloom/ggml-model-bloom-f16.bin ./models/bloom/ggml-model-bloomz-f16-q4_0.bin 2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bloom_model_quantize: loading model from './models/bloom/ggml-model-bloom-f16.bin'
bloom_model_quantize: n_vocab = 250880
bloom_model_quantize: n_ctx = 512
bloom_model_quantize: n_embd = 14336
bloom_model_quantize: n_mult = 1
bloom_model_quantize: n_head = 112
bloom_model_quantize: n_layer = 70
bloom_model_quantize: f16 = 1
Breakpoint 1, bloom_model_quantize (fname_inp="./models/bloom/ggml-model-bloom-f16.bin", fname_out="./models/bloom/ggml-model-bloomz-f16-q4_0.bin", itype=2) at quantize.cpp:190
190         if (ftype == 1) {
(gdb) next
191             data_f16.resize(nelements);
(gdb) frame
#0  bloom_model_quantize (fname_inp="./models/bloom/ggml-model-bloom-f16.bin", fname_out="./models/bloom/ggml-model-bloomz-f16-q4_0.bin", itype=2) at quantize.cpp:191
191             data_f16.resize(nelements);
(gdb) p nelements
$1 = -698351616
(gdb) p data_f16
$2 = std::vector of length 0, capacity 0
(gdb) next
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb)
(gdb) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7a77859 in __GI_abort () at abort.c:79
#2 0x00007ffff7e72911 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff7e7e38c in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff7e7e3f7 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff7e7e6a9 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007ffff7e75326 in std::__throw_length_error(char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000555555560da7 in std::vector<unsigned short, std::allocator
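For context, the negative nelements in the trace is consistent with 32-bit integer overflow on the largest tensor: the word-embedding matrix has n_vocab × n_embd = 250880 × 14336 = 3,596,615,680 elements, which exceeds INT32_MAX and wraps to exactly -698,351,616. Passing that negative value to std::vector::resize() is then what produces the std::length_error from vector::_M_default_append. A minimal standalone sketch of the arithmetic (variable names are illustrative, not taken from quantize.cpp):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int32_t n_vocab = 250880;
    const int32_t n_embd  = 14336;

    // 250880 * 14336 = 3,596,615,680 > INT32_MAX, so a 32-bit counter wraps:
    const int32_t wrapped = (int32_t)((uint32_t)n_vocab * (uint32_t)n_embd);

    // Computed in 64 bits, the element count is fine:
    const int64_t widened = (int64_t)n_vocab * (int64_t)n_embd;

    printf("32-bit element count: %d\n",   wrapped);             // -698351616
    printf("64-bit element count: %lld\n", (long long)widened);  // 3596615680

    // Passing the negative 32-bit value to std::vector::resize() converts it to
    // an enormous size_t that exceeds max_size(), so libstdc++ throws
    // std::length_error from vector::_M_default_append -- the same abort seen
    // in the backtrace above.
    return 0;
}
```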
Quantization for 176B works with this commit: https://github.com/barsuna/bloomz.cpp/commit/2d0e478c653d078554af0188c90c7081ff0b3059
Inference is also working.
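(I haven't checked what that commit actually changes, but given the overflow shown above, the kind of fix one would expect is computing the element count in 64 bits before the resize/read calls. A rough sketch of that idea, reusing the names from the gdb listing; the ne[] / n_dims variables are assumptions about the surrounding code, so treat this as an illustration rather than the actual patch:)

```cpp
// Sketch of an overflow-safe element count inside bloom_model_quantize();
// ne[] holds the per-dimension tensor sizes read from the model file.
int32_t ne[2] = { 1, 1 };
size_t  nelements = 1;                       // widened from a 32-bit int to avoid wrap-around
for (int i = 0; i < n_dims; ++i) {
    finp.read(reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
    nelements *= static_cast<size_t>(ne[i]);
}

if (ftype == 1) {
    data_f16.resize(nelements);              // 3,596,615,680 fp16 values (~7.2 GB) for the embedding
    finp.read(reinterpret_cast<char *>(data_f16.data()),
              nelements * sizeof(ggml_fp16_t));
    data_f32.resize(nelements);
    // ... convert each fp16 value to fp32, as in the original loop ...
}
```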
Can we get a clearer status update? Your README isn't clear about whether everything is good with 176B quantization. I am still having a problem with it on bloom.cpp, and I'm not sure where things stand on your side. Any word on whether the patches/fixes will get into bloom.cpp?
Hi @barsuna
Thank you very much for making your fork to fix quantising with 176B. I recently quantised BloomZ 176B and Bloom Chat 176B to GPTQ and released them to the HF Hub, and today I wanted to do GGML as well. I hit the issue described in this thread, and your fork enabled me to quantise the models.
Unfortunately there appears to be an inference problem. I was wondering if you saw this too, and might have any idea what is wrong?
The issue is that it seems to be leaving words out, or skipping over words. Here are some examples testing q4_0 with BloomChat 176B (the issue is the same with BloomZ 176B):
<human>: write a story about llamas\n<bot>: Once upon a time, in the land of Spain there were two small and fluffy creatures known as Llama's. They lived happily together with their names were Fred and Wilma they way down. For many years, The one day when the started having a great fun adventure traveling across mountainside through new lands discovering different cultures and new things, but upon arriving to them until 1 night stoped at an o nices or so called Peru. They met a town that beforested hot them some there new friends along with this amazing of many llamas where they way side.
they road. The next day 2 ther being
<human>: write a story about llamas\n<bot>: Once upon a time there were two llamas named Mac and Cheese who wanted to get out of their boring home in the farm. They heard some where they could find a new friends on an exciting place called city full of adventures. The found a train ride.
</s> [end of text]
<human>:tell me about Paris\n<bot>: The City of Light
Paris, often simply known as Paris (UK: /ˈpaɪərz/;[2] US:z; French: [paʁi]), is the capital and most populous city of France. With a country which forms part of Île-de-France region on the northern الباطل Peninsula italy or Normandy layead)
[note also called Pasde lay in with overal-Paris, parisi/ (French pronunciation: [pajونٹیʁis i<sup>jɛ̃]), is capitalregion Paris;[3][4]) and often shortened Parigi basilica a latinu Seine
Paris claiments: Pari),[5] or just
The story outputs start off coherent, but then it's as if the model suddenly skips forward a few words in the sentence. The Paris output is half coherent, half not, and again it looks like bits are missing.
Is there any chance you might know what is wrong, or could look into fixing it? If so, I will be able to release 176B GGMLs to the HF Hub, and there are quite a few people who would love to try them.
Thanks in advance.