bloomz.cpp

Quantization doesn't work with Bloomz 176B

Open agemagician opened this issue 1 year ago • 10 comments

Hello,

I have successfully converted the bloomz 176B model to fp16. However, the quantization doesn't work and throws an error:

```
./quantize ./models/ggml-model-bloomz-f16.bin ./models/ggml-model-bloomz-f16-q4_0.bin 2
bloom_model_quantize: loading model from './models/ggml-model-bloomz-f16.bin'
bloom_model_quantize: n_vocab = 250880
bloom_model_quantize: n_ctx   = 512
bloom_model_quantize: n_embd  = 14336
bloom_model_quantize: n_mult  = 1
bloom_model_quantize: n_head  = 112
bloom_model_quantize: n_layer = 70
bloom_model_quantize: f16     = 1
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted (core dumped)
```

Any idea how this could be fixed?

agemagician avatar Mar 18 '23 19:03 agemagician

same question

ZhangYunchenY avatar Mar 21 '23 12:03 ZhangYunchenY

Unfortunately I need more details than that :/ Did you try other models like 7B1, and do they work? Do you only get this problem with 176B?

NouamaneTazi avatar Mar 22 '23 10:03 NouamaneTazi

Yes, it works for BLOOMZ-560m and BLOOMZ-7B1. I get the same problem shown in @agemagician's error message.

ZhangYunchenY avatar Mar 22 '23 10:03 ZhangYunchenY

Oh, seeing https://github.com/NouamaneTazi/bloomz.cpp/issues/15, it seems you have already solved this issue? What was the problem? @agemagician

NouamaneTazi avatar Mar 22 '23 10:03 NouamaneTazi

> Oh, seeing #15, it seems you have already solved this issue? What was the problem? @agemagician

I cannot run inference with the 176B FP16 model, even though I have 1 TB of RAM, and I get the same error message as @agemagician in #15. It works for 560M and 7B1.

ZhangYunchenY avatar Mar 22 '23 11:03 ZhangYunchenY

In #15 I was using the fp16 model, not the 4-bit model.

agemagician avatar Mar 22 '23 11:03 agemagician

Here is where it seems to crash...

```
$ g++ -I. -I./examples -g -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize
$ gdb --args ./quantize ./models/bloom/ggml-model-bloom-f16.bin ./models/bloom/ggml-model-bloomz-f16-q4_0.bin 2
Reading symbols from ./quantize...
(gdb) list 190
185             if (ftype != 0 && ftype != 1) {
186                 fprintf(stderr, "%s: unsupported ftype %d for integer quantization\n", func, ftype);
187                 return false;
188             }
189
190             if (ftype == 1) {
191                 data_f16.resize(nelements);
192                 finp.read(reinterpret_cast<char *>(data_f16.data()), nelements * sizeof(ggml_fp16_t));
193                 data_f32.resize(nelements);
194                 for (int i = 0; i < nelements; ++i) {
(gdb) break 190
Breakpoint 1 at 0x7796: file quantize.cpp, line 190.
(gdb) run
Starting program: ./quantize ./models/bloom/ggml-model-bloom-f16.bin ./models/bloom/ggml-model-bloomz-f16-q4_0.bin 2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bloom_model_quantize: loading model from './models/bloom/ggml-model-bloom-f16.bin'
bloom_model_quantize: n_vocab = 250880
bloom_model_quantize: n_ctx   = 512
bloom_model_quantize: n_embd  = 14336
bloom_model_quantize: n_mult  = 1
bloom_model_quantize: n_head  = 112
bloom_model_quantize: n_layer = 70
bloom_model_quantize: f16     = 1

Breakpoint 1, bloom_model_quantize (fname_inp="./models/bloom/ggml-model-bloom-f16.bin", fname_out="./models/bloom/ggml-model-bloomz-f16-q4_0.bin", itype=2) at quantize.cpp:190
190             if (ftype == 1) {
(gdb) next
191                 data_f16.resize(nelements);
(gdb) frame
#0  bloom_model_quantize (fname_inp="./models/bloom/ggml-model-bloom-f16.bin", fname_out="./models/bloom/ggml-model-bloomz-f16-q4_0.bin", itype=2) at quantize.cpp:191
191                 data_f16.resize(nelements);
(gdb) p nelements
$1 = -698351616
(gdb) p data_f16
$2 = std::vector of length 0, capacity 0
(gdb) next
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7a77859 in __GI_abort () at abort.c:79
#2  0x00007ffff7e72911 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff7e7e38c in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff7e7e3f7 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff7e7e6a9 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7e75326 in std::__throw_length_error(char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000555555560da7 in std::vector<unsigned short, std::allocator<unsigned short> >::_M_check_len (this=0x7fffffffd800, __n=18446744073011200000, __s=0x5555555b9454 "vector::_M_default_append") at /usr/include/c++/7/bits/stl_vector.h:1505
#8  0x000055555555eecc in std::vector<unsigned short, std::allocator<unsigned short> >::_M_default_append (this=0x7fffffffd800, __n=18446744073011200000) at /usr/include/c++/7/bits/vector.tcc:568
#9  0x000055555555dd81 in std::vector<unsigned short, std::allocator<unsigned short> >::resize (this=0x7fffffffd800, __new_size=18446744073011200000) at /usr/include/c++/7/bits/stl_vector.h:692
#10 0x000055555555b7c0 in bloom_model_quantize (fname_inp="./models/bloom/ggml-model-bloom-f16.bin", fname_out="./models/bloom/ggml-model-bloomz-f16-q4_0.bin", itype=2) at quantize.cpp:191
#11 0x000055555555c49f in main (argc=4, argv=0x7fffffffdf68) at quantize.cpp:316
```

barsuna avatar Mar 26 '23 14:03 barsuna

Quantization for 176B works with this commit: https://github.com/barsuna/bloomz.cpp/commit/2d0e478c653d078554af0188c90c7081ff0b3059. Inference also works.

barsuna avatar Apr 02 '23 15:04 barsuna

Can we get a clearer status update? The README doesn't make it clear whether 176B quantization now works, and I am still having a problem with it on bloomz.cpp, so I'm not sure where things stand. Any word on whether the patches/fixes will make it into bloomz.cpp?

linuxmagic-mp avatar Jun 09 '23 01:06 linuxmagic-mp

Hi @barsuna

Thank you very much for making your fork to fix quantising with 176B. I recently quantised BloomZ 176B and Bloom Chat 176B to GPTQ and released to HF Hub, and today wanted to do GGML as well. I hit the issue described in this thread and your fork enabled me to quantise the models.

Unfortunately there appears to be an inference problem. I was wondering if you saw this too, and might have any idea what is wrong?

The issue is that it seems to leave words out, or skip over them. Here are some examples testing q4_0 with BloomChat 176B (the issue is the same with BloomZ 176B):

```
<human>: write a story about llamas\n<bot>: Once upon a time, in the land of Spain there were two small and fluffy creatures known as Llama's.  They lived happily together with their names were Fred and Wilma they way down. For many years,  The one day when the started having a great fun adventure traveling across mountainside through new lands discovering different cultures and new things, but upon arriving to them until 1 night stoped at an o nices or so called Peru.  They met a town that beforested hot them some there new friends along with this amazing of many llamas where they way side.

they road. The next day 2 ther  being
<human>: write a story about llamas\n<bot>: Once upon a time there were two llamas named Mac and Cheese who wanted to get out of their boring home in the farm. They heard some where they could find a new friends on an exciting place called city full of adventures.  The found a train ride.
</s> [end of text]
<human>:tell me about Paris\n<bot>: The City of Light
Paris, often simply known as Paris (UK: /ˈpaɪərz/;[2] US:z; French: [paʁi]), is the capital and most populous city of France. With a country which forms part of Île-de-France region on the northern الباطل Peninsula italy or Normandy layead)

[note also called Pasde lay in with overal-Paris, parisi/ (French pronunciation: [pajونٹیʁis i<sup>jɛ̃]), is capitalregion Paris;[3][4]) and often shortened Parigi basilica a latinu Seine
Paris claiments: Pari),[5] or just
```

The story outputs start off coherent, but then it's as if the model suddenly skips forward in the sentence by a few words. The Paris output is half coherent, half not, and again looks like bits are missing.

Is there any chance you might know what is wrong, or could look into fixing it? If so, I will be able to release 176B GGMLs to the HF Hub, and there are quite a few people who would love to try them.

Thanks in advance.

TheBloke avatar Jul 08 '23 20:07 TheBloke