
Issue with open_llama_3b quantization by GPTQ-for-LLaMa

Iambestfeed opened this issue 1 year ago · 5 comments

(exllama) dungnt@symato:~/ext_hdd/repos/gau/exllama$ python test_benchmark_inference.py -d /home/dungnt/ext_hdd/repos/Nhan/GPTQ-for-LLaMa/checkpoints/open_llama_3b/ -v -ppl
 -- Perplexity:
 -- - Dataset: datasets/wikitext2_val_sample.jsonl
 -- - Chunks: 100
 -- - Chunk size: 2048 -> 2048
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: /home/dungnt/ext_hdd/repos/Nhan/GPTQ-for-LLaMa/checkpoints/open_llama_3b/tokenizer.model
 -- Model config: /home/dungnt/ext_hdd/repos/Nhan/GPTQ-for-LLaMa/checkpoints/open_llama_3b/config.json
 -- Model: /home/dungnt/ext_hdd/repos/Nhan/GPTQ-for-LLaMa/checkpoints/open_llama_3b/llama3b-4bit-128g.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['validate', 'perplexity']
 ** Time, Load model: 0.85 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 1,956.36 MB
 -- Loading dataset...
 -- Testing 100 chunks..........
 ** Perplexity: nan
 -- Testing 8 chunks.
 ** Perplexity (reconstruct): nan
 -- Testing 8 chunks.
 ** Perplexity (quant, token): nan
Traceback (most recent call last):
  File "/home/dungnt/ext_hdd/repos/gau/exllama/test_benchmark_inference.py", line 245, in <module>
    text = generator.generate_simple("To be or not to be, that is the", max_new_tokens = 20 * args.validate)
  File "/home/dungnt/ext_hdd/repos/gau/exllama/generator.py", line 317, in generate_simple
    token = self.gen_single_token()
  File "/home/dungnt/ext_hdd/repos/gau/exllama/generator.py", line 351, in gen_single_token
    token, _ = self.batched_sample(logits,
  File "/home/dungnt/ext_hdd/repos/gau/exllama/generator.py", line 66, in batched_sample
    if logits.shape[0] == 1: return self.sample(logits, temperature, top_k, top_p, min_p, typical, num)
  File "/home/dungnt/ext_hdd/repos/gau/exllama/generator.py", line 149, in sample
    sampled_ind = torch.multinomial(top_probs, top_probs.shape[-1] if num == -1 else min(num, top_probs.shape[-1]))
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

I tried using GPTQ-for-LLaMa to convert the open_llama_3b model to 4-bit, and when I tested it with exllama I got the above error. I don't know if I did something wrong?
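
For context on the traceback: the `RuntimeError` from `torch.multinomial` means the probabilities handed to the sampler already contain nan/inf, i.e. the problem originates in the forward pass rather than in the sampler. A minimal diagnostic sketch (hypothetical, not part of exllama's generator.py) that makes this explicit:

```python
# Hypothetical diagnostic, not exllama code: fail early if the model's logits
# already contain nan/inf, instead of letting torch.multinomial raise later.
import torch

def sample_checked(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 40):
    if torch.isnan(logits).any() or torch.isinf(logits).any():
        raise RuntimeError("model produced nan/inf logits; the problem is upstream of sampling")
    probs = torch.softmax(logits / temperature, dim=-1)
    top_probs, top_ids = torch.topk(probs, top_k, dim=-1)
    idx = torch.multinomial(top_probs, 1)        # the same call that fails in generator.py
    return torch.gather(top_ids, -1, idx)
```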

Iambestfeed avatar Jul 01 '23 11:07 Iambestfeed

I haven't tested a 3b model, or anything OpenLlama for that matter. Would you mind sharing the quantized model on HF? I can give it a test and see what's up. It might have oddly shaped tensors that the kernels aren't accounting for, or something.

turboderp avatar Jul 01 '23 12:07 turboderp

> I haven't tested a 3b model, or anything OpenLlama for that matter. Would you mind sharing the quantized model on HF? I can give it a test and see what's up. It might have oddly shaped tensors that the kernels aren't accounting for, or something.

Sorry for the delayed response. I have uploaded the checkpoint folder to Hugging Face; you can find it here: https://huggingface.co/iambestfeed/open_llama_3b_4bit_128g

Some more information: I can run example_basic.py and example_chatbot.py, but in the chatbot the above error appears after about 8-10 messages. With example_batch.py I get a checkpoint-not-found error.

Iambestfeed avatar Jul 01 '23 17:07 Iambestfeed

It seems that 3b uses a head dimension of 100, which is a strange departure from 128 for all the other models. Some of the CUDA kernels assumed it would at least be divisible by 32. I fixed that with the latest commit, and the model seems to work.
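
For illustration, the arithmetic behind that head dimension (values from the open_llama_3b config; the warp-size framing is only meant to show the divisibility assumption, not the actual kernel code):

```python
# open_llama_3b config: hidden_size 3200, 32 attention heads
hidden_size = 3200
num_heads = 32
head_dim = hidden_size // num_heads   # 100, vs. 128 for the 7b/13b/30b/65b models

warp_size = 32                        # CUDA threads per warp
print(head_dim % warp_size)           # 4 -> a loop that strides the head dimension in
                                      #      warp-sized steps needs an explicit tail,
                                      #      otherwise it reads/writes past the end
```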

I'm not sure what results I'm supposed to be seeing, though. I get somewhat reasonable generations out of it:

My name is Lewis and I like to play the guitar. I am a very good guitar player. I have been playing for a long time. I know you can hear my voice on this album.
I was a little bit disappointed by the sound of the album, but I think it's a bit better than listening to the original albums. The songs are more melodic and less noisy. I think the album has more variety in sound

But perplexity is really bad, 15.8 on the GPTQ-for-LLaMa equivalent test. And the chatbot is spewing semi-coherent nonsense. It could be that OpenLlama-3b is just that bad, or it could be a higher sensitivity to quantization, or it could be that there's still another bug somewhere.

Did you convert it with --eval to get a perplexity score from GPTQ-for-LLaMa?

turboderp avatar Jul 01 '23 20:07 turboderp

@turboderp This is the output I get when I use --eval:

(base) dungnt@symato:~/ext_hdd/repos/Nhan/GPTQ-for-LLaMa$ CUDA_VISIBLE_DEVICES=0 python llama.py /home/dungnt/ext_hdd/repos/Nhan/GPTQ-for-LLaMa/checkpoints/open_llama_3b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama3b-4bit-128g.safetensors --eval
[2023-07-02 10:17:40,648] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Found cached dataset json (/home/dungnt/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Found cached dataset json (/home/dungnt/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Token indices sequence length is longer than the specified maximum sequence length for this model (3622 > 2048). Running this sequence through the model will result in indexing errors
Starting ...
Ready.
Quantizing layer 1/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 1071.673     | -          | -         | 0.662 |
| self_attn.v_proj | 29.379       | -          | -         | 0.335 |
| self_attn.q_proj | 723.734      | -          | -         | 0.333 |
| self_attn.o_proj | 0.965        | -          | -         | 0.337 |
| mlp.up_proj      | 162.397      | -          | -         | 0.369 |
| mlp.gate_proj    | 174.016      | -          | -         | 0.367 |
| mlp.down_proj    | 3.513        | -          | -         | 0.981 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 2/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 1097.061     | -          | -         | 0.341 |
| self_attn.v_proj | 64.411       | -          | -         | 0.343 |
| self_attn.q_proj | 986.209      | -          | -         | 0.343 |
| self_attn.o_proj | 5.976        | -          | -         | 0.342 |
| mlp.up_proj      | 758.565      | -          | -         | 0.378 |
| mlp.gate_proj    | 830.632      | -          | -         | 0.369 |
| mlp.down_proj    | 16.768       | -          | -         | 0.999 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 3/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 1529.622     | -          | -         | 0.347 |
| self_attn.v_proj | 183.794      | -          | -         | 0.339 |
| self_attn.q_proj | 1411.783     | -          | -         | 0.338 |
| self_attn.o_proj | 9.497        | -          | -         | 0.345 |
| mlp.up_proj      | 1518.676     | -          | -         | 0.374 |
| mlp.gate_proj    | 1722.314     | -          | -         | 0.369 |
| mlp.down_proj    | 64.279       | -          | -         | 0.995 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 4/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 4931.145     | -          | -         | 0.346 |
| self_attn.v_proj | 849.857      | -          | -         | 0.338 |
| self_attn.q_proj | 4632.041     | -          | -         | 0.341 |
| self_attn.o_proj | 15.999       | -          | -         | 0.348 |
| mlp.up_proj      | 2443.873     | -          | -         | 0.374 |
| mlp.gate_proj    | 2826.377     | -          | -         | 0.370 |
| mlp.down_proj    | 67.572       | -          | -         | 0.996 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 5/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 8163.153     | -          | -         | 0.345 |
| self_attn.v_proj | 1767.942     | -          | -         | 0.338 |
| self_attn.q_proj | 7928.585     | -          | -         | 0.339 |
| self_attn.o_proj | 20.767       | -          | -         | 0.344 |
| mlp.up_proj      | 3479.955     | -          | -         | 0.370 |
| mlp.gate_proj    | 4040.075     | -          | -         | 0.363 |
| mlp.down_proj    | 108.068      | -          | -         | 0.975 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 6/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 8945.602     | -          | -         | 0.340 |
| self_attn.v_proj | 1934.298     | -          | -         | 0.332 |
| self_attn.q_proj | 8608.621     | -          | -         | 0.334 |
| self_attn.o_proj | 31.003       | -          | -         | 0.337 |
| mlp.up_proj      | 4383.026     | -          | -         | 0.367 |
| mlp.gate_proj    | 5172.729     | -          | -         | 0.361 |
| mlp.down_proj    | 189.382      | -          | -         | 0.972 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 7/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 12174.979    | -          | -         | 0.339 |
| self_attn.v_proj | 3123.801     | -          | -         | 0.333 |
| self_attn.q_proj | 12600.689    | -          | -         | 0.334 |
| self_attn.o_proj | 55.012       | -          | -         | 0.338 |
| mlp.up_proj      | 5385.198     | -          | -         | 0.369 |
| mlp.gate_proj    | 6593.714     | -          | -         | 0.363 |
| mlp.down_proj    | 253.930      | -          | -         | 0.975 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 8/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 11439.061    | -          | -         | 0.340 |
| self_attn.v_proj | 3044.984     | -          | -         | 0.333 |
| self_attn.q_proj | 11265.320    | -          | -         | 0.334 |
| self_attn.o_proj | 93.335       | -          | -         | 0.338 |
| mlp.up_proj      | 6299.399     | -          | -         | 0.369 |
| mlp.gate_proj    | 7828.980     | -          | -         | 0.362 |
| mlp.down_proj    | 355.626      | -          | -         | 0.979 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 9/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 13223.848    | -          | -         | 0.340 |
| self_attn.v_proj | 3554.508     | -          | -         | 0.333 |
| self_attn.q_proj | 13011.017    | -          | -         | 0.335 |
| self_attn.o_proj | 133.958      | -          | -         | 0.339 |
| mlp.up_proj      | 7104.735     | -          | -         | 0.369 |
| mlp.gate_proj    | 8435.023     | -          | -         | 0.363 |
| mlp.down_proj    | 456.513      | -          | -         | 0.974 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 10/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 11763.854    | -          | -         | 0.338 |
| self_attn.v_proj | 3345.717     | -          | -         | 0.332 |
| self_attn.q_proj | 11509.735    | -          | -         | 0.333 |
| self_attn.o_proj | 222.783      | -          | -         | 0.336 |
| mlp.up_proj      | 7483.551     | -          | -         | 0.367 |
| mlp.gate_proj    | 8503.588     | -          | -         | 0.362 |
| mlp.down_proj    | 554.916      | -          | -         | 0.974 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 11/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 16446.061    | -          | -         | 0.340 |
| self_attn.v_proj | 4855.263     | -          | -         | 0.332 |
| self_attn.q_proj | 16620.887    | -          | -         | 0.333 |
| self_attn.o_proj | 262.490      | -          | -         | 0.337 |
| mlp.up_proj      | 8744.042     | -          | -         | 0.369 |
| mlp.gate_proj    | 9967.312     | -          | -         | 0.363 |
| mlp.down_proj    | 710.306      | -          | -         | 0.974 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 12/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 15529.008    | -          | -         | 0.340 |
| self_attn.v_proj | 5408.280     | -          | -         | 0.333 |
| self_attn.q_proj | 16053.766    | -          | -         | 0.335 |
| self_attn.o_proj | 375.175      | -          | -         | 0.338 |
| mlp.up_proj      | 9440.375     | -          | -         | 0.369 |
| mlp.gate_proj    | 10503.391    | -          | -         | 0.363 |
| mlp.down_proj    | 830.995      | -          | -         | 0.977 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 13/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 16376.096    | -          | -         | 0.340 |
| self_attn.v_proj | 5793.265     | -          | -         | 0.334 |
| self_attn.q_proj | 16784.537    | -          | -         | 0.335 |
| self_attn.o_proj | 462.980      | -          | -         | 0.339 |
| mlp.up_proj      | 10050.391    | -          | -         | 0.371 |
| mlp.gate_proj    | 11123.831    | -          | -         | 0.364 |
| mlp.down_proj    | 1036.977     | -          | -         | 0.978 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 14/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 17616.238    | -          | -         | 0.340 |
| self_attn.v_proj | 6468.990     | -          | -         | 0.332 |
| self_attn.q_proj | 17997.941    | -          | -         | 0.336 |
| self_attn.o_proj | 584.440      | -          | -         | 0.338 |
| mlp.up_proj      | 11101.712    | -          | -         | 0.368 |
| mlp.gate_proj    | 12068.192    | -          | -         | 0.362 |
| mlp.down_proj    | 1356.283     | -          | -         | 0.974 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 15/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 20549.129    | -          | -         | 0.340 |
| self_attn.v_proj | 8770.324     | -          | -         | 0.333 |
| self_attn.q_proj | 21843.734    | -          | -         | 0.334 |
| self_attn.o_proj | 773.261      | -          | -         | 0.339 |
| mlp.up_proj      | 12603.724    | -          | -         | 0.370 |
| mlp.gate_proj    | 13776.650    | -          | -         | 0.363 |
| mlp.down_proj    | 1886.490     | -          | -         | 0.977 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 16/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 20403.945    | -          | -         | 0.340 |
| self_attn.v_proj | 8810.106     | -          | -         | 0.332 |
| self_attn.q_proj | 20551.609    | -          | -         | 0.334 |
| self_attn.o_proj | 924.665      | -          | -         | 0.337 |
| mlp.up_proj      | 14529.102    | -          | -         | 0.381 |
| mlp.gate_proj    | 15872.477    | -          | -         | 0.364 |
| mlp.down_proj    | 2511.627     | -          | -         | 0.977 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 17/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 21110.586    | -          | -         | 0.341 |
| self_attn.v_proj | 9723.964     | -          | -         | 0.334 |
| self_attn.q_proj | 20994.396    | -          | -         | 0.336 |
| self_attn.o_proj | 1095.254     | -          | -         | 0.339 |
| mlp.up_proj      | 16710.539    | -          | -         | 0.370 |
| mlp.gate_proj    | 18699.088    | -          | -         | 0.364 |
| mlp.down_proj    | 3511.426     | -          | -         | 0.977 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 18/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 21547.160    | -          | -         | 0.342 |
| self_attn.v_proj | 11019.793    | -          | -         | 0.333 |
| self_attn.q_proj | 21703.992    | -          | -         | 0.335 |
| self_attn.o_proj | 1496.042     | -          | -         | 0.339 |
| mlp.up_proj      | 19358.082    | -          | -         | 0.369 |
| mlp.gate_proj    | 22052.793    | -          | -         | 0.362 |
| mlp.down_proj    | 4979.649     | -          | -         | 0.976 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 19/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 22299.379    | -          | -         | 0.341 |
| self_attn.v_proj | 12046.275    | -          | -         | 0.333 |
| self_attn.q_proj | 22375.555    | -          | -         | 0.334 |
| self_attn.o_proj | 1778.020     | -          | -         | 0.338 |
| mlp.up_proj      | 22315.984    | -          | -         | 0.370 |
| mlp.gate_proj    | 25480.859    | -          | -         | 0.363 |
| mlp.down_proj    | 6703.147     | -          | -         | 0.978 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 20/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 23014.500    | -          | -         | 0.340 |
| self_attn.v_proj | 15080.064    | -          | -         | 0.332 |
| self_attn.q_proj | 23925.051    | -          | -         | 0.334 |
| self_attn.o_proj | 2064.781     | -          | -         | 0.339 |
| mlp.up_proj      | 25293.453    | -          | -         | 0.369 |
| mlp.gate_proj    | 28469.402    | -          | -         | 0.363 |
| mlp.down_proj    | 8811.789     | -          | -         | 0.982 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 21/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 24245.307    | -          | -         | 0.341 |
| self_attn.v_proj | 15919.733    | -          | -         | 0.335 |
| self_attn.q_proj | 24663.289    | -          | -         | 0.335 |
| self_attn.o_proj | 1872.579     | -          | -         | 0.338 |
| mlp.up_proj      | 28313.117    | -          | -         | 0.371 |
| mlp.gate_proj    | 31252.977    | -          | -         | 0.365 |
| mlp.down_proj    | 10586.656    | -          | -         | 0.981 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 22/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 24362.797    | -          | -         | 0.340 |
| self_attn.v_proj | 16061.008    | -          | -         | 0.335 |
| self_attn.q_proj | 24946.242    | -          | -         | 0.336 |
| self_attn.o_proj | 2134.037     | -          | -         | 0.340 |
| mlp.up_proj      | 32341.148    | -          | -         | 0.369 |
| mlp.gate_proj    | 34749.535    | -          | -         | 0.365 |
| mlp.down_proj    | 12466.901    | -          | -         | 0.977 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 23/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 25858.715    | -          | -         | 0.341 |
| self_attn.v_proj | 19672.799    | -          | -         | 0.333 |
| self_attn.q_proj | 26381.551    | -          | -         | 0.335 |
| self_attn.o_proj | 3365.131     | -          | -         | 0.339 |
| mlp.up_proj      | 36791.969    | -          | -         | 0.369 |
| mlp.gate_proj    | 38136.703    | -          | -         | 0.363 |
| mlp.down_proj    | 15536.670    | -          | -         | 0.979 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 24/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 25296.309    | -          | -         | 0.340 |
| self_attn.v_proj | 17271.219    | -          | -         | 0.333 |
| self_attn.q_proj | 25645.676    | -          | -         | 0.337 |
| self_attn.o_proj | 3028.885     | -          | -         | 0.338 |
| mlp.up_proj      | 40840.770    | -          | -         | 0.369 |
| mlp.gate_proj    | 41052.309    | -          | -         | 0.363 |
| mlp.down_proj    | 18660.145    | -          | -         | 0.978 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 25/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 25614.988    | -          | -         | 0.339 |
| self_attn.v_proj | 18654.762    | -          | -         | 0.333 |
| self_attn.q_proj | 25714.184    | -          | -         | 0.333 |
| self_attn.o_proj | 3812.069     | -          | -         | 0.338 |
| mlp.up_proj      | 44485.637    | -          | -         | 0.369 |
| mlp.gate_proj    | 43749.414    | -          | -         | 0.362 |
| mlp.down_proj    | 22327.562    | -          | -         | 0.976 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 26/26..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attn.k_proj | 21378.566    | -          | -         | 0.341 |
| self_attn.v_proj | 13922.063    | -          | -         | 0.334 |
| self_attn.q_proj | 20890.971    | -          | -         | 0.336 |
| self_attn.o_proj | 3634.933     | -          | -         | 0.339 |
| mlp.up_proj      | 40836.695    | -          | -         | 0.370 |
| mlp.gate_proj    | 42163.902    | -          | -         | 0.365 |
| mlp.down_proj    | 29832.078    | -          | -         | 0.982 |
+------------------+--------------+------------+-----------+-------+


269.66163897514343
Found cached dataset wikitext (/home/dungnt/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Found cached dataset wikitext (/home/dungnt/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Token indices sequence length is longer than the specified maximum sequence length for this model (2699157 > 2048). Running this sequence through the model will result in indexing errors
wikitext2
Evaluating ...
7.536443710327148

Found cached dataset ptb_text_only (/home/dungnt/.cache/huggingface/datasets/ptb_text_only/penn_treebank/1.1.0/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f)
Found cached dataset ptb_text_only (/home/dungnt/.cache/huggingface/datasets/ptb_text_only/penn_treebank/1.1.0/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f)
Token indices sequence length is longer than the specified maximum sequence length for this model (1113179 > 2048). Running this sequence through the model will result in indexing errors
ptb
Evaluating ...
19.959125518798828

Found cached dataset json (/home/dungnt/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Found cached dataset json (/home/dungnt/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Token indices sequence length is longer than the specified maximum sequence length for this model (3622 > 2048). Running this sequence through the model will result in indexing errors
c4
Evaluating ...
9.593161582946777

Is the perplexity too high? I'm seeing a ppl of almost 20 on the PTB dataset.

Iambestfeed avatar Jul 02 '23 03:07 Iambestfeed

Perplexity will vary depending on the dataset, and 7.53 looks reasonable for wikitext2. It's somewhat worse than what's normal for a 7b model, but that's what you'd expect from 3b. However, I get 15.8 on a test that's supposed to run the exact same sequences as GPTQ-for-LLaMa's wikitext2 test, so there's probably still a bug in ExLlama somewhere. I should have a little time tonight to debug it more thoroughly.
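
For reference, the wikitext2 number both tools are aiming at comes from the standard fixed-chunk evaluation: concatenate the test split, cut it into 2048-token chunks, and exponentiate the mean token loss. A minimal sketch using a Hugging Face model purely for illustration (exllama's evaluator is its own code, and the quantized checkpoint needs its own loader such as AutoGPTQ or exllama):

```python
# Sketch of the standard wikitext2-style perplexity evaluation:
# join the corpus, split into 2048-token chunks, average the token NLL.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_dir = "openlm-research/open_llama_3b"        # fp16 base model for illustration
tok = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16).cuda()

data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tok("\n\n".join(data["text"]), return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for i in range(0, ids.shape[1] // seq_len * seq_len, seq_len):
    chunk = ids[:, i:i + seq_len].cuda()
    with torch.no_grad():
        out = model(chunk, labels=chunk)            # HF shifts labels internally
    nlls.append(out.loss)

print("ppl:", torch.exp(torch.stack(nlls).mean()).item())
```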

turboderp avatar Jul 02 '23 15:07 turboderp

Okay, I found another bug, specifically affecting 3b act-order models. With the latest commit I get ppl = 7.86, and I'm going to write off the difference as this model being somewhat more sensitive to FP16 rounding errors due to the low parameter count.

In any case I'm getting good output now. Expected a bit more than 226 tokens/second, but I guess there's room for optimization still.

turboderp avatar Jul 02 '23 18:07 turboderp

> Okay, I found another bug, specifically affecting 3b act-order models. With the latest commit I get ppl = 7.86, and I'm going to write off the difference as this model being somewhat more sensitive to FP16 rounding errors due to the low parameter count.
>
> In any case I'm getting good output now. Expected a bit more than 226 tokens/second, but I guess there's room for optimization still.

First of all, I would like to sincerely thank you for taking the time to understand the problem and help us resolve it. After doing some research and consulting with members of our team, we have decided to keep focusing on the 7B model. There are several reasons, but mainly the 3B model is still relatively weak, whereas the 7B model is more capable and infers more reliably.

Iambestfeed avatar Jul 03 '23 08:07 Iambestfeed

I'm quite happy that the 3b model works, anyway. I'm not surprised that it's very limited compared to 7b, but it could still be useful as a draft model for speculative sampling, for instance.
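
To make the draft-model idea concrete, here is a toy sketch of greedy speculative decoding. The "models" are stand-in callables over a tiny vocabulary, and the target is queried per position instead of in one batched pass, so this only illustrates the propose/verify loop; it is not exllama code:

```python
# Toy sketch of greedy speculative decoding: a cheap draft model proposes k
# tokens, the target model verifies them and keeps the longest agreeing
# prefix plus one corrected token.
import torch

def greedy_speculate(draft, target, prompt, k=4, steps=8):
    seq = list(prompt)
    for _ in range(steps):
        # 1) draft proposes k tokens greedily
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = int(torch.argmax(draft(ctx)))
            proposal.append(t)
            ctx.append(t)
        # 2) target verifies; a real implementation scores all positions in a
        #    single batched forward pass instead of this per-token loop
        new_tokens = []
        for t in proposal:
            want = int(torch.argmax(target(seq + new_tokens)))
            if want == t:
                new_tokens.append(t)        # accepted draft token
            else:
                new_tokens.append(want)     # take the target's token and stop
                break
        seq += new_tokens
    return seq

# stand-in "models": random bigram logits over a 16-token vocabulary
V = 16
torch.manual_seed(0)
W_draft, W_target = torch.randn(V, V), torch.randn(V, V)
draft  = lambda ctx: W_draft[ctx[-1]]       # next-token logits given the last token
target = lambda ctx: W_target[ctx[-1]]

print(greedy_speculate(draft, target, prompt=[1]))
```

The output matches what the target model alone would produce with greedy decoding; the speed-up comes from the target verifying several draft tokens per forward pass.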

turboderp avatar Jul 03 '23 11:07 turboderp


That's exactly what I am trying to implement at the moment! So I am very thankful for your work making it run so damn fast. Do you think there is still room for optimization somewhere for even faster inference? I only occasionally get 200 tokens/s, but on average more like 170 (which is still plenty fast, don't get me wrong, thank you so much!)

SinanAkkoyun avatar Jul 13 '23 12:07 SinanAkkoyun

There is some room for optimization, yes, but it's difficult to keep tweaking ExLlama as long as every minor change has the potential to break something people have started to rely on. I think it will be a better use of my time to just keep 3b models in mind for V2. I don't think 500+ tokens/second is unrealistic for 3b.

turboderp avatar Jul 13 '23 12:07 turboderp

I totally get that, thank you very much!

SinanAkkoyun avatar Jul 13 '23 13:07 SinanAkkoyun

@Iambestfeed How did you quantize the 3B model specifically? I tried GPTQ-for-LLaMa (there is no 3B option when running llama.py); AutoGPTQ seems to work, but I wanted to know what you did, with which dataset and which parameters.

Thank you

SinanAkkoyun avatar Jul 17 '23 08:07 SinanAkkoyun

> @Iambestfeed How did you quantize the 3B model specifically? I tried GPTQ-for-LLaMa (there is no 3B option when running llama.py); AutoGPTQ seems to work, but I wanted to know what you did, with which dataset and which parameters.
>
> Thank you

Hi, to answer your question: I used AutoGPTQ with its basic usage example, mainly to check that the code works correctly. Personally, I've found that the dataset used for quantization has a huge impact, so I'm still building my own dataset; in the meantime I suggest you use C4. As for the configuration, 4-bit with group size 128, as in AutoGPTQ.
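
For reference, a minimal sketch along the lines of AutoGPTQ's basic-usage example with that configuration (4-bit, group size 128, a small C4 calibration set). The output path, sample count and desc_act choice are placeholders, and the exact API can differ between AutoGPTQ versions:

```python
# Sketch of AutoGPTQ basic usage: quantize open_llama_3b to 4-bit, group size 128,
# calibrated on a handful of C4 samples, then save as safetensors.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset

base_model = "openlm-research/open_llama_3b"
tok = AutoTokenizer.from_pretrained(base_model, use_fast=False)

quant_cfg = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(base_model, quant_cfg)

# small C4 calibration set; more (and better-matched) data generally helps
c4 = load_dataset("allenai/c4",
                  data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
                  split="train")
examples = [tok(c4[i]["text"], return_tensors="pt") for i in range(128)]

model.quantize(examples)
model.save_quantized("open_llama_3b_4bit_128g", use_safetensors=True)
```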

Iambestfeed avatar Jul 17 '23 08:07 Iambestfeed

> Hi, to answer your question: I used AutoGPTQ with its basic usage example, mainly to check that the code works correctly. Personally, I've found that the dataset used for quantization has a huge impact, so I'm still building my own dataset; in the meantime I suggest you use C4. As for the configuration, 4-bit with group size 128, as in AutoGPTQ.

Thank you very much :)

SinanAkkoyun avatar Jul 17 '23 09:07 SinanAkkoyun