I wanted to understand what the beauty of this technology is
Greetings to the developers of the new method. I launched your Google Colab and got very dubious output. Can you explain a little about the essence of the breakthrough behind this compression? I may have misunderstood something about how to use it, but judging by the output, the model seems completely lost.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the 2-bit AQLM-quantized Llama-2 7B checkpoint onto the GPU.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf", trust_remote_code=True, torch_dtype=torch.float16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained("BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf")

inputs = tokenizer(["Write a poem about python"], return_tensors="pt")["input_ids"].cuda()
streamer = TextStreamer(tokenizer)
_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=120)
<s> Write a poem about python.
Write a poem about python.
Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem
Hi! It is, in fact, quite stuck :)
I believe what you're observing is text degeneration, which happens because the model is decoding greedily, and greedy decoding makes it inclined to repeat itself [1].
Try generating from the same model with do_sample=True, and possibly add a temperature, e.g.
quantized_model.generate(inputs, streamer=streamer, max_new_tokens=128,
                         do_sample=True,
                         temperature=0.8,
                         top_p=0.9)
[1] https://arxiv.org/abs/1904.09751
inputs = tokenizer(["Write a poem about the Python language"], return_tensors="pt")["input_ids"].cuda()
streamer = TextStreamer(tokenizer)
_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=128,
                             do_sample=True,
                             temperature=0.8,
                             top_p=0.9)
Write a poem about the Python language. Python has a lot of different ways to write code. If you've used Python before, you're probably familiar with the following: Python has a lot of different ways to write code. If you've used Python before, you're probably familiar with the following: 1) A Python file is a file that contains a Python program. 2) A Python file is a file that contains a Python program. 3) A Python file is a file that contains a Python program. 4) A Python file is a file that contains a Python program. 5) A Python file is a file that contains
Same result.
Hello!
I believe the issue may not be related to quantization. I ran your prompt on the non-quantized Llama-2 7B and got repetition too (see images below). This is a known problem; you can tune the generation parameters (e.g. repetition_penalty=1.1, etc.) to reduce the probability of getting repetitions.
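For reference, here is a minimal sketch of the same generate call with a repetition penalty added; the value 1.1 is just a starting point to experiment with, not a recommended setting.
# Same sampling settings as above, plus a penalty on already-generated tokens;
# values slightly above 1.0 (roughly 1.1-1.3) usually reduce looping.
_ = quantized_model.generate(
    inputs,
    streamer=streamer,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.1,
)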
P.S. BTW, Llama-2 7B is not an instruction-tuned model, so it won't work well in this kind of scenario. For chat-like behavior it is better to use instruction-tuned models (see the sketch below).
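As an illustration only: the sketch below assumes the gated official meta-llama/Llama-2-7b-chat-hf checkpoint; an AQLM-quantized chat checkpoint, if one is published, would be loaded the same way. It shows chat-style generation via the tokenizer's chat template rather than raw text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Instruction-tuned checkpoint used purely as an example (access on the Hub is gated).
chat_id = "meta-llama/Llama-2-7b-chat-hf"
chat_model = AutoModelForCausalLM.from_pretrained(chat_id, torch_dtype=torch.float16).cuda()
chat_tokenizer = AutoTokenizer.from_pretrained(chat_id)

# Wrap the request in the model's chat template instead of passing a bare prompt.
messages = [{"role": "user", "content": "Write a poem about the Python language"}]
chat_inputs = chat_tokenizer.apply_chat_template(messages, return_tensors="pt").cuda()

streamer = TextStreamer(chat_tokenizer)
_ = chat_model.generate(chat_inputs, streamer=streamer, max_new_tokens=128,
                        do_sample=True, temperature=0.8, top_p=0.9)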
Yes, indeed, the output of the original model is not great either :)
@Vahe1994, in your repos you have both AQLM and SpQR. Which of these methods do you think is better, and why? Maybe one grew out of the other? Can you explain?
Also a question: when you quantized Llama-2 7B, how much memory did the quantization script use at its peak?
I am not @Vahe1994, but to the best of my knowledge, the AQLM paper compares its results against SpQR.
For instance, see Table 10 on page 15 of the AQLM paper.
It suggests that AQLM is better at 4-bit compression, and other tables report similar results for 2 and 3 bits.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.