I wanted to understand what the beauty of this technology is
Greetings to the developers of the new method. I launched your Google Colab and got very dubious output. Can you explain a little about the essence of the breakthrough behind this compression? I may have misunderstood something about how to use it, but judging by the output, the model seems completely lost.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the 2-bit AQLM-quantized Llama-2 7B checkpoint onto the GPU.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf", trust_remote_code=True, torch_dtype=torch.float16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained("BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf")

inputs = tokenizer(["Write a poem about python"], return_tensors="pt")["input_ids"].cuda()
streamer = TextStreamer(tokenizer)
_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=120)
<s> Write a poem about python.
Write a poem about python.
Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem
Hi! It is, in fact, quite stuck :)
I believe what you're observing is text degeneration, which happens because the model is decoding greedily, and greedy decoding makes it inclined to repeat itself [1].
Try generating from the same model with do_sample=True, and possibly add a temperature, e.g.
quantized_model.generate(inputs, streamer=streamer, max_new_tokens=128,
                         do_sample=True,
                         temperature=0.8,
                         top_p=0.9)
[1] https://arxiv.org/abs/1904.09751
inputs = tokenizer(["Write a poem about the Python language"], return_tensors="pt")["input_ids"].cuda()
streamer = TextStreamer(tokenizer)
_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=128,
                             do_sample=True,
                             temperature=0.8,
                             top_p=0.9)
Write a poem about the Python language. Python has a lot of different ways to write code. If you've used Python before, you're probably familiar with the following: Python has a lot of different ways to write code. If you've used Python before, you're probably familiar with the following: 1) A Python file is a file that contains a Python program. 2) A Python file is a file that contains a Python program. 3) A Python file is a file that contains a Python program. 4) A Python file is a file that contains a Python program. 5) A Python file is a file that contains
Same result.
Hello!
I believe the issue may not be related to quantization. I ran your prompt on the non-quantized Llama-2 7B and got repetition too (see images below). This is a known problem; you can tune the generation parameters (e.g. repetition_penalty=1.1, etc.) to reduce the probability of getting repetitions.
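For reference, here is a minimal sketch of the same generate call with a repetition penalty added; the value 1.1 is just a starting point to experiment with, not a recommended setting.
# Same sampling settings as above, plus a penalty on already-generated tokens;
# values slightly above 1.0 (roughly 1.1-1.3) usually reduce looping.
_ = quantized_model.generate(
    inputs,
    streamer=streamer,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.1,
)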
P.S. BTW, Llama-2 7B is not an instruction-tuned model, so it won't work well in this kind of scenario. For chat-like behavior it is better to use instruction-tuned models (see the sketch below).
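As an illustration only: the sketch below assumes the gated official meta-llama/Llama-2-7b-chat-hf checkpoint; an AQLM-quantized chat checkpoint, if one is published, would be loaded the same way. It shows chat-style generation via the tokenizer's chat template rather than raw text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Instruction-tuned checkpoint used purely as an example (access on the Hub is gated).
chat_id = "meta-llama/Llama-2-7b-chat-hf"
chat_model = AutoModelForCausalLM.from_pretrained(chat_id, torch_dtype=torch.float16).cuda()
chat_tokenizer = AutoTokenizer.from_pretrained(chat_id)

# Wrap the request in the model's chat template instead of passing a bare prompt.
messages = [{"role": "user", "content": "Write a poem about the Python language"}]
chat_inputs = chat_tokenizer.apply_chat_template(messages, return_tensors="pt").cuda()

streamer = TextStreamer(chat_tokenizer)
_ = chat_model.generate(chat_inputs, streamer=streamer, max_new_tokens=128,
                        do_sample=True, temperature=0.8, top_p=0.9)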
Yes, indeed, the output of the original model is not great either :)
@Vahe1994, in your repos you have both AQLM and SpQR. Which of these methods do you think is better, and why? Maybe one grew out of the other? Can you explain?
Also a question: when you quantized Llama-2 7B, how much memory did the quantization script use at its peak?
I am not @Vahe1994, but to the best of my knowledge, the AQLM paper compares its results against SpQR.
For instance, see Table 10 on page 15 of the AQLM paper.
It suggests that AQLM is better at 4-bit compression, and other tables report similar results for 2 and 3 bits.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.