
Please provide Q_2 for Llama 3.1 405B

Open gileneusz opened this issue 1 year ago • 10 comments

This quantisation is missing rn...

gileneusz avatar Jul 23 '24 20:07 gileneusz

I think you can use the ollama command to generate a quantized version of the model to your specification. See the help, it looks something like:

❯ ollama create -h
Create a model from a Modelfile

Usage:
  ollama create MODEL [flags]

Flags:
  -f, --file string       Name of the Modelfile (default "Modelfile")
  -h, --help              help for create
  -q, --quantize string   Quantize model to this level (e.g. q4_0)
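
For illustration, a minimal sketch of that flow, assuming you already have an unquantized (fp16) GGUF of the model on disk; the file and model names below are placeholders, and the set of levels accepted by -q depends on your Ollama version (check ollama create -h):

# Point a Modelfile at the local fp16 GGUF, then let ollama create quantize on import.
echo 'FROM ./Meta-Llama-3.1-405B-Instruct-f16.gguf' > Modelfile
ollama create llama3.1-405b-q2 -f Modelfile -q q2_K
ollama run llama3.1-405b-q2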

JamesStallings avatar Jul 23 '24 23:07 JamesStallings

It's no big deal for a small model, but for a big model like 405B it will take ages to download the full fp16 and quantize it.

It would be amazing if the Ollama library had other quants for Llama 3.1 405B.

chigkim avatar Jul 24 '24 00:07 chigkim

We currently don't have access to the full fp16 version on Ollama; the default configuration for the 405B model is Q_4. Maybe it's possible to "downquantize" it to Q_2, or alternatively obtain the full fp16 weights from elsewhere. Both options involve considerable effort, and I will need to determine whether the 405B in Q_2 offers a significant advantage over the 70B in FP16. 😅

gileneusz avatar Jul 24 '24 06:07 gileneusz

We currently don't have access to the full fp16 version on Ollama

I thought Meta provides llama3.1-405b in fp16? Can't the Ollama team just convert it to GGUF and quantize based on that?
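
For reference, the usual llama.cpp route for that would look roughly like the sketch below. This is not something anyone in this thread has confirmed for 405B: the script and binary names have changed across llama.cpp releases, the build step varies (make vs. cmake), and the intermediate fp16 GGUF would need on the order of 800 GB of disk.

# Rough sketch: convert Meta's fp16 safetensors to GGUF, then quantize with llama.cpp.
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
pip install -r requirements.txt
make llama-quantize                      # build step may differ by llama.cpp version
python convert_hf_to_gguf.py /path/to/Meta-Llama-3.1-405B-Instruct \
  --outtype f16 --outfile llama-3.1-405b-instruct-f16.gguf
./llama-quantize llama-3.1-405b-instruct-f16.gguf llama-3.1-405b-instruct-q2_K.gguf Q2_K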

chigkim avatar Jul 24 '24 07:07 chigkim

I mean we, the users. I send some of my posts to GPT for grammar correction, and it sometimes messes them up.

gileneusz avatar Jul 24 '24 07:07 gileneusz

For now, you can download q2 from here and import to Ollama. https://huggingface.co/mradermacher/Meta-Llama-3.1-405B-Instruct-GGUF/tree/main
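
For anyone following along, the import flow is roughly the sketch below; the file names are placeholders, and a multi-part GGUF from Hugging Face may first need to be merged (e.g. with llama.cpp's llama-gguf-split --merge), depending on your Ollama version.

# Merge split parts if needed (file names here are just examples), then import via a Modelfile.
./llama-gguf-split --merge Meta-Llama-3.1-405B-Instruct.Q2_K-00001-of-00004.gguf merged-q2_K.gguf
echo 'FROM ./merged-q2_K.gguf' > Modelfile
ollama create llama3.1-405b-q2_K -f Modelfile
ollama run llama3.1-405b-q2_K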

chigkim avatar Jul 24 '24 19:07 chigkim

I tried to do that from this exact source and imported it into Ollama, but it failed to run the imported model properly. I have no experience with adding external models yet, so I'll wait a few days until someone figures out how to do it :)

gileneusz avatar Jul 24 '24 20:07 gileneusz

Well, I guess we should be able to quantize lower-precision versions like Q2 and Q1 from the Q4 that is already provided? However, that means doing some manual work locally... Since people have reported Llama 3 70B to still be decent in Q2, Q1 may really be exciting with 405B. Can Ollama or llama.cpp do this? Maybe it's a single command, right?
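
Requantizing from an already-quantized GGUF is generally discouraged because it compounds the quantization error, but llama.cpp's quantize tool does have a flag for it; a rough sketch, with placeholder file names:

# Requantize an existing Q4 GGUF down to Q2_K (quality will be worse than quantizing from fp16).
./llama-quantize --allow-requantize llama-3.1-405b-instruct-q4_0.gguf \
  llama-3.1-405b-instruct-q2_K.gguf Q2_K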

kozuch avatar Jul 28 '24 10:07 kozuch

For now, you can download q2 from here and import to Ollama. https://huggingface.co/mradermacher/Meta-Llama-3.1-405B-Instruct-GGUF/tree/main

This appears to have been deleted, which is a little concerning.

gwillen avatar Jul 28 '24 21:07 gwillen

Try this one then: https://huggingface.co/bullerwins/Meta-Llama-3.1-405B-Instruct-GGUF/tree/main

chigkim avatar Jul 28 '24 23:07 chigkim

Would it be possible for someone who has done it to upload it to Ollama on their account and share the link? Best

igorschlum avatar Aug 02 '24 17:08 igorschlum

The Llama 3.1 models are being updated: https://ollama.com/library/llama3.1/tags and 405B is now provided in multiple quantizations (Q2... Q8). @gileneusz Please close this issue to keep issues under 1000.

igorschlum avatar Aug 06 '24 07:08 igorschlum

Great. But can we also request Q1 for experimenting?

kozuch avatar Aug 06 '24 08:08 kozuch

Why quantization to 1 bit (Q1) is ineffective:
- Loss of precision: Quantizing to 1 bit means each weight and activation is represented by a single bit, allowing only two states (e.g., -1 and 1). This drastically reduces precision. Large language models (LLMs) rely heavily on the precision of weights, and such an extreme reduction can significantly degrade the model's performance.
- Complex information: LLMs handle complex textual data and subtle relationships between words. Q1 quantization cannot capture these nuances, leading to a significant drop in output quality.
- Inference performance: The loss of information due to aggressive quantization can increase inference errors, possibly requiring costly correction mechanisms or additional inference steps, negating any speed and efficiency gains.

Why quantization to 2 bits (Q2) is not ideal:
- Marginal improvement: Although Q2 offers a slight improvement over Q1 by allowing four possible states, it is still very limited compared to 4-, 8-, 16-, or 32-bit representations. Fine nuances in weights and activations are still largely lost, which can noticeably degrade model performance.

igorschlum avatar Aug 06 '24 08:08 igorschlum

This is theory that has not really been tested yet on a model as large as 405B. Real-world tests done to date have shown that lower-bit quantizations hold up better the bigger the model is. Llama 3 70B still somehow performs in Q2, so 405B may well also run in Q1. That is why I am still suggesting doing Q1 here; I would really appreciate it, as it may show interesting things.

It is thought that LLMs have some kind of redundancy built into the network (which may compensate for quantization), so I am really excited to see where the quantization limit is for 405B.

@igorschlum Did you make the Q2 quants yourself or where do they come from?

kozuch avatar Aug 06 '24 08:08 kozuch

@kozuch do you mean checking whether Q1 405B would be better than FP16 70B? Then it could be worth adding Q1 405B as well...

I'll close the issue tomorrow if no more answers come in by then.

gileneusz avatar Aug 06 '24 09:08 gileneusz

@kozuch interesting, I will try the 405b-instruct-q2_K model on my Mac Studio with 192 GB of RAM and see how well I can work with it. I've seen that we can ask macOS to allocate more than 66% of the RAM to the GPU. I would like Meta to launch a 180B model like Falcon did 9 months ago; it used 101 GB of RAM and that was perfect.

igorschlum avatar Aug 06 '24 09:08 igorschlum

@igorschlum it seems like there is an issue with running q2_K:

ollama run llama3.1:405b-instruct-q2_K 
Error: llama runner process has terminated: error:done_getting_tensors: wrong number of tensors; expected 1138, got 1137

I'm using a Mac Studio with 192 GB.

You can allocate up to 188 GB of VRAM: https://www.reddit.com/r/LocalLLaMA/comments/192uirj/188gb_vram_on_mac_studio_m2_ultra_easy/
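
From the linked post, the knob is a sysctl on Apple Silicon; a hedged sketch, where the exact key differs across macOS versions (older builds used debug.iogpu.wired_limit) and the setting resets on reboot:

# Raise the GPU wired-memory limit (value in MB; this example is ~176 GiB on a 192 GB machine).
sudo sysctl iogpu.wired_limit_mb=180224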

gileneusz avatar Aug 06 '24 10:08 gileneusz

Well, the 405b-instruct-q2_K has about 2.98 bits per weight (bpw, 151/405*8), so that is almost 3 bits. We also need lower Q2 quants like Q2_K_S.
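
The bpw figure follows directly from file size and parameter count, assuming the ~151 GB size shown on the tags page:

# bits per weight = file size in GB x 8 bits / parameters in billions
echo "scale=2; 151 * 8 / 405" | bc    # -> 2.98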

@gileneusz You should report the bug above in a separate issue.

I did not know that you can allocate standard RAM to the GPU on a Mac. Very interesting.

kozuch avatar Aug 06 '24 13:08 kozuch

Also, there are only quants for 405B instruct, not the base (text) model. The Ollama library recently uploaded the base model for 8B and 70B, but not 405B.

chigkim avatar Aug 06 '24 13:08 chigkim

@gileneusz I searched on Google for the error "llama runner process has terminated: error: done_getting_tensors: wrong number of tensors" and it seems that this issue should be resolved in the latest versions of Llama 3.1 and llama.cpp.

Ollama version 0.3.4 is currently available. Have you tried it to see if the problem is resolved? The Llama 3.1 models are being updated right now on ollama.com, and I hope that Ollama 0.3.4 will work well with those updated models.

igorschlum avatar Aug 06 '24 21:08 igorschlum

@igorschlum It runs after updating to 0.3.3; however, it fills all 192 GB of RAM and even swaps about 10 GB. I need to run this on a fresh system, since I have a lot of other stuff loaded into shared RAM; then it should work. Inference is very slow and the quality is not better than q4_0 70B, but that's my opinion after just a few tokens 😅

Without 2xH100 or 4xH100 it doesn't make any sense...

But I'm going to have access to that hardware soon, so those quants will be very useful 😇

gileneusz avatar Aug 07 '24 01:08 gileneusz

@gileneusz I will use Ollama for translation, so I hope that 405B works well. I will see. Thank you.

igorschlum avatar Aug 08 '24 16:08 igorschlum

It seems like Meta updated 405B?

https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f

"Without explanation, Meta changed the number of KV heads from 16 to 8 (which now matches the whitepaper) for the 405B model. This is not just a config change, the whole model has been updated 😵"

https://www.reddit.com/r/LocalLLaMA/comments/1eoin62/meta_just_pushed_a_new_llama_31_405b_to_hf/

chigkim avatar Aug 10 '24 12:08 chigkim

I tried some prompts and yes, the results are worse with 405b q3_K_S than with 70b q8_0. I tried ollama run llama3.1:405b-instruct-q3_K_S and, after loading and swapping, I got an Error: Killed. The model is 195 GB and, at best, my Mac only has 192 GB.

igorschlum avatar Aug 11 '24 22:08 igorschlum