Please provide Q_2 for Llama 3.1 405B
This quantisation is missing rn...
I think you can use the ollama command to generate a quantized version of the model to your specification. See the help, it looks something like:
```
❯ ollama create -h
Create a model from a Modelfile

Usage:
  ollama create MODEL [flags]

Flags:
  -f, --file string       Name of the Modelfile (default "Modelfile")
  -h, --help              help for create
  -q, --quantize string   Quantize model to this level (e.g. q4_0)
```
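For example, with a Modelfile that points at the full-precision weights, something like this should build a Q2_K variant locally (the model name and path are placeholders, and this assumes q2_K is among the accepted quantization levels):

```sh
# Sketch only: quantize at import time with `ollama create -q`.
# Modelfile contents (path is illustrative):
#   FROM /models/Meta-Llama-3.1-405B-Instruct-fp16.gguf
ollama create llama3.1-405b:q2_K -q q2_K -f Modelfile
ollama run llama3.1-405b:q2_K
```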
It's no big deal for small models, but for a big model like 405B it will take ages to download the full fp16 and quantize it.
It would be amazing if the Ollama library had other quants for Llama 3.1 405B.
We currently don't have access to the full fp16 version on Ollama; the default configuration for the 405B model is Q_4. Maybe it's possible to "downquantize" it to Q_2, or alternatively obtain the full fp16 version from elsewhere. Both options involve considerable effort, and I will need to determine whether the 405B in Q_2 offers a significant advantage over the 70B in FP16. 😅
> We currently don't have access to the full fp16 version on Ollama
I thought Meta provides Llama 3.1 405B in fp16? Can't the Ollama team just convert it to GGUF and quantize from that?
I meant we, the users. I send some of my posts to GPT for grammar correction and it sometimes messes them up.
For now, you can download q2 from here and import to Ollama. https://huggingface.co/mradermacher/Meta-Llama-3.1-405B-Instruct-GGUF/tree/main
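If it helps, the usual import flow looks roughly like this (paths and names are illustrative; if the download comes in multiple parts, they may need to be joined first, either with plain `cat` or with llama.cpp's gguf-split tool, depending on how they were split):

```sh
# Modelfile contents (path is illustrative):
#   FROM ./Meta-Llama-3.1-405B-Instruct.Q2_K.gguf
ollama create llama3.1-405b:q2_K -f Modelfile
ollama run llama3.1-405b:q2_K
```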
I tried to do that from this exact source and imported it into Ollama, but it failed to run the imported model properly. I have no experience with adding external models yet, so I'll wait a few days until someone figures out how to do it :)
Well, I guess we should be able to quantize lower-precision versions like Q2 and Q1 from the Q4 that is already provided? However, that means doing some manual work locally... Since people have reported Llama 3 70B to still be decent at Q2, Q1 may really be exciting with 405B. Can Ollama or llama.cpp do this? Maybe it's a single command?
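llama.cpp's quantize tool can technically re-quantize an already-quantized GGUF, though going Q4 → Q2 compounds the quality loss, so the result will likely be worse than a Q2 made from fp16. A rough sketch (file names are illustrative; older llama.cpp builds call the binary `quantize` instead of `llama-quantize`):

```sh
# Re-quantize an existing Q4 GGUF down to Q2_K.
# --allow-requantize is needed because the input is already quantized,
# and quality can suffer versus quantizing from fp16.
./llama-quantize --allow-requantize \
    Meta-Llama-3.1-405B-Instruct.Q4_0.gguf \
    Meta-Llama-3.1-405B-Instruct.Q2_K.gguf \
    Q2_K
```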
> For now, you can download q2 from here and import to Ollama. https://huggingface.co/mradermacher/Meta-Llama-3.1-405B-Instruct-GGUF/tree/main
This appears to have been deleted, which is a little concerning.
Try this one then: https://huggingface.co/bullerwins/Meta-Llama-3.1-405B-Instruct-GGUF/tree/main
Would it be possible for someone who has done it to upload it to Ollama on their account and share the link? Best
Llama 3.1 is being updated: https://ollama.com/library/llama3.1/tags and 405B is now provided in multiple quantizations (Q2... Q8). @gileneusz Please close this issue to keep the number of open issues under 1000.
Great. But can we also request Q1 for experimenting?
**Why quantization to 1 bit (Q1) is ineffective:**
- **Loss of precision:** Quantizing to 1 bit means each weight and activation is represented by a single bit, allowing only two states (e.g., -1 and 1). This drastically reduces precision. Large language models (LLMs) rely heavily on the precision of weights, and such an extreme reduction can significantly degrade the model's performance.
- **Complex information:** LLMs handle complex textual data and subtle relationships between words. Q1 quantization cannot capture these nuances, leading to a significant drop in output quality.
- **Inference performance:** The loss of information due to aggressive quantization can increase inference errors, possibly requiring costly correction mechanisms or additional inference steps, negating any speed and efficiency gains.

**Why quantization to 2 bits (Q2) is not ideal:**
- **Marginal improvement:** Although Q2 offers a slight improvement over Q1 by allowing four possible states, it is still very limited compared to 4-, 8-, 16-, or 32-bit representations. Fine nuances in weights and activations are still largely lost, which can noticeably degrade model performance.
This is theory that has not really been tested yet on a model as large as 405B. Real-world tests done to date have shown that lower-bit quantizations hold up better the bigger the model is. Llama 3 70B still somehow performs at Q2, so 405B may well also run at Q1. That is why I am still suggesting doing Q1 here. I would really appreciate it, as it may show interesting things.
It is thought that LLMs have some kind of redundancy built into the network (which may compensate for quantization), so I am really excited to see where the quantization limit for 405B lies.
@igorschlum Did you make the Q2 quants yourself or where do they come from?
@kozuch do you mean checking whether Q1 405B would be better than FP16 70B? Then it could be worth adding Q1 405B as well...
I'll close the issue tomorrow if no more answers drop by then.
@kozuch interesting, I will try the 405b-instruct-q2_K model on my Mac Studio with 192 GB of RAM and see how it goes. I've seen that we can ask macOS to allocate more than 66% of the RAM to the GPU. I would like Meta to launch a ~180 GB model like Falcon did 9 months ago; it used 101 GB of RAM and that was perfect.
@igorschlum it seems like there is an issue with running q2_K:
```
ollama run llama3.1:405b-instruct-q2_K
Error: llama runner process has terminated: error:done_getting_tensors: wrong number of tensors; expected 1138, got 1137
```
I'm using Mac Studio 192GB
You can allocate up to 188 GB as VRAM: https://www.reddit.com/r/LocalLLaMA/comments/192uirj/188gb_vram_on_mac_studio_m2_ultra_easy/
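For reference, the trick from that thread is raising the GPU wired-memory limit via sysctl; the exact key depends on the macOS version and the setting resets on reboot, so take this as a sketch:

```sh
# Allow the GPU to wire ~188 GB of unified memory on a 192 GB machine.
# The key is iogpu.wired_limit_mb on macOS Sonoma; older releases used
# debug.iogpu.wired_limit. The value resets after a reboot.
sudo sysctl iogpu.wired_limit_mb=192512
```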
Well, the 405b-instruct-q2_K has 2.98 bits per weight (bpw; 151/405 × 8), so that is almost 3 bits. We also need lower Q2 quants like Q2_K_S.
@gileneusz You should report the bug above in a separate issue.
I did not know that you can allocate standard RAM to the GPU on a Mac. Very interesting.
Also, there are only quants for 405B instruct, not the base (text) model. The Ollama library recently uploaded base models for 8B and 70B, but not 405B.
@gileneusz I searched on Google for the error "llama runner process has terminated: error: done_getting_tensors: wrong number of tensors" and it seems that this issue should be resolved in the latest versions of Llama 3.1 and llama.cpp.
Ollama version 0.3.4 is currently available. Have you tried it to see if the problem is resolved? The Llama 3.1 models are being updated right now on ollama.com, and I hope that Ollama 0.3.4 will work well with those updated models.
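A quick way to check, in case it helps (the tag is just the one mentioned above; re-pulling picks up the updated library model):

```sh
# Check the installed Ollama version, then re-pull the refreshed model
ollama -v
ollama pull llama3.1:405b-instruct-q2_K
ollama run llama3.1:405b-instruct-q2_K
```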
@igorschlum It runs after updating to 0.3.3; however, it fills all 192 GB of RAM plus about 10 GB of swap. I need to run this on a fresh system, since I have a lot of other stuff loaded into shared RAM; then it should work. Inference is very slow and the quality is not better than q4_0 70B, but that's my opinion after just a few tokens 😅
Without 2x H100 or 4x H100 it doesn't make any sense...
But I'm going to have access to that hardware soon, so those quants will be very useful 😇
@gileneusz I will use Ollama for translation, so I hope that 405B works well. We will see. Thank you.
It seems like Meta updated 405B?
https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f
"Without explanation, Meta changed the number of KV heads from 16 to 8 (which now matches the whitepaper) for the 405B model. This is not just a config change, the whole model has been updated 😵"
https://www.reddit.com/r/LocalLLaMA/comments/1eoin62/meta_just_pushed_a_new_llama_31_405b_to_hf/
I tried some prompts and yes, the results are worse with 405b q3_K_S than with 70b q8_0. I tried ollama run llama3.1:405b-instruct-q3_K_S and after loading and swapping I got "Error: Killed". The model is 195 GB and at best my Mac has only 192 GB.