Feature Request: Instructions on how to correctly use/convert the original Llama 3.1 Instruct .pth model
Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Can someone please add (or point me to) instructions for correctly setting everything up to go from the .pth weights downloaded from Meta to .gguf (and then onwards to Q8_0)?
I am running a local 8B instance with llama-server and CUDA.
Keep up the great work!
Motivation
With all the half-broken llama3.1 gguf files uploaded to hf by brownie point kids, it would make sense to drop a few words on how to convert and quantize the original/official Meta llama 3.1 weights for use with a local llama.cpp. (Somehow everyone seems to get the weights from hf, but why not source these freely available weights from the actual source?)
My attempts still leave me hazy about whether the rope scaling is done correctly, even though I use the latest transformers (for .pth to .safetensors) and then the latest git version of llama.cpp for convert_hf_to_gguf.py.
The closest description I could find (edit: note that it covers Llama 3, not Llama 3.1 with its larger 128k-token context) is here: https://voorloopnul.com/blog/quantize-and-run-the-original-llama3-8b-with-llama-cpp/
Possible Implementation
Please add a couple of lines on Llama 3.1 "from Meta .pth to GGUF" to a README, or as an answer to this issue.
Hey, here's what I've captured so far. I have the model stored at ~/meta-llama-3.1-8b-instruct.
vector jiff ~/repo $ git clone [email protected]:meta-llama/llama-recipes.git
vector jiff ~/repo $ git clone [email protected]:huggingface/transformers.git
vector jiff ~/repo $ cd transformers
vector jiff ~/repo/transformers $ python3 -m venv .
vector jiff ~/repo/transformers $ source bin/activate
(transformers) vector jiff ~/repo/transformers $ pip install -r ../llama-recipes/requirements.txt
(transformers) vector jiff ~/repo/transformers $ pip install protobuf blobfile
(transformers) vector jiff ~/repo/transformers $ python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ~/meta-llama-3.1-8b-instruct --model_size 8B --llama_version 3.1 --output_dir ~/meta-llama-3.1-8b-instruct
Next, you'll also need to pull some files from the Hugging Face repository into the model directory.
vector jiff ~/meta-llama-3.1-8b-instruct $ git init .
vector jiff ~/meta-llama-3.1-8b-instruct $ git remote add origin https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
vector jiff ~/meta-llama-3.1-8b-instruct $ GIT_LFS_SKIP_SMUDGE=1 git fetch
vector jiff ~/meta-llama-3.1-8b-instruct $ git checkout origin/main -- config.json generation_config.json special_tokens_map.json tokenizer.json tokenizer_config.json
Finally, run the quantisation steps as instructed.
vector jiff ~/repo $ git clone https://github.com/ggerganov/llama.cpp
vector jiff ~/repo $ cd llama.cpp
vector jiff ~/repo/llama.cpp $ git log -1 --pretty=format:"%H - %s" origin/HEAD
afbb4c1322a747d2a7b4bf67c868148f8afcc6c8 - ggml-cuda: Adding support for unified memory (#8035)
vector jiff ~/repo/llama.cpp $ python3 -m venv .
vector jiff ~/repo/llama.cpp $ source bin/activate
(llama.cpp) vector jiff ~/repo/llama.cpp $ pip install -r requirements.txt
(llama.cpp) vector jiff ~/repo/llama.cpp $ pip install transformers~=4.43.3
(llama.cpp) vector jiff ~/repo/llama.cpp $ ln -s ~/meta-llama-3.1-8b-instruct models/meta-llama-3.1-8b-instruct
(llama.cpp) vector jiff ~/repo/llama.cpp $ python3 convert_hf_to_gguf.py --outtype bf16 models/meta-llama-3.1-8b-instruct --outfile models/meta-llama-3.1-8b-instruct/meta-llama-3.1-8b-instruct-bf16.gguf
Unfortunately, I get stuck at this stage with the following error. I haven't been able to resolve this yet.
Traceback (most recent call last):
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 3717, in <module>
main()
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 3711, in main
model_instance.write()
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 401, in write
self.prepare_metadata(vocab_only=False)
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 394, in prepare_metadata
self.set_vocab()
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 1470, in set_vocab
self._set_vocab_sentencepiece()
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 693, in _set_vocab_sentencepiece
tokens, scores, toktypes = self._create_vocab_sentencepiece()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 713, in _create_vocab_sentencepiece
tokenizer.LoadFromFile(str(tokenizer_path))
File "/home/jiff/predictors/repos/llama.cpp/lib/python3.11/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: could not parse ModelProto from models/meta-llama-3.1-8b-instruct/tokenizer.model
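For reference, here is a quick way to peek at what sentencepiece is choking on (a sketch; the path is my local one):

```python
# Sketch: look at the first bytes of the tokenizer.model that sentencepiece rejects.
# The path below is my local one — adjust as needed.
path = "models/meta-llama-3.1-8b-instruct/tokenizer.model"
with open(path, "rb") as f:
    head = f.read(48)
print(head)
# If this prints readable text (base64 tokens followed by ranks), the file is the
# tiktoken-style ranks file from the original Meta download rather than a
# sentencepiece protobuf, which would explain the ModelProto parse failure.
```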
You are missing some python package (protobuf?). Pull it in with pip install.
What you describe is what I gathered from https://voorloopnul.com/blog/quantize-and-run-the-original-llama3-8b-with-llama-cpp/ and I managed to get a working gguf file. I could also quantize to Q8_0 without problems, and the resulting model seems to work alright. What I am just not very sure about is whether the rope scaling for contexts up to 128k works correctly. In light of comments on a similar issue (for example https://www.reddit.com/r/LocalLLaMA/comments/1eeyw0v/i_keep_getting_this_error_in_llama_31_8b_llamacpp/), I always get 291 tensors after my conversions, yet I have never seen llama.cpp complain about expecting 292 (a correctly converted model would have that extra tensor). This is puzzling me.
- Do we need to go through the described procedure (above, or voorloop link) with special flags/parameters?
- Does the quantization need special flags?
- Does the llama-cli or llama-server call need special flags/parameters?
As said before, I would like to see an official description somewhere of how to correctly make use of the official Meta Llama 3.1 (Instruct) weights with llama.cpp, and maybe also a few words on how to test and make sure large contexts work correctly.
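For anyone wanting to check the same thing, something like this should show what actually ended up in the converted file (a sketch, assuming the gguf Python package from llama.cpp's gguf-py is installed, e.g. via pip install gguf; the path is just an example):

```python
# Sketch: print tensor count and rope-related metadata keys from a converted GGUF.
from gguf import GGUFReader

reader = GGUFReader("models/meta-llama-3.1-8b-instruct/meta-llama-3.1-8b-instruct-bf16.gguf")

print("tensor count:", len(reader.tensors))
print("rope_freqs tensor present:", any("rope_freqs" in t.name for t in reader.tensors))

# Any metadata keys mentioning rope or context length hint at whether the
# Llama 3.1 rope-scaling information was picked up during conversion.
for name in reader.fields:
    if "rope" in name or "context_length" in name:
        print(name)
```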
@xocite, I ran into a similar problem and found it could be due to a missing config file, specifically tokenizer_config.json. I haven't fully resolved the issue, but I did make a gist where I used the save_pretrained method to download the tokenizer and tokenizer_config JSON files: https://gist.github.com/brandenvs/2dad0e864fc2a0a5ee176213ae43b902
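The idea is roughly this (a sketch, assuming transformers is installed and your Hugging Face account has access to the gated meta-llama repo):

```python
# Sketch: fetch the tokenizer files from the official HF repo and write them next
# to the converted weights so the llama.cpp converter can find tokenizer.json and
# tokenizer_config.json. Assumes transformers is installed and your account has
# access to the gated meta-llama repo.
import os
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tok.save_pretrained(os.path.expanduser("~/meta-llama-3.1-8b-instruct"))
```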
@scalvin1 I got it running by following your instructions.
I tried again by downloading the model directly from HF, and it converted and quantised fine using the convert_hf_to_gguf.py script.
Your comments are beside the point. What I want is clear instructions for converting the model downloaded directly from Meta into the best possible, feature-complete Q8_0 gguf. I want to cut out the dodgy middlemen and make sure I know what I get.
To spell it out again: Hugging Face is not the primary source of the Llama weights, and there is no reason to trust the uploaders to have done the conversion optimally, especially when there are no clear instructions in this project on how it should be done.
I'm a bit confused, isn't https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct the official Meta repo? It already contains the model converted from PyTorch, so it can be converted further to gguf using the convert_hf_to_gguf.py script.
As final words on this neglected issue, let me summarize my findings.
To the commenter above: Hugging Face is not the official repository. You download the weights directly from Meta here: https://llama.meta.com/llama-downloads/
The next step is converting the weights to Hugging Face format using the transformers library inside a Python venv (this will download gigabytes of wheels...).
Create a virtual environment (as on the voorloop page):
python3 -m venv .venv
Activate the environment
source .venv/bin/activate
Install transformers and its dependencies (possibly also protobuf and a few other missing packages):
pip install transformers transformers[torch] tiktoken blobfile sentencepiece
Convert the weights to Hugging Face format:
python .venv/lib/python3.11/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir Meta-Llama-3.1-8B-Instruct/ --model_size 8B --output_dir hf --llama_version 3.1 --instruct True
After that, convert it to a 32-bit GGUF (to avoid any loss in fidelity):
python3 convert_hf_to_gguf.py --outtype f32 --outfile ../meta-llama-3.1-8B-instruction_f32.gguf ../hf/
Finally, quantize it to whatever fits your hardware. A good choice is 8bit:
llama.cpp-testing/llama-quantize meta-llama-3.1-8B-instruction_f32.gguf meta-llama-3.1-8B-instruction_Q8_0.gguf Q8_0
The above two steps generate gguf models with 291 tensors that seem to work with longer contexts (note that longer contexts need a lot more RAM or VRAM).
Note: I have not validated this approach and was hoping someone in the know could drop some official comments on how to correctly apply the process outlined here.
Anyway, it seems to work for me like this with no completely obvious flaws.
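For a rough check that long contexts actually behave, a needle-in-a-haystack style test against a local llama-server is the kind of thing I have in mind (a sketch; start the server with a large context, e.g. -c 32768, and adjust the port, padding size and needle to your setup):

```python
# Sketch: crude needle-in-a-haystack test against a local llama-server via its
# OpenAI-compatible endpoint. Port and padding size are assumptions — adjust.
import json
import urllib.request

needle = "The secret codeword is PERISCOPE."
padding = "The quick brown fox jumps over the lazy dog. " * 1000  # roughly 10k tokens
prompt = padding + needle + padding + "\nWhat is the secret codeword? Answer with one word."

body = json.dumps({
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 16,
}).encode()

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)["choices"][0]["message"]["content"]

print(reply)  # should mention PERISCOPE if the long context is handled correctly
```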
What I meant was whether the repo itself is the official representation of Meta on Hugging Face, which seems to be the case.
> Your comments are beside the point. What I want is clear instructions for converting the model downloaded directly from Meta into the best possible, feature-complete Q8_0 gguf. I want to cut out the dodgy middlemen and make sure I know what I get.
> To spell it out again: Hugging Face is not the primary source of the Llama weights, and there is no reason to trust the uploaders to have done the conversion optimally, especially when there are no clear instructions in this project on how it should be done.
Hi @scalvin1 - I'm VB from the open source team at Hugging Face. We're not a middleman: the weights uploaded in the meta-llama org are the official weights, converted together with Meta.
The steps you mention are exactly how Meta converted the weights as well. Everything works seamlessly now, but this required changes with respect to RoPE scaling; you can read more about it here: https://github.com/ggerganov/llama.cpp/issues/8650
Let me know if you have any questions! 🤗
Trying to go from downloading the raw Llama 3.1 weights from Meta to using them for inference in Python led me here. It was partly because I wanted to handle the download of the weights manually (rather than pass a repo name as a parameter), and partly because I wanted a better understanding of (and control over) the formats for fine-tuning.
Inspired by this thread and the resources linked here, I put together a guide for taking the raw .pth weights and getting inference running in a Python script with llama.cpp:
https://github.com/codiak/llama-raw-to-py
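For context, the Python end of it is roughly this (a sketch using the llama-cpp-python bindings; model path, context size and GPU offload are assumptions to adjust):

```python
# Sketch: minimal inference over the quantised GGUF with llama-cpp-python
# (pip install llama-cpp-python). Paths and sizes are assumptions — adjust.
from llama_cpp import Llama

llm = Llama(
    model_path="meta-llama-3.1-8B-instruction_Q8_0.gguf",
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload as many layers as possible to the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise llama.cpp in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```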
There is not much to add anymore, it all seems to be working the way it was described. Closing issue.
Thank you for this reply, I was looking for this. I have a question: I have 16 GB RAM and 6 GB VRAM (NVIDIA GeForce RTX 4050).
When I try this command:
python ~/.local/lib/python3.10/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ../../Meta-Llama-3.1-8B-Instruct/original --model_size 8B --output_dir hf --llama_version 3.1 --instruct True
I get a "Killed" error. Is there a workaround for this? Also, the map location is hard-coded as cpu in the convert script.