Feature Request: Instructions on how to correctly use/convert the original Llama 3.1 Instruct .pth model
Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Can someone please add (or point me to) instructions for correctly setting everything up to go from the .pth weights downloaded from Meta to .gguf (and then onwards to Q8_0)?
I am running a local 8B instance with llama-server and CUDA.
Keep up the great work!
Motivation
With all the half-broken llama3.1 gguf files uploaded to hf by brownie point kids, it would make sense to drop a few words on how to convert and quantize the original/official Meta llama 3.1 weights for use with a local llama.cpp. (Somehow everyone seems to get the weights from hf, but why not source these freely available weights from the actual source?)
My attempts still leave me hazy about whether the rope scaling is done correctly, even though I use the latest transformers (for .pth to .safetensors) and then the latest git version of llama.cpp for convert_hf_to_gguf.py.
The closest description I could find (edit: note that it covers Llama 3, not Llama 3.1 with its larger 128k-token context) is here: https://voorloopnul.com/blog/quantize-and-run-the-original-llama3-8b-with-llama-cpp/
Possible Implementation
Please add a couple of lines on Llama 3.1 "from Meta .pth to GGUF" to a README, or as an answer to this issue.
Hey, here's what I've captured so far. I have the model stored at ~/meta-llama-3.1-8b-instruct.
vector jiff ~/repo $ git clone [email protected]:meta-llama/llama-recipes.git
vector jiff ~/repo $ git clone [email protected]:huggingface/transformers.git
vector jiff ~/repo $ cd transformers
vector jiff ~/repo/transformers $ python3 -m venv .
vector jiff ~/repo/transformers $ source bin/activate
(transformers) vector jiff ~/repo/transformers $ pip install -r ../llama-recipes/requirements.txt
(transformers) vector jiff ~/repo/transformers $ pip install protobuf blobfile
(transformers) vector jiff ~/repo/transformers $ python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ~/meta-llama-3.1-8b-instruct --model_size 8B --llama_version 3.1 --output_dir ~/meta-llama-3.1-8b-instruct
Next, you'll also need to pull some files from the Hugging Face repository into the model directory.
vector jiff ~/meta-llama-3.1-8b-instruct $ git init .
vector jiff ~/meta-llama-3.1-8b-instruct $ git remote add origin https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
vector jiff ~/meta-llama-3.1-8b-instruct $ GIT_LFS_SKIP_SMUDGE=1 git fetch
vector jiff ~/meta-llama-3.1-8b-instruct $ git checkout origin/main -- config.json generation_config.json special_tokens_map.json tokenizer.json tokenizer_config.json
Finally, run the quantisation steps as instructed.
vector jiff ~/repo $ git clone https://github.com/ggerganov/llama.cpp
vector jiff ~/repo $ cd llama.cpp
vector jiff ~/repo/llama.cpp $ git log -1 --pretty=format:"%H - %s" origin/HEAD
afbb4c1322a747d2a7b4bf67c868148f8afcc6c8 - ggml-cuda: Adding support for unified memory (#8035)
vector jiff ~/repo/llama.cpp $ python3 -m venv .
vector jiff ~/repo/llama.cpp $ source bin/activate
(llama.cpp) vector jiff ~/repo/llama.cpp $ pip install -r requirements.txt
(llama.cpp) vector jiff ~/repo/llama.cpp $ pip install transformers~=4.43.3
(llama.cpp) vector jiff ~/repo/llama.cpp $ ln -s ~/meta-llama-3.1-8b-instruct models/meta-llama-3.1-8b-instruct
(llama.cpp) vector jiff ~/repo/llama.cpp $ python3 convert_hf_to_gguf.py --outtype bf16 models/meta-llama-3.1-8b-instruct --outfile models/meta-llama-3.1-8b-instruct/meta-llama-3.1-8b-instruct-bf16.gguf
Unfortunately, I get stuck at this stage with the following error. I haven't been able to resolve this yet.
Traceback (most recent call last):
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 3717, in <module>
main()
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 3711, in main
model_instance.write()
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 401, in write
self.prepare_metadata(vocab_only=False)
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 394, in prepare_metadata
self.set_vocab()
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 1470, in set_vocab
self._set_vocab_sentencepiece()
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 693, in _set_vocab_sentencepiece
tokens, scores, toktypes = self._create_vocab_sentencepiece()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 713, in _create_vocab_sentencepiece
tokenizer.LoadFromFile(str(tokenizer_path))
File "/home/jiff/predictors/repos/llama.cpp/lib/python3.11/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: could not parse ModelProto from models/meta-llama-3.1-8b-instruct/tokenizer.model
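For reference, here is a quick way to peek at what sentencepiece is choking on (a sketch; the path is my local one):

```python
# Sketch: look at the first bytes of the tokenizer.model that sentencepiece rejects.
# The path below is my local one — adjust as needed.
path = "models/meta-llama-3.1-8b-instruct/tokenizer.model"
with open(path, "rb") as f:
    head = f.read(48)
print(head)
# If this prints readable text (base64 tokens followed by ranks), the file is the
# tiktoken-style ranks file from the original Meta download rather than a
# sentencepiece protobuf, which would explain the ModelProto parse failure.
```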
You are missing some python package (protobuf?). Pull it in with pip install.
What you describe is what I gathered from https://voorloopnul.com/blog/quantize-and-run-the-original-llama3-8b-with-llama-cpp/ and I managed to get a working gguf file. I could also quantize to Q8_0 without problems, and the resulting model seems to work alright. What I am just not very sure about is whether the rope scaling for contexts up to 128k works correctly. In light of comments on a similar issue (for example https://www.reddit.com/r/LocalLLaMA/comments/1eeyw0v/i_keep_getting_this_error_in_llama_31_8b_llamacpp/), I always get 291 tensors after my conversions, yet I have never seen llama.cpp complain about expecting 292 (a correctly converted model would have that extra tensor). This is puzzling me.
- Do we need to go through the described procedure (above, or voorloop link) with special flags/parameters?
- Does the quantization need special flags?
- Does the llama-cli or llama-server call need special flags/parameters?
As said before, I would like to see an official description somewhere of how to correctly make use of the official Meta Llama 3.1 (Instruct) weights with llama.cpp, and maybe also a few words on how to test and make sure large contexts work correctly.
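For anyone wanting to check the same thing, something like this should show what actually ended up in the converted file (a sketch, assuming the gguf Python package from llama.cpp's gguf-py is installed, e.g. via pip install gguf; the path is just an example):

```python
# Sketch: print tensor count and rope-related metadata keys from a converted GGUF.
from gguf import GGUFReader

reader = GGUFReader("models/meta-llama-3.1-8b-instruct/meta-llama-3.1-8b-instruct-bf16.gguf")

print("tensor count:", len(reader.tensors))
print("rope_freqs tensor present:", any("rope_freqs" in t.name for t in reader.tensors))

# Any metadata keys mentioning rope or context length hint at whether the
# Llama 3.1 rope-scaling information was picked up during conversion.
for name in reader.fields:
    if "rope" in name or "context_length" in name:
        print(name)
```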
@xocite, I ran into a similar problem and found it could be due to a missing config file, specifically tokenizer_config.json. I haven't fully resolved the issue, but I did make a gist where I used the save_pretrained method to download the tokenizer and tokenizer_config JSON files: https://gist.github.com/brandenvs/2dad0e864fc2a0a5ee176213ae43b902
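The idea is roughly this (a sketch, assuming transformers is installed and your Hugging Face account has access to the gated meta-llama repo):

```python
# Sketch: fetch the tokenizer files from the official HF repo and write them next
# to the converted weights so the llama.cpp converter can find tokenizer.json and
# tokenizer_config.json. Assumes transformers is installed and your account has
# access to the gated meta-llama repo.
import os
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tok.save_pretrained(os.path.expanduser("~/meta-llama-3.1-8b-instruct"))
```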
@scalvin1 I got it running by following your instructions.
I tried again by downloading the model directly from HF, and it converted and quantised fine using the convert_hf_to_gguf.py script.
Your comments are beside the point. What I want is clear instructions for converting the model downloaded directly from Meta into the best possible, feature-complete Q8_0 gguf. I want to cut out the dodgy middlemen and make sure I know what I get.
To spell it out again: Hugging Face is not the primary source of the Llama weights, and there is no reason to trust the uploaders to have done the conversion optimally, especially when there are no clear instructions in this project on how it should be done.
I'm a bit confused, isn't https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct the official Meta repo? It already contains the model converted from PyTorch, so it can be converted further to gguf using the convert_hf_to_gguf.py script.
As final words on this neglected issue, let me summarize my findings.
To the commenter above: Hugging Face is not the official repository. You download the weights directly from Meta here: https://llama.meta.com/llama-downloads/
The next step is converting the weights to Hugging Face format using the transformers library inside a Python venv (this will download gigabytes of wheels...).
Create a virtual environment (as on the voorloop page):
python3 -m venv .venv
Activate the environment
source .venv/bin/activate
Install transformers and its dependencies (possibly also protobuf and a few other missing packages):
pip install transformers transformers[torch] tiktoken blobfile sentencepiece
Convert the weights to Hugging Face format:
python .venv/lib/python3.11/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir Meta-Llama-3.1-8B-Instruct/ --model_size 8B --output_dir hf --llama_version 3.1 --instruct True
After that, convert it to a 32-bit GGUF (to avoid any loss in fidelity):
python3 convert_hf_to_gguf.py --outtype f32 --outfile ../meta-llama-3.1-8B-instruction_f32.gguf ../hf/
Finally, quantize it to whatever fits your hardware. A good choice is 8bit:
llama.cpp-testing/llama-quantize meta-llama-3.1-8B-instruction_f32.gguf meta-llama-3.1-8B-instruction_Q8_0.gguf Q8_0
The above two steps generate gguf models with 291 tensors that seem to work with longer contexts (note that longer contexts need a lot more RAM or VRAM).
Note: I have not validated this approach and was hoping someone in the know could drop some official comments on how to correctly apply the process outlined here.
Anyway, it seems to work for me like this with no completely obvious flaws.
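For a rough check that long contexts actually behave, a needle-in-a-haystack style test against a local llama-server is the kind of thing I have in mind (a sketch; start the server with a large context, e.g. -c 32768, and adjust the port, padding size and needle to your setup):

```python
# Sketch: crude needle-in-a-haystack test against a local llama-server via its
# OpenAI-compatible endpoint. Port and padding size are assumptions — adjust.
import json
import urllib.request

needle = "The secret codeword is PERISCOPE."
padding = "The quick brown fox jumps over the lazy dog. " * 1000  # roughly 10k tokens
prompt = padding + needle + padding + "\nWhat is the secret codeword? Answer with one word."

body = json.dumps({
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 16,
}).encode()

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)["choices"][0]["message"]["content"]

print(reply)  # should mention PERISCOPE if the long context is handled correctly
```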
What I meant was whether the repo itself is the official representation of Meta on Hugging Face, which seems to be the case.
> Your comments are beside the point. What I want is clear instructions for converting the model downloaded directly from Meta into the best possible, feature-complete Q8_0 gguf. I want to cut out the dodgy middlemen and make sure I know what I get.
> To spell it out again: Hugging Face is not the primary source of the Llama weights, and there is no reason to trust the uploaders to have done the conversion optimally, especially when there are no clear instructions in this project on how it should be done.
Hi @scalvin1 - I'm VB from the open source team at Hugging Face. We're not a middleman: the weights uploaded in the meta-llama org are the official weights, converted together with Meta.
The steps you mention are exactly how Meta converted the weights as well. Everything works seamlessly now, but this required changes with respect to RoPE scaling; you can read more about it here: https://github.com/ggerganov/llama.cpp/issues/8650
Let me know if you have any questions! 🤗
Trying to go from downloading the raw Llama 3.1 weights from Meta to using them for inference in Python led me here. It was partly because I wanted to handle the download of the weights manually (rather than pass a repo name as a parameter), and partly because I wanted a better understanding of (and control over) the formats for fine-tuning.
Inspired by this thread and the resources linked here, I put together a guide for taking the raw .pth weights and getting inference running in a Python script with llama.cpp:
https://github.com/codiak/llama-raw-to-py
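For context, the Python end of it is roughly this (a sketch using the llama-cpp-python bindings; model path, context size and GPU offload are assumptions to adjust):

```python
# Sketch: minimal inference over the quantised GGUF with llama-cpp-python
# (pip install llama-cpp-python). Paths and sizes are assumptions — adjust.
from llama_cpp import Llama

llm = Llama(
    model_path="meta-llama-3.1-8B-instruction_Q8_0.gguf",
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload as many layers as possible to the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise llama.cpp in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```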
There is not much to add anymore, it all seems to be working the way it was described. Closing issue.
Thank you for this reply, I was looking for this. I have a question: I have 16 GB RAM and 6 GB VRAM (NVIDIA GeForce RTX 4050).
When I try this command:
python ~/.local/lib/python3.10/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ../../Meta-Llama-3.1-8B-Instruct/original --model_size 8B --output_dir hf --llama_version 3.1 --instruct True
I get a "Killed" error. Is there a workaround for this? Also, the map location is hard-coded as cpu in the convert script.