
Add llama 3.1 rope scaling factors to llama conversion and inference

Open jmorganca opened this issue 1 year ago • 14 comments

Hi all, this commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the ggml_rope_ext rope operation.

From our testing, this really improves results for context windows above 8192 for Llama 3.1.

This should fix https://github.com/ggerganov/llama.cpp/issues/8650
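
Roughly, the idea is that the conversion script stores one scaling factor per rotary dimension pair, and at inference each pair's base frequency is divided by its factor, stretching the low-frequency dimensions so the model stays coherent at longer contexts. A minimal sketch of that relationship, assuming Llama 3.1 8B's head_dim = 128 and rope_theta = 500000 (illustrative values only):

import numpy as np

# Assumed Llama 3.1 8B values: head_dim = 128, rope_theta = 500000
dim, base = 128, 500000.0
inv_freq = 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float32) / dim))

# rope_factors would come from the rope_freqs tensor written at conversion time;
# all-ones is plain RoPE, values > 1 slow down the corresponding dimensions.
rope_factors = np.ones_like(inv_freq)
scaled_inv_freq = inv_freq / rope_factors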

jmorganca avatar Jul 24 '24 19:07 jmorganca

Marking this as a draft, as there may also be a change required to the rope kernels – investigating!

jmorganca avatar Jul 24 '24 20:07 jmorganca

It's probably not a good idea to change the kernels. I just tested creating the factors instead, as in my comment above. Here are the changes on top of your PR:

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 4422948f..897a6e33 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -1518,7 +1518,7 @@ class LlamaModel(Model):
             if rope_scaling.get("rope_type", '').lower() == "llama3":
                 base = hparams.get("rope_theta", 10000.0)
                 dim = int((hparams["hidden_size"] // hparams["num_attention_heads"]) * hparams.get("partial_rotary_embeddings", 1.0))
-                inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+                freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
 
                 factor = rope_scaling.get("factor", 8.0)
                 low_freq_factor = rope_scaling.get("low_freq_factor", 1.0)
@@ -1528,20 +1528,20 @@ class LlamaModel(Model):
                 low_freq_wavelen = old_context_len / low_freq_factor
                 high_freq_wavelen = old_context_len / high_freq_factor
 
-                rope_freqs = []
-                for freq in inv_freq:
+                rope_factors = []
+                for freq in freqs:
                     wavelen = 2 * math.pi / freq
                     if wavelen < high_freq_wavelen:
-                        rope_freqs.append(freq)
+                        rope_factors.append(1)
                     elif wavelen > low_freq_wavelen:
-                        rope_freqs.append(freq / factor)
+                        rope_factors.append(factor)
                     else:
                         assert low_freq_wavelen != high_freq_wavelen
                         smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
-                        rope_freqs.append((1 - smooth) * freq / factor + smooth * freq)
+                        rope_factors.append(1 / ((1 - smooth) / factor + smooth))
 
                 self.gguf_writer.add_rope_scaling_attn_factors(1.0)
-                self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", 1.0 / np.array(rope_freqs))
+                self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", np.array(rope_factors, dtype = np.float32))
 
     @staticmethod
     def permute(weights: Tensor, n_head: int, n_head_kv: int | None):

A quick test shows perplexity is slightly better than the master branch ([1]4.6850,[2]5.9231):

perplexity: calculating perplexity over 17 chunks, n_ctx=16384, batch_size=2048, n_seq=1
perplexity: 191.29 seconds per pass - ETA 54.18 minutes
[1]4.3272,[2]5.5469,^C
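
For reference, the factor computation above pulled into a standalone sketch (defaults assume the Llama 3.1 HF config, with old_context_len standing in for original_max_position_embeddings):

import math
import numpy as np

def llama3_rope_factors(hidden_size: int = 4096, num_attention_heads: int = 32,
                        base: float = 500000.0, factor: float = 8.0,
                        low_freq_factor: float = 1.0, high_freq_factor: float = 4.0,
                        old_context_len: int = 8192) -> np.ndarray:
    dim = hidden_size // num_attention_heads
    freqs = 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float32) / dim))

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    rope_factors = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            rope_factors.append(1.0)        # high-frequency dims: unchanged
        elif wavelen > low_freq_wavelen:
            rope_factors.append(factor)     # low-frequency dims: full scaling
        else:
            smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            rope_factors.append(1.0 / ((1.0 - smooth) / factor + smooth))  # smooth interpolation
    return np.array(rope_factors, dtype=np.float32)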

jxy avatar Jul 24 '24 22:07 jxy

It should be

1 / ((1 - smooth) / factor + smooth)
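
Meta's reference code scales each frequency to new_freq = (1 - smooth) * freq / factor + smooth * freq, while the diff above has llama.cpp divide by the stored factor at inference, so the stored value has to be f = 1 / ((1 - smooth) / factor + smooth). A quick numeric check of that identity, with arbitrary example values:

factor, smooth, freq = 8.0, 0.37, 0.0123                  # arbitrary example values
new_freq = (1 - smooth) * freq / factor + smooth * freq   # Meta reference scaling
f = 1 / ((1 - smooth) / factor + smooth)                  # stored rope factor
assert abs(freq / f - new_freq) < 1e-12                   # freq / f reproduces new_freq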

jxy avatar Jul 24 '24 22:07 jxy

Awesome, thanks @jxy. It should be updated now. Agreed, it's much better without the kernel changes.

jmorganca avatar Jul 24 '24 22:07 jmorganca

Nice. I tested a bunch of summaries using the entire 128k context and the output looks good, whereas on master it produces broken garbage.

kallewoof avatar Jul 25 '24 07:07 kallewoof

Initial tests show improvement over master (the model is able to reason within a bigger context - I tested up to -c 32768), but I've noticed one potential problem - a GGUF made with this PR has n_tensors = 292 instead of 291 on master (an extra rope_freqs.weight). It might cause errors in some apps that check it (like koboldcpp):

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "koboldcpp.py", line 4208, in <module>
    main(parser.parse_args(),start_server=True)
  File "koboldcpp.py", line 3883, in main
    loadok = load_model(modelname)
  File "koboldcpp.py", line 773, in load_model
    ret = handle.load_model(inputs)
OSError: exception: access violation reading 0x0000000000001894
[21452] Failed to execute script 'koboldcpp' due to unhandled exception!

FYI @LostRuins

MoonRide303 avatar Jul 25 '24 12:07 MoonRide303

Initial tests show improvement over master (the model is able to reason within a bigger context - I tested up to -c 32768), but I've noticed one potential problem - a GGUF made with this PR has n_tensors = 292 instead of 291 on master (an extra rope_freqs.weight). It might cause errors in some apps that check it (like koboldcpp):

Isn't this just a side effect of kobold.cpp using a different llama.cpp version internally, which would need to be updated anyway to take advantage of these changes?

tristandruyen avatar Jul 25 '24 13:07 tristandruyen

@MoonRide303: I merged this PR into Kobold.cpp Frankenstein this morning (v171013), and it now works (coherent inference) without problems beyond 8k context.

Nexesenex avatar Jul 25 '24 13:07 Nexesenex

Initial tests show improvement over master (the model is able to reason within a bigger context - I tested up to -c 32768), but I've noticed one potential problem - a GGUF made with this PR has n_tensors = 292 instead of 291 on master (an extra rope_freqs.weight). It might cause errors in some apps that check it (like koboldcpp):

Isn't this just a side effect of kobold.cpp using a different llama.cpp version internally, which would need to be updated anyway to take advantage of these changes?

I tested it on koboldcpp 1.71 (released today, with upstream changes merged yesterday), but you might be right that some changes from the original llama.cpp are missing (or some extra changes might be problematic). I am not sure if it's something to worry about on the llama.cpp project side; I'm just pointing it out as a potential side effect - for some people, new GGUFs might not work.

MoonRide303 avatar Jul 25 '24 13:07 MoonRide303

I wasn't aware that this PR breaks previously quantized Llama 3.1 models; I thought that all prior models would continue to work. Shouldn't the tensor count always match what's declared in the metadata regardless?

Anyway, I'll do a patch release once this is merged into master.

LostRuins avatar Jul 25 '24 13:07 LostRuins

I wasn't aware that this PR breaks previously quantized Llama 3.1 models; I thought that all prior models would continue to work. Shouldn't the tensor count always match what's declared in the metadata regardless?

Anyway, I'll do a patch release once this is merged into master.

Old GGUFs will continue to work; it's just newly converted models (with this PR's changes) that might not work with older llama.cpp-based apps (like koboldcpp 1.71).

MoonRide303 avatar Jul 25 '24 13:07 MoonRide303

Yeah, but that's my point - it's failing to load because the tensor count doesn't match the metadata, which it should in both cases, unless I misunderstand the error.

LostRuins avatar Jul 25 '24 13:07 LostRuins

Merging with koboldcpp concedo_experimental works fine on my end, FWIW.

kallewoof avatar Jul 25 '24 13:07 kallewoof

Why couldn't this tensor be added by llama.cpp when loading? Superficially, it doesn't make much sense to bake the rope config into the model at conversion time, and it prevents bug fixes later on.

schmorp avatar Jul 25 '24 13:07 schmorp

Thanks! Given the hard check on the tensor count at load time, it seems we might need to instead add the rope parameters as metadata and calculate them at runtime. I definitely did not want to break backwards compatibility with old GGUF files with this change, sorry!

jmorganca avatar Jul 25 '24 15:07 jmorganca

Thank you for looking into this. Btw, even if it is stored in the metadata and the rope freqs are computed at runtime, it will still not be backwards compatible, because the old GGUF files will not have the metadata.

ggerganov avatar Jul 25 '24 16:07 ggerganov

Storing the information as metadata and computing it at runtime seems preferable to me: it seems more correct for the tensor count to match what you get when loading the model in HF transformers, for example (and it reduces potential confusion and incorrect assumptions based on a mismatched tensor count). One could bake in defaults for when the metadata is missing from the GGUF - that would make it backwards compatible, if desired.

gilbertgong avatar Jul 25 '24 19:07 gilbertgong

Thank you for your effort on this; we are all waiting patiently for the rope work to be finished! We have some great functionality in GPT4All with this new model!

3Simplex avatar Jul 25 '24 22:07 3Simplex

@ggerganov do you know if there is a preferred way to create tensors from scalar values (vs. using ml.create_tensor)? I'm working on an iteration of this PR that does this (vs. storing it as a tensor in the GGUF file).

jmorganca avatar Jul 26 '24 01:07 jmorganca

I tried testing this (building off this branch and making a fresh Q8 GGUF with the resulting convert script) against simply using --rope-freq-base 8000000 with the unpatched binary plus just the change to llama-bpe (as suggested in the linked thread). Subjectively, simply using --rope-freq-base 8000000 seemed to give higher overall quality. Have others had good results with this patch?

gilbertgong avatar Jul 26 '24 06:07 gilbertgong

@jmorganca Creation of tensors at runtime hasn't been needed in the past, so we haven't considered it yet. I think it would require creating additional contexts for runtime-computed tensors for each layer (see llama_kv_cache_init for something similar). Then the runtime data has to be computed on the CPU and applied to the tensors with ggml_backend_tensor_set.

IMO it would be much simpler to keep the frequency computation logic in the Python code and proceed with the current PR as it is (+ addressing @Galunid's and @compilade's comments). This way it is also more in line with the Phi3 long/short rope frequency logic. I don't think there are significant advantages to computing the rope freqs at runtime from the metadata; moreover, both approaches are backwards incompatible, so new GGUFs will have to be created either way.

Given the hard check on the tensor count at load time

Not sure which logic that is; can you clarify?

ggerganov avatar Jul 26 '24 09:07 ggerganov

@ggerganov okay, thanks. I can address the comments to unblock the PR.

This is the check https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L4142 that seems to cause the error:

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291

Do you know if there's a way we might be able to work around that somehow? The thinking behind computing them at runtime is that old clients would continue to run the new models (albeit not well with n_ctx > 8192).

jmorganca avatar Jul 26 '24 09:07 jmorganca

The thinking behind computing them at runtime is that old clients would continue to run the new models (albeit not well with n_ctx > 8192).

The proposed changes to the RoPE also affect n_ctx < 8192. So even if old clients were able to use the new models, the generations would be technically incorrect even for shorter contexts, because of the missing factors. Would this be desired?

But regardless, I can't think of a simple workaround that would enable this

ggerganov avatar Jul 26 '24 09:07 ggerganov

I'd rather see the old Llama 3.1 GGUFs break outright than risk users downloading an old model with degraded quality and getting a bad impression. A program ~~crash~~ error+exit will cue the user to investigate what's wrong. Degraded performance might just make the user think the software/model is of low quality.

m18coppola avatar Jul 26 '24 13:07 m18coppola

A program crash will cue the user to investigate what's wrong. Degraded performance might just make the user think the software/model is of low quality.

Exactly! I see no reason to make it backward compatible.

qnixsynapse avatar Jul 26 '24 14:07 qnixsynapse

Crashing outright is bad; a better result would be a warning printed to stdout. It's always better to fail gracefully when possible.

Sort of like how we did this: https://github.com/ggerganov/llama.cpp/blob/01245f5b1629075543bc4478418c7d72a0b4b3c7/src/llama.cpp#L5344-L5352

LostRuins avatar Jul 26 '24 14:07 LostRuins

It doesn't crash, it causes llama_load_model_from_file to return an error.

slaren avatar Jul 26 '24 14:07 slaren

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.
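
Below is a minimal sketch of what such a fallback could look like (hypothetical names, and Python pseudocode rather than the real C++ loader; it only illustrates the control flow):

def get_rope_factors(tensors: dict, n_rot: int):
    # Prefer the per-dimension factors stored in the GGUF, and fall back to
    # all-ones (equivalent to the old RoPE calculation) when the tensor is absent.
    factors = tensors.get("rope_freqs.weight")
    if factors is None:
        return [1.0] * (n_rot // 2)  # no per-dimension scaling -> plain RoPE
    return factors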

gilbertgong avatar Jul 26 '24 21:07 gilbertgong

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

We need more of this mentality around here. There's usually a way.

oldgithubman avatar Jul 26 '24 21:07 oldgithubman

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

If so, I think there should be a warning similar to what @LostRuins mentioned.

m18coppola avatar Jul 26 '24 21:07 m18coppola