
Add llama 3.1 rope scaling factors to llama conversion and inference

Open jmorganca opened this issue 1 year ago • 14 comments

Hi all, this commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the ggml_rope_ext rope operation.

From our testing, this really improves results for context windows above 8192 for Llama 3.1.

This should fix https://github.com/ggerganov/llama.cpp/issues/8650
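
Roughly, the idea is that the conversion script stores one scaling factor per rotary dimension pair, and at inference each pair's base frequency is divided by its factor, stretching the low-frequency dimensions so the model stays coherent at longer contexts. A minimal sketch of that relationship, assuming Llama 3.1 8B's head_dim = 128 and rope_theta = 500000 (illustrative values only):

import numpy as np

# Assumed Llama 3.1 8B values: head_dim = 128, rope_theta = 500000
dim, base = 128, 500000.0
inv_freq = 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float32) / dim))

# rope_factors would come from the rope_freqs tensor written at conversion time;
# all-ones is plain RoPE, values > 1 slow down the corresponding dimensions.
rope_factors = np.ones_like(inv_freq)
scaled_inv_freq = inv_freq / rope_factors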

jmorganca avatar Jul 24 '24 19:07 jmorganca

Marking this as a draft, as there may also be a change required to the rope kernels – investigating!

jmorganca avatar Jul 24 '24 20:07 jmorganca

It's probably not a good idea to change the kernels. I just tested creating the factors instead, as in my comment above. Here are the changes on top of your PR:

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 4422948f..897a6e33 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -1518,7 +1518,7 @@ class LlamaModel(Model):
             if rope_scaling.get("rope_type", '').lower() == "llama3":
                 base = hparams.get("rope_theta", 10000.0)
                 dim = int((hparams["hidden_size"] // hparams["num_attention_heads"]) * hparams.get("partial_rotary_embeddings", 1.0))
-                inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+                freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
 
                 factor = rope_scaling.get("factor", 8.0)
                 low_freq_factor = rope_scaling.get("low_freq_factor", 1.0)
@@ -1528,20 +1528,20 @@ class LlamaModel(Model):
                 low_freq_wavelen = old_context_len / low_freq_factor
                 high_freq_wavelen = old_context_len / high_freq_factor
 
-                rope_freqs = []
-                for freq in inv_freq:
+                rope_factors = []
+                for freq in freqs:
                     wavelen = 2 * math.pi / freq
                     if wavelen < high_freq_wavelen:
-                        rope_freqs.append(freq)
+                        rope_factors.append(1)
                     elif wavelen > low_freq_wavelen:
-                        rope_freqs.append(freq / factor)
+                        rope_factors.append(factor)
                     else:
                         assert low_freq_wavelen != high_freq_wavelen
                         smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
-                        rope_freqs.append((1 - smooth) * freq / factor + smooth * freq)
+                        rope_factors.append(1 / ((1 - smooth) / factor + smooth))
 
                 self.gguf_writer.add_rope_scaling_attn_factors(1.0)
-                self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", 1.0 / np.array(rope_freqs))
+                self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", np.array(rope_factors, dtype = np.float32))
 
     @staticmethod
     def permute(weights: Tensor, n_head: int, n_head_kv: int | None):

A quick test shows perplexity is slightly better than the master branch ([1]4.6850,[2]5.9231):

perplexity: calculating perplexity over 17 chunks, n_ctx=16384, batch_size=2048, n_seq=1
perplexity: 191.29 seconds per pass - ETA 54.18 minutes
[1]4.3272,[2]5.5469,^C
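
For reference, the factor computation above pulled into a standalone sketch (defaults assume the Llama 3.1 HF config, with old_context_len standing in for original_max_position_embeddings):

import math
import numpy as np

def llama3_rope_factors(hidden_size: int = 4096, num_attention_heads: int = 32,
                        base: float = 500000.0, factor: float = 8.0,
                        low_freq_factor: float = 1.0, high_freq_factor: float = 4.0,
                        old_context_len: int = 8192) -> np.ndarray:
    dim = hidden_size // num_attention_heads
    freqs = 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float32) / dim))

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    rope_factors = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            rope_factors.append(1.0)        # high-frequency dims: unchanged
        elif wavelen > low_freq_wavelen:
            rope_factors.append(factor)     # low-frequency dims: full scaling
        else:
            smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            rope_factors.append(1.0 / ((1.0 - smooth) / factor + smooth))  # smooth interpolation
    return np.array(rope_factors, dtype=np.float32)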

jxy avatar Jul 24 '24 22:07 jxy

It should be

1 / ((1 - smooth) / factor + smooth)
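
Meta's reference code scales each frequency to new_freq = (1 - smooth) * freq / factor + smooth * freq, while the diff above has llama.cpp divide by the stored factor at inference, so the stored value has to be f = 1 / ((1 - smooth) / factor + smooth). A quick numeric check of that identity, with arbitrary example values:

factor, smooth, freq = 8.0, 0.37, 0.0123                  # arbitrary example values
new_freq = (1 - smooth) * freq / factor + smooth * freq   # Meta reference scaling
f = 1 / ((1 - smooth) / factor + smooth)                  # stored rope factor
assert abs(freq / f - new_freq) < 1e-12                   # freq / f reproduces new_freq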

jxy avatar Jul 24 '24 22:07 jxy

Awesome, thanks @jxy. It should be updated now. Agreed, it's much better without the kernel changes.

jmorganca avatar Jul 24 '24 22:07 jmorganca

Nice. I tested a bunch of summaries using the entire 128k context and the output looks good, whereas on master it produces broken garbage.

kallewoof avatar Jul 25 '24 07:07 kallewoof

Initial tests show improvement over master (the model is able to reason within a bigger context - I tested up to -c 32768), but I've noticed one potential problem - a GGUF made with this PR has n_tensors = 292 instead of 291 on master (an extra rope_freqs.weight). It might cause errors in some apps that check it (like koboldcpp):

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "koboldcpp.py", line 4208, in <module>
    main(parser.parse_args(),start_server=True)
  File "koboldcpp.py", line 3883, in main
    loadok = load_model(modelname)
  File "koboldcpp.py", line 773, in load_model
    ret = handle.load_model(inputs)
OSError: exception: access violation reading 0x0000000000001894
[21452] Failed to execute script 'koboldcpp' due to unhandled exception!

FYI @LostRuins

MoonRide303 avatar Jul 25 '24 12:07 MoonRide303

Initial tests show improvement over master (the model is able to reason within a bigger context - I tested up to -c 32768), but I've noticed one potential problem - a GGUF made with this PR has n_tensors = 292 instead of 291 on master (an extra rope_freqs.weight). It might cause errors in some apps that check it (like koboldcpp):

Isn't this just a side effect of kobold.cpp using a different llama.cpp version internally, which would need to be updated anyway to take advantage of these changes?

tristandruyen avatar Jul 25 '24 13:07 tristandruyen

@MoonRide303: I merged this PR into Kobold.cpp Frankenstein this morning (v171013), and it now works (coherent inference) without problems beyond 8k context.

Nexesenex avatar Jul 25 '24 13:07 Nexesenex

Initial tests show improvement over master (the model is able to reason within a bigger context - I tested up to -c 32768), but I've noticed one potential problem - a GGUF made with this PR has n_tensors = 292 instead of 291 on master (an extra rope_freqs.weight). It might cause errors in some apps that check it (like koboldcpp):

Isn't this just a side effect of kobold.cpp using a different llama.cpp version internally, which would need to be updated anyway to take advantage of these changes?

I tested it on koboldcpp 1.71 (released today, with upstream changes merged yesterday), but you might be right that some changes from the original llama.cpp are missing (or some extra changes might be problematic). I am not sure if it's something to worry about on the llama.cpp project side; I'm just pointing it out as a potential side effect - for some people, new GGUFs might not work.

MoonRide303 avatar Jul 25 '24 13:07 MoonRide303

I wasn't aware that this PR breaks previously quantized Llama 3.1 models; I thought that all prior models would continue to work. Shouldn't the tensor count always match what's declared in the metadata regardless?

Anyway, I'll do a patch release once this is merged into master.

LostRuins avatar Jul 25 '24 13:07 LostRuins

I wasn't aware that this PR breaks previously quantized Llama 3.1 models; I thought that all prior models would continue to work. Shouldn't the tensor count always match what's declared in the metadata regardless?

Anyway, I'll do a patch release once this is merged into master.

Old GGUFs will continue to work; it's just newly converted models (with this PR's changes) that might not work with older llama.cpp-based apps (like koboldcpp 1.71).

MoonRide303 avatar Jul 25 '24 13:07 MoonRide303

Yeah, but that's my point - it's failing to load because the tensor count doesn't match the metadata, which it should in both cases, unless I misunderstand the error.

LostRuins avatar Jul 25 '24 13:07 LostRuins

Merging with koboldcpp concedo_experimental works fine on my end, FWIW.

kallewoof avatar Jul 25 '24 13:07 kallewoof

Why couldn't this tensor be added by llama.cpp when loading? Superficially, it doesn't make much sense to bake the rope config into the model at conversion time, and it prevents bug fixes later on.

schmorp avatar Jul 25 '24 13:07 schmorp

Thanks! Given the hard check on the tensor count at load time, it seems we might need to instead add the rope parameters as metadata and calculate them at runtime. I definitely did not want to break backwards compatibility with old GGUF files with this change, sorry!

jmorganca avatar Jul 25 '24 15:07 jmorganca

Thank you for looking into this. Btw, even if it is stored in the metadata and the rope freqs are computed at runtime, it will still not be backwards compatible, because the old GGUF files will not have the metadata.

ggerganov avatar Jul 25 '24 16:07 ggerganov

Storing the information as metadata and computing it at runtime seems preferable to me: it seems more correct for the tensor count to match what you get when loading the model in HF transformers, for example (and it reduces potential confusion and incorrect assumptions based on a mismatched tensor count). One could bake in defaults for when the metadata is missing from the GGUF - that would make it backwards compatible, if desired.

gilbertgong avatar Jul 25 '24 19:07 gilbertgong

Thank you for your effort on this; we are all waiting patiently for the rope work to be finished! We have some great functionality in GPT4All with this new model!

3Simplex avatar Jul 25 '24 22:07 3Simplex

@ggerganov do you know if there is a preferred way to create tensors from scalar values (vs. using ml.create_tensor)? I'm working on an iteration of this PR that does this (vs. storing it as a tensor in the GGUF file).

jmorganca avatar Jul 26 '24 01:07 jmorganca

I tried testing this (building off this branch and making a fresh Q8 GGUF with the resulting convert script) against simply using --rope-freq-base 8000000 with the unpatched binary plus just the change to llama-bpe (as suggested in the linked thread). Subjectively, simply using --rope-freq-base 8000000 seemed to give higher overall quality. Have others had good results with this patch?

gilbertgong avatar Jul 26 '24 06:07 gilbertgong

@jmorganca Creation of tensors at runtime hasn't been needed in the past, so we haven't considered it yet. I think it would require creating additional contexts for runtime-computed tensors for each layer (see llama_kv_cache_init for something similar). Then the runtime data has to be computed on the CPU and applied to the tensors with ggml_backend_tensor_set.

IMO it would be much simpler to keep the frequency computation logic in the Python code and proceed with the current PR as it is (+ addressing @Galunid's and @compilade's comments). This way it is also more in line with the Phi3 long/short rope frequency logic. I don't think there are significant advantages to computing the rope freqs at runtime from the metadata; moreover, both approaches are backwards incompatible, so new GGUFs will have to be created either way.

Given the hard check on the tensor count at load time

Not sure which logic that is; can you clarify?

ggerganov avatar Jul 26 '24 09:07 ggerganov

@ggerganov okay, thanks. I can address the comments to unblock the PR.

This is the check https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L4142 that seems to cause the error:

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291

Do you know if there's a way we might be able to work around that somehow? The thinking behind computing them at runtime is that old clients would continue to run the new models (albeit not well with n_ctx > 8192).

jmorganca avatar Jul 26 '24 09:07 jmorganca

The thinking behind computing them at runtime is that old clients would continue to run the new models (albeit not well with n_ctx > 8192).

The proposed changes to the RoPE also affect n_ctx < 8192. So even if old clients were able to use the new models, the generations would be technically incorrect even for shorter contexts, because of the missing factors. Would this be desired?

But regardless, I can't think of a simple workaround that would enable this

ggerganov avatar Jul 26 '24 09:07 ggerganov

I'd rather see the old Llama 3.1 GGUFs break outright than risk users downloading an old model with degraded quality and getting a bad impression. A program ~~crash~~ error+exit will cue the user to investigate what's wrong. Degraded performance might just make the user think the software/model is of low quality.

m18coppola avatar Jul 26 '24 13:07 m18coppola

A program crash will cue the user to investigate what's wrong. Degraded performance might just make the user think the software/model is of low quality.

Exactly! I see no reason to make it backward compatible.

qnixsynapse avatar Jul 26 '24 14:07 qnixsynapse

Crashing outright is bad; a better result would be a warning printed to stdout. It's always better to fail gracefully when possible.

Sort of like how we did this: https://github.com/ggerganov/llama.cpp/blob/01245f5b1629075543bc4478418c7d72a0b4b3c7/src/llama.cpp#L5344-L5352

LostRuins avatar Jul 26 '24 14:07 LostRuins

It doesn't crash, it causes llama_load_model_from_file to return an error.

slaren avatar Jul 26 '24 14:07 slaren

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.
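
Below is a minimal sketch of what such a fallback could look like (hypothetical names, and Python pseudocode rather than the real C++ loader; it only illustrates the control flow):

def get_rope_factors(tensors: dict, n_rot: int):
    # Prefer the per-dimension factors stored in the GGUF, and fall back to
    # all-ones (equivalent to the old RoPE calculation) when the tensor is absent.
    factors = tensors.get("rope_freqs.weight")
    if factors is None:
        return [1.0] * (n_rot // 2)  # no per-dimension scaling -> plain RoPE
    return factors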

gilbertgong avatar Jul 26 '24 21:07 gilbertgong

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

We need more of this mentality around here. There's usually a way.

oldgithubman avatar Jul 26 '24 21:07 oldgithubman

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

If so, I think there should be a warning similar to what @LostRuins mentioned.

m18coppola avatar Jul 26 '24 21:07 m18coppola