NTK RoPE scaling.
According to this post, this is a method of RoPE scaling that results in less perplexity loss and allows a larger scaling factor: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
The code can be found in this notebook: https://colab.research.google.com/drive/1VI2nhlyKvd5cw4-zHvAIk00cAVj2lCCC#scrollTo=b80b3f37
And the change itself seems to be small:
# The method is just these three lines
max_position_embeddings = 16384
a = 8  # Alpha value
base = base * a ** (dim / (dim - 2))  # Base change formula
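To make the effect of that base change concrete, here is a minimal sketch (plain Python; dim, base and alpha values are illustrative assumptions, not code from the notebook) of how the adjusted base flows into the standard RoPE inverse-frequency computation:

# Minimal sketch of the NTK base change applied to RoPE frequencies.
# Values here (dim, base, alpha) are illustrative assumptions.
dim = 128                    # head dimension
base = 10000.0               # standard RoPE base
alpha = 8                    # the "a" from the snippet above
ntk_base = base * alpha ** (dim / (dim - 2))   # base change formula

def inv_freqs(b, d):
    # Standard RoPE: one frequency per pair of dimensions, 1 / b^(i/d)
    return [1.0 / (b ** (i / d)) for i in range(0, d, 2)]

plain = inv_freqs(base, dim)
ntk = inv_freqs(ntk_base, dim)
# The highest frequency is untouched (ratio 1.0) while the lowest is
# stretched by roughly the full alpha factor, which is why short-range
# behaviour is mostly preserved while the usable context grows.
print(plain[0] / ntk[0], plain[-1] / ntk[-1])   # ~1.0 and ~alpha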
Maybe it would be nice to add that option to exllama as well; with this technique, finetuning for higher context may not even be necessary.
This sounds pretty good! But I'm wondering how it would be implemented in exllama, since compress_pos_emb is already a RoPE scaler.
There's rotary_embedding_base, but it seems to be used for training purposes.
@Panchovix Someone posted this code on 4chan. I haven't had time to verify it as I'm on the move, but maybe that's it: https://boards.4chan.org/g/thread/94354163#p94356720
@alkeryn Thanks! It seems to work.
a = 4  # Similar to RoPE scaling: higher is more perplexity, but more ctx
self.rotary_embedding_base = self.rotary_embedding_base * a ** (self.head_dim / (self.head_dim - 2))
max_seq_len should be set the same way as you do with SuperHOT models (via -l). Maybe it can be set like this in model.py:
self.alpha_value = 1.0  # Similar to RoPE scaling: higher is more perplexity, but more ctx
And like this in model_init.py:
parser.add_argument("-a", "--alpha", type = float, help = "alpha for context size extension via embedding extension")
...
if args.alpha:
    model_config.alpha_value = args.alpha  # not exactly like this, but with this logic
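A rough, hypothetical sketch of how those pieces might fit together (class and attribute names are illustrative, not the actual exllama code), reusing the base-change formula from the snippet above:

# Hypothetical sketch: fold alpha_value into the rotary embedding base.
# Attribute names mirror the proposal above but are illustrative only.
class ConfigSketch:
    def __init__(self, head_dim = 128):
        self.head_dim = head_dim
        self.rotary_embedding_base = 10000.0
        self.alpha_value = 1.0   # 1.0 = no NTK scaling

    def apply_ntk_scaling(self):
        a = self.alpha_value
        self.rotary_embedding_base *= a ** (self.head_dim / (self.head_dim - 2))

# e.g. set model_config.alpha_value = args.alpha, then apply the scaling
# once before the rotary embeddings are built.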
Okay, I did an experimental PR to see if turbo wants to add it, or maybe to test it another way.
https://github.com/turboderp/exllama/pull/118
I'd like to see some results from finetuning before I go and add even more config options. If I built out ExLlama every time someone had an interesting idea on reddit it'd be an unmaintainable behemoth by now. It's already kind of unwieldy.
So to use this feature, should we first tune the model with LoRA or something similar? Since exllama does not support tuning right now, should I first use an AutoGPTQ LoRA?
@laoda513 For NTK RoPE scaling, finetuning is not needed. But based on my tests, SuperHOT models work better with both RoPE scaling + comb scaling.
For now, no loader supports NTK RoPE.
That PR adds experimental support only for exllama at the moment.
@Panchovix I don't quite understand how it would work better with RoPE + comb scaling, but that's interesting. So you put 4 for each? Though I think once we have comb finetunes, they'll probably outperform SuperHOT + RoPE scaling, or even the mix of both. Still, being able to use any model at any context length without a finetune is already great!
I have tested the change and get better results with compression at 4 and alpha at 4.
Using TheBloke_nous-hermes-13b-superhot-8k-GPTQ-4bit-128g, if I only have either compression or NTK RoPE enabled, it tells me it cannot find the secret messages I left embedded in the paper, but with alpha 4 and compression at 4 it retrieves them correctly.
@ottobunge Interesting, have you tried alpha 8 or more with no compression on a normal model? It would still be interesting to see finetunes made for NTK.
At 8k on Neko Institute LLaMA 13B 4bit 32g, with alpha 8 and compression 1, I get nonsense.
Trying alpha 10, and then alpha 4 with compression 4, on the same model to see the differences.
Alpha 10
The failure mode is worse at compression 4 / alpha 4 on plain LLaMA. This model is probably not great at the task xD
@ottobunge That makes sense, since the model was trained for 8k RoPE. But I was asking about alpha 8 on a non-8k-finetuned model with no compression.
That would be this https://github.com/turboderp/exllama/issues/115#issuecomment-1614311067
I'm downloading a non-fine-tuned version, but on the fine-tuned one I can run no compression at alpha 10 and get good results.
In fact, it follows the formatting of the prompt better than compression 4 / alpha 4.
With TheBloke_airoboros-13B-gpt4-1.4-GPTQ, a non-fine-tuned model, at alpha 10 it got 3 of the 4 passphrases, but in the wrong order.
The correct order is in the second image.
The best answer I got was like this.
If I shift the proportion more toward one or the other, it starts by misspelling milkshake as milshake, or fails altogether if I change the proportion too much, guessing cherry as the 4th, banana as the third, and missing milkshake entirely.
I have updated the PR.
Before, the alpha value wasn't being applied correctly (it stayed at 1.0). Now it is applied correctly, so setting alpha alone is enough for NTK RoPE scaling (without needing to set compress_pos_emb to the same value).
@ottobunge @alkeryn Can you guys test and see how it goes now? Results are WAY different, and IMO, better.
For tulu-30B-GPTQ (non-SuperHOT):
- Perplexity at 2048 ctx (no compress_pos_emb, no alpha RoPE): 5.2153
- Perplexity at 8192 ctx, compress_pos_emb = 4: 10.0813
- Perplexity at 8192 ctx, alpha = 4: 5.3534
- Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 15.4406
For Tulu-30B-SuperHOT-8K-4bit-32g:
- Perplexity at 2048 ctx (compress_pos_emb = 1, no alpha RoPE): 53.2788 (Basically, for <2048 ctx don't use SuperHOT models)
- Perplexity at 8192 ctx, compress_pos_emb = 4: 5.8166
- Perplexity at 8192 ctx, alpha = 4: 7.5073
- Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 6.0903
Basically, it seems that NTK RoPE scaling is better than we expected.
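For reference on how numbers like these are produced: perplexity is just the exponential of the average next-token negative log-likelihood over the evaluation window. A generic sketch (plain PyTorch, not exllama's actual perplexity test harness):

import torch

def perplexity(logits, target_ids):
    # logits: [seq_len, vocab] model outputs; target_ids: [seq_len] next-token ids
    log_probs = torch.log_softmax(logits.float(), dim = -1)
    nll = -log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    return torch.exp(nll.mean()).item()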
How about the memory cost increase for inference and training? Is it linear? For example, 1x for 2k and 2x for 4k.
And I think this is very exciting and interesting! When I think about it more, if we can easily extend a model trained at 2k to 8k, does that mean we can extend a model trained at 512 to 2k? And I think this doesn't really extend the 'attention'; it just spreads the same amount of attention over a longer context, right? It's kind of like a human reading quickly.
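For the inference side, the memory that grows with context is mostly the KV cache, and that part is linear in sequence length (attention compute, without fused kernels, grows quadratically). A rough back-of-the-envelope sketch, using illustrative LLaMA-13B-ish shapes (assumed values, not measurements from this thread):

# Rough KV-cache size estimate: linear in sequence length.
# Shapes below are illustrative (roughly LLaMA-13B), not measured values.
n_layers = 40
n_heads = 40
head_dim = 128
bytes_per_elem = 2          # fp16 cache

def kv_cache_bytes(seq_len):
    # 2 tensors (K and V) per layer, each [seq_len, n_heads, head_dim]
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

for ctx in (2048, 4096, 8192):
    print(ctx, kv_cache_bytes(ctx) / 2**30, "GiB")
# 2k -> ~1.6 GiB, 4k -> ~3.1 GiB, 8k -> ~6.3 GiB: doubling ctx doubles the cache.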
For training itself, sadly I'm not sure how it would be applied :(.
Also, thanks turbo for the PR merge!
Now NTK RoPE scaling can be used on exllama.
Thank you everyone, I'm closing the issue! :)