NTK RoPE scaling.
According to this post, this is a method of RoPE scaling that results in less perplexity loss and allows a larger scaling factor: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
The code can be found in this notebook: https://colab.research.google.com/drive/1VI2nhlyKvd5cw4-zHvAIk00cAVj2lCCC#scrollTo=b80b3f37
And the change itself seems to be small:
# The method is just these three lines
max_position_embeddings = 16384
a = 8  # Alpha value
base = base * a ** (dim / (dim - 2))  # Base change formula
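To make the effect of that base change concrete, here is a minimal sketch (plain Python; dim, base and alpha values are illustrative assumptions, not code from the notebook) of how the adjusted base flows into the standard RoPE inverse-frequency computation:

# Minimal sketch of the NTK base change applied to RoPE frequencies.
# Values here (dim, base, alpha) are illustrative assumptions.
dim = 128                    # head dimension
base = 10000.0               # standard RoPE base
alpha = 8                    # the "a" from the snippet above
ntk_base = base * alpha ** (dim / (dim - 2))   # base change formula

def inv_freqs(b, d):
    # Standard RoPE: one frequency per pair of dimensions, 1 / b^(i/d)
    return [1.0 / (b ** (i / d)) for i in range(0, d, 2)]

plain = inv_freqs(base, dim)
ntk = inv_freqs(ntk_base, dim)
# The highest frequency is untouched (ratio 1.0) while the lowest is
# stretched by roughly the full alpha factor, which is why short-range
# behaviour is mostly preserved while the usable context grows.
print(plain[0] / ntk[0], plain[-1] / ntk[-1])   # ~1.0 and ~alpha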
Maybe it would be nice to add that option to exllama as well; with this technique, finetuning for higher context may not even be necessary.
This sounds pretty good! But I'm wondering how it would be implemented in exllama, since compress_pos_emb is already a RoPE scaler.
There's rotary_embedding_base, but it seems to be used for training purposes.
@Panchovix Someone posted this code on 4chan. I haven't had time to verify it as I'm on the move, but maybe that's it: https://boards.4chan.org/g/thread/94354163#p94356720
@alkeryn Thanks! It seems to work.
a = 4  # Similar to RoPE scaling: higher is more perplexity, but more ctx
self.rotary_embedding_base = self.rotary_embedding_base * a ** (self.head_dim / (self.head_dim - 2))
max_seq_len should be set the same way as you do with SuperHOT models (via -l). Maybe it can be set like this in model.py:
self.alpha_value = 1.0  # Similar to RoPE scaling: higher is more perplexity, but more ctx
And like this in model_init.py:
parser.add_argument("-a", "--alpha", type = float, help = "alpha for context size extension via embedding extension")
...
if args.alpha:
    model_config.alpha_value = args.alpha  # not exactly like this, but with this logic
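A rough, hypothetical sketch of how those pieces might fit together (class and attribute names are illustrative, not the actual exllama code), reusing the base-change formula from the snippet above:

# Hypothetical sketch: fold alpha_value into the rotary embedding base.
# Attribute names mirror the proposal above but are illustrative only.
class ConfigSketch:
    def __init__(self, head_dim = 128):
        self.head_dim = head_dim
        self.rotary_embedding_base = 10000.0
        self.alpha_value = 1.0   # 1.0 = no NTK scaling

    def apply_ntk_scaling(self):
        a = self.alpha_value
        self.rotary_embedding_base *= a ** (self.head_dim / (self.head_dim - 2))

# e.g. set model_config.alpha_value = args.alpha, then apply the scaling
# once before the rotary embeddings are built.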
Okay, I did an experimental PR to see if turbo wants to add it, or maybe to test it another way.
https://github.com/turboderp/exllama/pull/118
I'd like to see some results from finetuning before I go and add even more config options. If I built out ExLlama every time someone had an interesting idea on reddit it'd be an unmaintainable behemoth by now. It's already kind of unwieldy.
So to use this feature, should we first tune the model with LoRA or something similar? Since exllama does not support tuning right now, should I first use an AutoGPTQ LoRA?
@laoda513 For NTK RoPE scaling, finetuning is not needed. But based on my tests, SuperHOT models work better with both RoPE scaling + comb scaling.
For now, no loader supports NTK RoPE.
That PR adds experimental support only for exllama at the moment.
@Panchovix I don't quite understand how it would work better with RoPE + comb scaling, but that's interesting. So you put 4 for each? Though I think once we have comb finetunes, they'll probably outperform SuperHOT + RoPE scaling, or even the mix of both. Still, being able to use any model at any context length without a finetune is already great!
I have tested the change and get better results with compression at 4 and alpha at 4.
Using TheBloke_nous-hermes-13b-superhot-8k-GPTQ-4bit-128g, if I only have either compression or NTK RoPE enabled, it tells me it cannot find the secret messages I left embedded in the paper, but with alpha 4 and compression at 4 it retrieves them correctly.
@ottobunge Interesting, have you tried alpha 8 or more with no compression on a normal model? It would still be interesting to see finetunes made for NTK.
At 8k on Neko Institute LLaMA 13B 4bit 32g, with alpha 8 and compression 1, I get nonsense.
Trying alpha 10, and then alpha 4 with compression 4, on the same model to see the differences.
Alpha 10
The failure mode is worse at compression 4 / alpha 4 on plain LLaMA. This model is probably not great at the task xD
@ottobunge That makes sense, since the model was trained for 8k RoPE. But I was asking about alpha 8 on a non-8k-finetuned model with no compression.
That would be this https://github.com/turboderp/exllama/issues/115#issuecomment-1614311067
I'm downloading a non-fine-tuned version, but on the fine-tuned one I can run no compression at alpha 10 and get good results.
In fact, it follows the formatting of the prompt better than compression 4 / alpha 4.
With TheBloke_airoboros-13B-gpt4-1.4-GPTQ, a non-fine-tuned model, at alpha 10 it got 3 of the 4 passphrases, but in the wrong order.
The correct order is in the second image.
The best answer I got was like this.
If I shift the proportion more toward one or the other, it starts by misspelling milkshake as milshake, or fails altogether if I change the proportion too much, guessing cherry as the 4th, banana as the third, and missing milkshake entirely.
I have updated the PR.
Before, the alpha value wasn't being applied correctly (it stayed at 1.0). Now it is applied correctly, so setting alpha alone is enough for NTK RoPE scaling (without needing to set compress_pos_emb to the same value).
@ottobunge @alkeryn Can you guys test and see how it goes now? Results are WAY different, and IMO, better.
For tulu-30B-GPTQ (non-SuperHOT):
- Perplexity at 2048 ctx (no compress_pos_emb, no alpha RoPE): 5.2153
- Perplexity at 8192 ctx, compress_pos_emb = 4: 10.0813
- Perplexity at 8192 ctx, alpha = 4: 5.3534
- Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 15.4406
For Tulu-30B-SuperHOT-8K-4bit-32g:
- Perplexity at 2048 ctx (compress_pos_emb = 1, no alpha RoPE): 53.2788 (Basically, for <2048 ctx don't use SuperHOT models)
- Perplexity at 8192 ctx, compress_pos_emb = 4: 5.8166
- Perplexity at 8192 ctx, alpha = 4: 7.5073
- Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 6.0903
Basically, it seems that NTK RoPE scaling is better than we expected.
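For reference on how numbers like these are produced: perplexity is just the exponential of the average next-token negative log-likelihood over the evaluation window. A generic sketch (plain PyTorch, not exllama's actual perplexity test harness):

import torch

def perplexity(logits, target_ids):
    # logits: [seq_len, vocab] model outputs; target_ids: [seq_len] next-token ids
    log_probs = torch.log_softmax(logits.float(), dim = -1)
    nll = -log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    return torch.exp(nll.mean()).item()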
How about the memory cost increase for inference and training? Is it linear? For example, 1x for 2k and 2x for 4k.
And I think this is very exciting and interesting! When I think about it more, if we can easily extend a model trained at 2k to 8k, does that mean we can extend a model trained at 512 to 2k? And I think this doesn't really extend the 'attention'; it just spreads the same amount of attention over a longer context, right? It's kind of like a human reading quickly.
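For the inference side, the memory that grows with context is mostly the KV cache, and that part is linear in sequence length (attention compute, without fused kernels, grows quadratically). A rough back-of-the-envelope sketch, using illustrative LLaMA-13B-ish shapes (assumed values, not measurements from this thread):

# Rough KV-cache size estimate: linear in sequence length.
# Shapes below are illustrative (roughly LLaMA-13B), not measured values.
n_layers = 40
n_heads = 40
head_dim = 128
bytes_per_elem = 2          # fp16 cache

def kv_cache_bytes(seq_len):
    # 2 tensors (K and V) per layer, each [seq_len, n_heads, head_dim]
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

for ctx in (2048, 4096, 8192):
    print(ctx, kv_cache_bytes(ctx) / 2**30, "GiB")
# 2k -> ~1.6 GiB, 4k -> ~3.1 GiB, 8k -> ~6.3 GiB: doubling ctx doubles the cache.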
For training itself, sadly I'm not sure how it would be applied :(.
Also, thanks turbo for the PR merge!
Now NTK RoPE scaling can be used on exllama.
Thank you everyone, I'm closing the issue! :)