
support for trust_remote_code / 8k context

Open BlairSadewitz opened this issue 1 year ago • 10 comments

Hello,

There are a number of models I'd like to try which require this. I know I asked you about this in the past, and IIRC you mentioned that you removed it because you wanted to implement it properly. In the interim, would you kindly tell me what I have to change in order to pass this flag to the appropriate call(s)? You don't have to cover every conceivable situation or type of model; just hf or hf_torch, whichever is needed to load, e.g., llama-based models or maybe Falcon in 16-bit (don't worry about 8-bit or 4-bit loading). I'd just as happily patch transformers itself; whatever gets it to work. I'm mostly trying to load models with increased context size.

Thanks.

BlairSadewitz avatar Jul 19 '23 14:07 BlairSadewitz
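
For reference, the flag being asked about is the standard trust_remote_code argument on the Hugging Face from_pretrained calls. A minimal sketch of passing it through for a 16-bit load (the model path below is a placeholder, and this is the plain transformers call rather than KoboldAI's own loader):

    # Sketch: passing trust_remote_code through the standard Hugging Face
    # loading calls. The model path is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "some/custom-architecture-model"  # placeholder

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,   # 16-bit load, as requested above
        trust_remote_code=True,      # allow the repo's custom modeling code to run
    )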

Being able to use a monkey patch would be cool, too, but I assume that's even more work.

BlairSadewitz avatar Jul 19 '23 14:07 BlairSadewitz

What I am most interested in is being able to use models which use this:

https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_rope_scaled_monkey_patch-16k.py

Most of them are 8k.

https://huggingface.co/TheBloke/airoboros-33B-gpt4-1-4-SuperHOT-8K-fp16/tree/main

BlairSadewitz avatar Jul 19 '23 15:07 BlairSadewitz
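
For context, the linked monkey patch implements linear position interpolation: it rebuilds the llama rotary embedding so the position indices are divided by a scale factor, squeezing an 8k or 16k context into the positional range the model was fine-tuned for. A rough sketch of the idea (the class name, defaults, and scale here are illustrative, not the exact values from the linked file):

    # Illustrative sketch of the linear "position interpolation" idea behind
    # the linked monkey patch; names and defaults are not the exact ones used there.
    import torch

    class ScaledRotaryEmbedding(torch.nn.Module):
        def __init__(self, dim, max_position_embeddings=16384, base=10000, scale=8.0, device=None):
            super().__init__()
            inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device).float() / dim))
            self.register_buffer("inv_freq", inv_freq)
            # Positions are divided by the scale factor so a long context is
            # compressed into the range the base model was trained on.
            t = torch.arange(max_position_embeddings, device=device, dtype=inv_freq.dtype) / scale
            freqs = torch.einsum("i,j->ij", t, inv_freq)
            emb = torch.cat((freqs, freqs), dim=-1)
            self.register_buffer("cos_cached", emb.cos()[None, None, :, :])
            self.register_buffer("sin_cached", emb.sin()[None, None, :, :])

        def forward(self, x, seq_len=None):
            # Return the cached cos/sin tables, truncated to the current sequence length.
            return (
                self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
                self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            )

A patch along these lines then swaps this class in for every layer's self_attn.rotary_emb after the model is loaded.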

This is planned as a separate addon but is currently unfinished.

henk717 avatar Jul 19 '23 16:07 henk717

Oh, OK, fair enough. Whenever you have a spare moment, would you kindly tell me where in the code the call that loads a 16-bit llama-based model (you know, one I'd download from HF) is, so I could just rig it myself to work? Whenever I have the time, I will figure out how to use Python to just tell me the line number. If that happens before you get around to replying to this, I'll close out the PR. It could be either the code in KoboldAI or the code in transformers itself; I don't care which.

BlairSadewitz avatar Jul 20 '23 16:07 BlairSadewitz

The easiest way to do it is with our Basic HF backend, since there it will be in the from_pretrained lines; in the main backend it's quite complicated. The hold-up is that the Basic HF backend is unfinished and unstable, so your mileage may strongly vary.

henk717 avatar Jul 20 '23 18:07 henk717

Hmm, yeah, I'm having some issues with it. :(

Check this out, though: RoPE scaling got merged into transformers. Models don't have to be pretrained with it to use it, though apparently you lose some accuracy if they weren't. Maybe you'd want to add support for this at some point? It works for GPT-NeoX, too, according to the chatter online.

https://github.com/huggingface/transformers/commit/34d94094279d2c903d9d8a51a65edb265f22c849#diff-9ba75cc28be7924a2fc43de1d2c8c7779ad597129d33d1af39153951463cd0bc

Also, there's this:

https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

The patch is three lines. That code ameliorates the perplexity degradation. Here's a Colab:

https://colab.research.google.com/drive/1VI2nhlyKvd5cw4-zHvAIk00cAVj2lCCC#scrollTo=b80b3f37

BlairSadewitz avatar Jul 22 '23 01:07 BlairSadewitz
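
For reference, the "three lines" in the NTK-aware approach rescale the rotary base frequency instead of the position indices, which is why it degrades perplexity less at short contexts. A hedged sketch of the core change (the alpha value is illustrative):

    # Sketch of the core NTK-aware scaling trick from the linked post:
    # grow the rotary base as a function of alpha instead of dividing the
    # positions. alpha=8.0 here is just an example value.
    import torch

    def ntk_scaled_inv_freq(dim, base=10000.0, alpha=8.0, device=None):
        base = base * alpha ** (dim / (dim - 2))  # the key line
        return 1.0 / (base ** (torch.arange(0, dim, 2, device=device).float() / dim))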

I just noticed everything you merged. Thanks! I'd been hopping between forks, and this makes my life a lot easier.

BlairSadewitz avatar Jul 23 '23 21:07 BlairSadewitz

In case you aren't aware, transformers now has support for rope scaling.

https://huggingface.co/docs/transformers/main/model_doc/llama#transformers.LlamaConfig

BlairSadewitz avatar Aug 01 '23 04:08 BlairSadewitz

We automatically use RoPE scaling if it's present in a model's config. Manual control for it is planned.

henk717 avatar Aug 01 '23 11:08 henk717
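
Until manual control lands, one stop-gap (a sketch, with a placeholder model path and factor) is to override the config field at load time, since from_pretrained passes matching keyword arguments into the model config:

    # Sketch: overriding rope_scaling at load time via a config kwarg.
    # Model path and scaling factor are placeholders.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "some/llama-model",                               # placeholder
        torch_dtype=torch.float16,
        rope_scaling={"type": "linear", "factor": 4.0},   # e.g. 4x the pretraining context
    )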

Ooh, nice. That makes my life a lot easier.

Incidentally, I stumbled upon this:

https://github.com/jquesnelle/scaled-rope

Basically, it builds a wheel with the necessary code to support all these different scaling methods along with patch functions, e.g.

    def patch_llama_for_linear_scaled_rotary_embeddings(model, scale):
        from .LlamaLinearScaledRotaryEmbedding import LlamaLinearScaledRotaryEmbedding
        for each in model.model.layers:
            each.self_attn.rotary_emb = LlamaLinearScaledRotaryEmbedding(
                each.self_attn.head_dim,
                scale=scale,
                device=each.self_attn.rotary_emb.inv_freq.device)

I found it because I was having problems loading some models because of their layers, which this takes care of.

BlairSadewitz avatar Aug 04 '23 18:08 BlairSadewitz
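
If it helps anyone else reading along, usage of those helpers appears to be just a post-load patch call. A hedged example (the model path and scale are placeholders, and the import path is an assumption rather than something verified against the repo):

    # Hedged usage sketch for the patch function quoted above. The import
    # path, model path, and scale value are assumptions.
    import torch
    from transformers import AutoModelForCausalLM
    from scaled_rope.patch import patch_llama_for_linear_scaled_rotary_embeddings  # assumed path

    model = AutoModelForCausalLM.from_pretrained(
        "some/llama-model",            # placeholder
        torch_dtype=torch.float16,
    )
    # Swap every layer's rotary embedding for the linearly scaled variant.
    patch_llama_for_linear_scaled_rotary_embeddings(model, scale=4.0)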