Adding additional tokens to vocabulary
I am using models like EleutherAI/gpt-j-6B and llama-7b-hf for text generation.
I have added special tokens to the vocabulary as I want a structured output.
Prompt
"<|begincontext|>I want to make a restaurant reservation for 2 people at half past 11 in the morning.<|endcontext|>",
Target
"<|begintarget|><|begindsts|><|begindst|><|beginintent|>FindRestaurants<|endintent|><|beginbelief|><|endbelief|><|enddst|><|enddsts|><|beginuseraction|>INFORM_INTENT->Restaurants^intent~FindRestaurants<|enduseraction|><|beginaction|>REQUEST->Restaurants^city~<|endaction|><|beginresponse|>Do you have a specific which you want the eating place to be located at?<|endresponse|><|endtarget|>"
I have an example Colab Notebook https://colab.research.google.com/drive/16qKy92cGoNPWrlQ4zlvntVGeSgjrknVF?usp=sharing
I am able to train the model without any errors. However, when I perform inference, it does not produce any structured output; it just produces some random generation.
Here is a sample generation
<|endintent|> I'll make the reservation for 6 o"clock in the evening, for two people. I'll make the reservation for 6 o"clock in the evening, for two people. I'll make the reservation for 6 o"clock in the evening, for two people.
In my original code, when I train on a lot of data and plot the train/eval loss, I can see that it decreases to low values (train_loss = 0.2163, eval_loss = 0.2416). With such low loss values, I am surprised that the generation has absolutely no structure. With a GPT-2 model, training for a few steps on a small amount of data produces structured output.
Issue #326 talks about additional tokens in the vocabulary, which is similar to what I want to do.
Could you please give me some pointers on where I am going wrong?
Hello, during full finetuning, the embedding layer with the additional tokens is also trained, which is not the case when using PEFT LoRA as per the code you shared. I think this might work if you also train the embedding layer along with the LoRA layers. To do that, specify modules_to_save in LoraConfig as below for GPT-J:
config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=target_modules,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
base_model_name_or_path=model_name,
modules_to_save=["wte"]
)
Also, use the main branch of PEFT.
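For completeness, a minimal sketch of how the config is then applied (assuming model, tokenizer and target_modules are already defined as in your script):
from peft import get_peft_model

# The embeddings must already be resized to include the new tokens
model.resize_token_embeddings(len(tokenizer))

model = get_peft_model(model, config)
# "wte" should now show up among the trainable parameters
model.print_trainable_parameters()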
Thank you very much for your response. I will make the changes you suggested and get back to you.
As I mention in this issue: https://github.com/huggingface/peft/issues/349#issue-1677573675
Adding the embedding layer on any of the transformer models I've tried gives me a float error. Adding it to the LoRA target argument gives me an attribute error for bias. Has anyone actually tested using this out of the box, or am I missing something?
I have the same error.
https://github.com/huggingface/peft/issues/349#issuecomment-1527059611
Hello @adibMosharrof, see this comment please: https://github.com/huggingface/peft/pull/337#issuecomment-1527412343
@pacman100 I had tried what you suggested before, but that did not seem to work. I will try this new suggestion. P.S. I have been traveling, so I am sorry for not being responsive.
@pacman100 Thank you very much for looking into this issue.
In your example notebook, I can see that you no longer use
model = prepare_model_for_int8_training(model)
Initially I had that in my code and was having the same issues as before, but once I removed it I got better generation after training. Could you please explain why this is happening?
In my actual training script, I have a validation set similar to the CSV file I shared. I train for more epochs with a bigger batch size by using gradient accumulation. However, I see that the eval loss becomes NaN.
All these concepts are quite new to me, so I really don't understand how to solve this.
@pacman100 I ran the code from the notebook you shared in #337.
The only change I made was to load the model in 8-bit:
model_name = "EleutherAI/gpt-j-6B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
However, at the 2nd logging step, I see that my loss becomes 0.0.
{'loss': 4.8516, 'learning_rate': 2.5e-05, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 5e-05, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 7.5e-05, 'epoch': 0.12}
{'loss': 0.0, 'learning_rate': 0.0001, 'epoch': 0.16}
16%|█████████████████████████████████▊ | 20/123 [00:58<04:34, 2.66s/it]
Below is the output of generation
<|begincontext|><|user|>I am feeling hungry so I would like to find a place to eat.<|system|>Do you have a specific which you want the eating place to be located at?<|user|>I would like for it to be in San Jose.<|system|>Is there a specific cuisine type you enjoy, such as Mexican, Italian or something else?<|beginlastuserutterance|>I usually like eating the American type of food.<|endlastuserutterance|><|endcontext|>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
peft: 0.4.0.dev0 (4fd374e)
bitsandbytes: 0.38.1
I am confused why I am getting different results. It would be great if you could shed some light on this.
I was able to make things work and all my requirements have been fulfilled. @pacman100 Thank you very much for supporting me with this issue. I am really grateful for the effort you put in.
A running code example can be found in the notebook I had shared initially
https://colab.research.google.com/drive/16qKy92cGoNPWrlQ4zlvntVGeSgjrknVF?usp=sharing#scrollTo=tpfeUu0NKQRs
I would like to share a few things I stumbled upon so that others don't face the same issues.
8-bit Training
If you want to train in 8-bit, you actually need the line below:
model = prepare_model_for_int8_training(model)
{'eval_loss': 11.521041870117188, 'eval_runtime': 300.9554, 'eval_samples_per_second': 0.984, 'eval_steps_per_second': 0.083, 'epoch': 1.71}
{'eval_loss': 10.17847728729248, 'eval_runtime': 300.8959, 'eval_samples_per_second': 0.984, 'eval_steps_per_second': 0.083, 'epoch': 3.42}
{'eval_loss': 7.587922096252441, 'eval_runtime': 301.4311, 'eval_samples_per_second': 0.982, 'eval_steps_per_second': 0.083, 'epoch': 5.13}
{'eval_loss': 2.641119956970215, 'eval_runtime': 302.047, 'eval_samples_per_second': 0.98, 'eval_steps_per_second': 0.083, 'epoch': 6.84}
Bitsandbytes version
I had to revert to bitsandbytes==0.37.2 as well.
In version 0.38.1, I would get an out-of-memory exception when I called
model.save_pretrained(training_args.output_dir)
Additional tokens
Since I increased my vocabulary size by adding additional tokens, I had to add the resized layers to the modules_to_save option in LoraConfig.
For the GPT-J model, I had to use modules_to_save = ["lm_head", "wte"], and for LLaMA and Facebook OPT I had to use modules_to_save = ["lm_head", "embed_tokens"].
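As a rough illustration, this is how one could pick the module names per architecture (the mapping below only covers the models I tried; for anything else, inspect model.named_modules() to find the embedding and output-head names):
# Embedding/output-head module names differ per architecture; the entries below
# are the ones I used, verify against model.named_modules() for other models
MODULES_TO_SAVE = {
    "gptj": ["lm_head", "wte"],
    "llama": ["lm_head", "embed_tokens"],
    "opt": ["lm_head", "embed_tokens"],
}
modules_to_save = MODULES_TO_SAVE[model.config.model_type]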
CUDA call errors
Another caveat I found is that you cannot call model.cuda(), as it messes things up internally somehow.
I load the model in 8-bit, and it is loaded onto the GPU.
Hi @adibMosharrof, sorry to bother you since this issue has already been closed. Have you encountered any issues when loading the LoRA checkpoint that you trained? Or how would you recommend loading LoRAs trained with extra tokens? In my case, I do it this way:
model.resize_token_embeddings(len(tokenizer)) and then lora_model = PeftModel.from_pretrained(model, args.lora_name). But I get very different results compared to direct inference right after training. Thanks in advance!
When you add extra tokens, the embedding dimensions of some layers change. You have to add those layers in modules_to_save. Depending on your model, the modules_to_save can change. Please take a look at the example notebook I shared for a working example.
I have added some important parts below,
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import AutoModelForCausalLM

# model_name and tokenizer (with the extra tokens already added) are assumed to be defined
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
# Resize the embeddings to match the tokenizer with the added tokens
model.resize_token_embeddings(len(tokenizer))
model = prepare_model_for_int8_training(model)
modules_to_save = ["lm_head", "embed_tokens"]
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    base_model_name_or_path=model_name,
    modules_to_save=modules_to_save,
)
model = get_peft_model(model, config)
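For loading the trained adapter at inference time, a rough sketch of the idea (the key point is to resize the embeddings before loading the adapter; the paths below are placeholders, and the tokenizer with the added tokens is assumed to have been saved next to the adapter):
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "EleutherAI/gpt-j-6B"  # placeholder: the same base model used for training
adapter_dir = "path/to/adapter"          # placeholder: the training output directory

# Load the tokenizer that already contains the added tokens
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
# Resize before loading so the saved lm_head/embed_tokens shapes match
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, adapter_dir)
model.eval()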
Adding new tokens should make the embedding change size, but why does it affect the other layers? @adibMosharrof Also, the solution here greatly increases the effective parameter count; I am curious whether we could instead train only the embeddings for the newly added tokens?
I used this method and did SFT training; however, the gradient norm is huge. Is this normal?
This did not work with setup_chat_format and Llama 2 7B. It still says:
size mismatch for base_model.model.model.embed_tokens.modules_to_save.default.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
size mismatch for base_model.model.lm_head.modules_to_save.default.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
Looking for help ='(
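For what it's worth, that mismatch suggests the base model still has the original 32000-token vocabulary when the adapter is loaded. A sketch of a loading order that should avoid it, under the assumption that setup_chat_format from trl was applied before training (it adds the two extra chat tokens and resizes the embeddings to 32002); the paths are placeholders:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format

base_model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
adapter_dir = "path/to/adapter"               # placeholder

model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Re-apply the same chat formatting used at training time so the vocabulary
# and embedding shapes (32002) match the checkpoint before the adapter loads
model, tokenizer = setup_chat_format(model, tokenizer)

model = PeftModel.from_pretrained(model, adapter_dir)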
@deema-A Could you please provide the code that results in this error? Also, please always paste the full error message. Otherwise, it will be very hard to figure out the cause of the error.