Adding additional tokens to vocabulary
I am using models like EleutherAI/gpt-j-6B and llama-7b-hf for text generation.
I have added special tokens to the vocabulary as I want a structured output.
Prompt
"<|begincontext|>I want to make a restaurant reservation for 2 people at half past 11 in the morning.<|endcontext|>",
Target
"<|begintarget|><|begindsts|><|begindst|><|beginintent|>FindRestaurants<|endintent|><|beginbelief|><|endbelief|><|enddst|><|enddsts|><|beginuseraction|>INFORM_INTENT->Restaurants^intent~FindRestaurants<|enduseraction|><|beginaction|>REQUEST->Restaurants^city~<|endaction|><|beginresponse|>Do you have a specific which you want the eating place to be located at?<|endresponse|><|endtarget|>"
I have an example Colab Notebook https://colab.research.google.com/drive/16qKy92cGoNPWrlQ4zlvntVGeSgjrknVF?usp=sharing
I am able to train the model without any errors. However, when I perform inference, it does not produce any structured output; it just produces some random generation.
Here is a sample generation
<|endintent|> I'll make the reservation for 6 o"clock in the evening, for two people. I'll make the reservation for 6 o"clock in the evening, for two people. I'll make the reservation for 6 o"clock in the evening, for two people.
In my original code, when I train on a lot of data and plot the train/eval loss, I can see that it decreases to low values (train_loss = 0.2163, eval_loss = 0.2416). With such low loss values, I am surprised that the generation has absolutely no structure. With a GPT-2 model, training for a few steps on a small amount of data produces structured output.
Issue #326 talks about additional tokens in the vocabulary, which is similar to what I want to do.
Could you please give me some pointers on where I am going wrong?
Hello, during full finetuning, the embedding layer with the additional tokens is also trained, which is not the case when using PEFT LoRA as per the code you shared. I think this might work if you also train the embedding layer along with the LoRA layers. To do that, specify modules_to_save in LoraConfig as below for GPT-J:
config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=target_modules,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
base_model_name_or_path=model_name,
modules_to_save=["wte"]
)
Also, use the main branch of PEFT.
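For completeness, a minimal sketch of how the config is then applied (assuming model, tokenizer and target_modules are already defined as in your script):
from peft import get_peft_model

# The embeddings must already be resized to include the new tokens
model.resize_token_embeddings(len(tokenizer))

model = get_peft_model(model, config)
# "wte" should now show up among the trainable parameters
model.print_trainable_parameters()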
Thank you very much for your response. I will make the changes you suggested and get back to you.
As I mention in this issue: https://github.com/huggingface/peft/issues/349#issue-1677573675
Adding the embedding layer on any of the transformer models I've tried gives me a float error. Adding it to the LoRA target argument gives me an attribute error for bias. Has anyone actually tested using this out of the box, or am I missing something?
I have the same error.
https://github.com/huggingface/peft/issues/349#issuecomment-1527059611
Hello @adibMosharrof, see this comment please: https://github.com/huggingface/peft/pull/337#issuecomment-1527412343
@pacman100 I had tried what you suggested before, but that did not seem to work. I will try this new suggestion. P.S. I have been traveling, so I am sorry for not being responsive.
@pacman100 Thank you very much for looking into this issue.
In your example notebook, I can see that you no longer use
model = prepare_model_for_int8_training(model)
Initially I had that in my code and was having the same issues as before, but once I removed it I got better generation after training. Could you please explain why this is happening?
In my actual training script, I have a validation set similar to the CSV file I shared. I train for more epochs with a bigger batch size by using gradient accumulation. However, I see that the eval loss becomes NaN.
All these concepts are quite new to me, so I really don't understand how to solve this.
@pacman100 I ran the code from the notebook you shared in #337.
The only change I made was to load the model in 8-bit:
model_name = "EleutherAI/gpt-j-6B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
However, at the 2nd logging step, I see that my loss becomes 0.0.
{'loss': 4.8516, 'learning_rate': 2.5e-05, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 5e-05, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 7.5e-05, 'epoch': 0.12}
{'loss': 0.0, 'learning_rate': 0.0001, 'epoch': 0.16}
16%|█████████████████████████████████▊ | 20/123 [00:58<04:34, 2.66s/it]
Below is the output of generation
<|begincontext|><|user|>I am feeling hungry so I would like to find a place to eat.<|system|>Do you have a specific which you want the eating place to be located at?<|user|>I would like for it to be in San Jose.<|system|>Is there a specific cuisine type you enjoy, such as Mexican, Italian or something else?<|beginlastuserutterance|>I usually like eating the American type of food.<|endlastuserutterance|><|endcontext|>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
peft: 0.4.0.dev0 (4fd374e)
bitsandbytes: 0.38.1
I am confused why I am getting different results. It would be great if you could shed some light on this.
I was able to make things work and all my requirements have been fulfilled. @pacman100 Thank you very much for supporting me with this issue. I am really grateful for the effort you put in.
A running code example can be found in the notebook I had shared initially
https://colab.research.google.com/drive/16qKy92cGoNPWrlQ4zlvntVGeSgjrknVF?usp=sharing#scrollTo=tpfeUu0NKQRs
I would like to share a few things I stumbled upon so that others don't face the same issues.
8-bit Training
If you want to train in 8-bit, you actually need the line below:
model = prepare_model_for_int8_training(model)
{'eval_loss': 11.521041870117188, 'eval_runtime': 300.9554, 'eval_samples_per_second': 0.984, 'eval_steps_per_second': 0.083, 'epoch': 1.71}
{'eval_loss': 10.17847728729248, 'eval_runtime': 300.8959, 'eval_samples_per_second': 0.984, 'eval_steps_per_second': 0.083, 'epoch': 3.42}
{'eval_loss': 7.587922096252441, 'eval_runtime': 301.4311, 'eval_samples_per_second': 0.982, 'eval_steps_per_second': 0.083, 'epoch': 5.13}
{'eval_loss': 2.641119956970215, 'eval_runtime': 302.047, 'eval_samples_per_second': 0.98, 'eval_steps_per_second': 0.083, 'epoch': 6.84}
Bitsandbytes version
I had to revert to bitsandbytes==0.37.2 as well.
In version 0.38.1, I would get an out-of-memory exception when I called
model.save_pretrained(training_args.output_dir)
Additional tokens
Since I increased my vocabulary size by adding additional tokens, I had to add the resized layers to the modules_to_save option in LoraConfig.
For the GPT-J model, I had to use modules_to_save = ["lm_head", "wte"], and for LLaMA and Facebook OPT I had to use modules_to_save = ["lm_head", "embed_tokens"].
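As a rough illustration, this is how one could pick the module names per architecture (the mapping below only covers the models I tried; for anything else, inspect model.named_modules() to find the embedding and output-head names):
# Embedding/output-head module names differ per architecture; the entries below
# are the ones I used, verify against model.named_modules() for other models
MODULES_TO_SAVE = {
    "gptj": ["lm_head", "wte"],
    "llama": ["lm_head", "embed_tokens"],
    "opt": ["lm_head", "embed_tokens"],
}
modules_to_save = MODULES_TO_SAVE[model.config.model_type]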
CUDA call errors
Another caveat I found is that you cannot call model.cuda(), as it messes things up internally somehow.
I load the model in 8-bit, and it is loaded onto the GPU.
Hi @adibMosharrof, sorry to bother you since this issue has already been closed. Have you encountered any issues when loading the LoRA checkpoint that you trained? Or how would you recommend loading LoRAs trained with extra tokens? In my case, I do it this way:
model.resize_token_embeddings(len(tokenizer)) and then lora_model = PeftModel.from_pretrained(model, args.lora_name). But I get very different results compared to direct inference right after training. Thanks in advance!
When you add extra tokens, the embedding dimensions of some layers change. You have to add those layers in modules_to_save. Depending on your model, the modules_to_save can change. Please take a look at the example notebook I shared for a working example.
I have added some important parts below,
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import AutoModelForCausalLM

# model_name and tokenizer (with the extra tokens already added) are assumed to be defined
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
# Resize the embeddings to match the tokenizer with the added tokens
model.resize_token_embeddings(len(tokenizer))
model = prepare_model_for_int8_training(model)
modules_to_save = ["lm_head", "embed_tokens"]
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    base_model_name_or_path=model_name,
    modules_to_save=modules_to_save,
)
model = get_peft_model(model, config)
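For loading the trained adapter at inference time, a rough sketch of the idea (the key point is to resize the embeddings before loading the adapter; the paths below are placeholders, and the tokenizer with the added tokens is assumed to have been saved next to the adapter):
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "EleutherAI/gpt-j-6B"  # placeholder: the same base model used for training
adapter_dir = "path/to/adapter"          # placeholder: the training output directory

# Load the tokenizer that already contains the added tokens
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
# Resize before loading so the saved lm_head/embed_tokens shapes match
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, adapter_dir)
model.eval()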
Adding new tokens should make the embedding change size, but why does it affect the other layers? @adibMosharrof Also, the solution here greatly increases the effective parameter count; I am curious whether we could instead train only the embeddings for the newly added tokens?
I used this method and did SFT training; however, the gradient norm is huge. Is this normal?
This did not work with setup_chat_format and Llama 2 7B. It still says:
size mismatch for base_model.model.model.embed_tokens.modules_to_save.default.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
size mismatch for base_model.model.lm_head.modules_to_save.default.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
Looking for help ='(
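For what it's worth, that mismatch suggests the base model still has the original 32000-token vocabulary when the adapter is loaded. A sketch of a loading order that should avoid it, under the assumption that setup_chat_format from trl was applied before training (it adds the two extra chat tokens and resizes the embeddings to 32002); the paths are placeholders:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format

base_model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
adapter_dir = "path/to/adapter"               # placeholder

model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Re-apply the same chat formatting used at training time so the vocabulary
# and embedding shapes (32002) match the checkpoint before the adapter loads
model, tokenizer = setup_chat_format(model, tokenizer)

model = PeftModel.from_pretrained(model, adapter_dir)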
@deema-A Could you please provide the code that results in this error? Also, please always paste the full error message. Otherwise, it will be very hard to figure out the cause of the error.