
Documentation about setup_chat_format()

bibhas2 opened this issue 1 year ago

Documentation URL:

https://huggingface.co/docs/trl/en/sft_trainer#add-special-tokens-for-chat-format

In the section Add Special Tokens for Chat Format, the page encourages using setup_chat_format().

If one creates a tokenizer from a Hugging Face model, the tokenizer is already configured (with EOS, BOS, a chat template, etc.).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    token=access_token)
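
For example, one can confirm the existing configuration before touching anything (a quick check, assuming the Mistral tokenizer loaded above):

# The Instruct tokenizer already carries special tokens and a chat template.
print(tokenizer.bos_token, tokenizer.eos_token)
print(tokenizer.chat_template is not None)  # expected: True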

At this point, calling setup_chat_format() will completely override the tokenizer's settings with the ChatML defaults.

# Set up the chat format with default 'chatml' format
model, tokenizer = setup_chat_format(model, tokenizer)

It seems to me that this will do more harm than good.
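
A minimal way to avoid the override (just a sketch, not an official recommendation) is to call setup_chat_format() only when the tokenizer does not already define a chat template:

from trl import setup_chat_format

# Only install the ChatML template if the tokenizer does not already have one.
if tokenizer.chat_template is None:
    model, tokenizer = setup_chat_format(model, tokenizer)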

bibhas2 avatar Mar 20 '24 15:03 bibhas2

I also noticed that setup_chat_format adds 2 new tokens to the tokenizer. After saving the model and the tokenizer, this vocabulary-size difference becomes harder to manage:

	size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
	size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
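
One way to make the shapes line up again when reloading the checkpoint (a sketch, assuming the tokenizer was saved after setup_chat_format and therefore already contains the two extra tokens; the paths below are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer = AutoTokenizer.from_pretrained("path/to/saved/tokenizer")  # placeholder path

# Grow the base model's embeddings (32000 -> 32002) to match the enlarged vocabulary
# before loading the adapter, so embed_tokens / lm_head match the checkpoint shapes.
base.resize_token_embeddings(len(tokenizer))

model = PeftModel.from_pretrained(base, "path/to/saved/adapter")  # placeholder path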

joaomsimoes avatar Mar 26 '24 07:03 joaomsimoes

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Apr 20 '24 15:04 github-actions[bot]

I have this issue:

size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

deema-A avatar Jul 19 '24 16:07 deema-A

I have the same issue too. Did you guys solve this problem? @deema-A @joaomsimoes

architectyou avatar Sep 23 '24 01:09 architectyou

I have the same issue too. Did you guys solve this problem? @deema-A @joaomsimoes

Nope ='\

deema-A avatar Sep 23 '24 04:09 deema-A

Hey, maybe you can help me: I'm trying to fine-tune Llama:


base_model = "meta-llama/Llama-3.2-1B-Instruct"

torch_dtype = torch.float16
attn_implementation = "eager"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation,
)

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:  # needed for 16 bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)


modules = find_all_linear_names(model)

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)

model, tokenizer = setup_chat_format(model, tokenizer)
model = get_peft_model(model, peft_config)

But I got an exception:

  line 101, in setup_chat_format
    raise ValueError
ValueError: Chat template is already added to the tokenizer. If you want to overwrite it, please set it to None

Maybe someone can help, thanks in advance.

mik8142 avatar Dec 16 '24 17:12 mik8142

I found a solution that works for me:

if hasattr(tokenizer, "chat_template") and tokenizer.chat_template is not None:
    tokenizer.chat_template = None  # Reset the chat template

model, tokenizer = setup_chat_format(model, tokenizer)
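
Another option (a sketch, assuming the built-in Llama 3.2 template is acceptable instead of ChatML) is to skip setup_chat_format() entirely and format samples with the template the tokenizer already ships with:

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi, how can I help?"},
]

# Reuse the chat template that comes with Llama-3.2-1B-Instruct instead of replacing it.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)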

mik8142 avatar Dec 16 '24 17:12 mik8142