
[PROBLEM] DeepSpeed-Chat create_hf_model LLaMA token ID question

Open syngokhan opened this issue 2 years ago • 1 comments

Hello, and keep up the good work.

I'm stuck on one point here and would like your input. When we first set up the tokenizer, these were the special tokens for the OPT models:

OPT TOKEN ID:
{'bos_token': '</s>',
 'eos_token': '</s>',
 'unk_token': '</s>',
 'pad_token': '</s>'}

Then, when the model is created in "create_hf_model", we set:

model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id
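For OPT this assignment is internally consistent, since pad and eos are already the same token. A minimal mock of that logic (hypothetical class names and ids for illustration; this is not the actual DeepSpeed-Chat code):

```python
class MockConfig:
    # Stands in for model.config; ids unset until create_hf_model assigns them.
    eos_token_id = None
    pad_token_id = None

class MockOPTTokenizer:
    # For OPT, bos/eos/unk/pad all map to '</s>'; id 2 is illustrative.
    eos_token_id = 2
    pad_token_id = 2

config, tok = MockConfig(), MockOPTTokenizer()

# The two assignments from create_hf_model quoted above:
config.eos_token_id = tok.eos_token_id
config.pad_token_id = config.eos_token_id

# Harmless for OPT: the tokenizer's pad id was already the eos id.
assert config.pad_token_id == tok.pad_token_id == 2
```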

But as far as I can see, the LLaMA models don't ship with a pad_token, so we add one later, in the "load_hf_tokenizer" section.

In short:

LLAMA SPECIAL TOKENS BEFORE ADDING THE PAD TOKEN:
{'bos_token': '<s>', 
'eos_token': '</s>', 
'unk_token': '<unk>'}

...
tokenizer.add_special_tokens({"pad_token" : "[PAD]"})
.....


But when we then load the LLaMA model through "create_hf_model", it doesn't seem to pick up our pad_token change:

model.config.eos_token_id = tokenizer.eos_token_id ----> </s>
model.config.pad_token_id = model.config.eos_token_id --> </s>

Shouldn't it instead be set from the tokenizer, as:

model.config.pad_token_id = tokenizer.pad_token_id --> [PAD]
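The mismatch being described can be sketched with a minimal mock (hypothetical ids; the real LLaMA ids may differ, and this is not the actual DeepSpeed-Chat code):

```python
class MockConfig:
    eos_token_id = None
    pad_token_id = None

class MockLlamaTokenizer:
    # After tokenizer.add_special_tokens({"pad_token": "[PAD]"}), the pad
    # token gets a fresh id appended to the vocab (32000 is illustrative).
    eos_token_id = 2        # '</s>'
    pad_token_id = 32000    # '[PAD]'

config, tok = MockConfig(), MockLlamaTokenizer()

# Current behavior: the pad id silently becomes the eos id.
config.eos_token_id = tok.eos_token_id
config.pad_token_id = config.eos_token_id
assert config.pad_token_id != tok.pad_token_id  # the mismatch in question

# Suggested behavior: take the pad id from the tokenizer instead.
config.pad_token_id = tok.pad_token_id
assert config.pad_token_id == 32000
```

One related caveat: because "[PAD]" grows the vocabulary, the model's embedding matrix would also need resizing (e.g. model.resize_token_embeddings(len(tokenizer))) if create_hf_model does not already do so.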


Can you explain whether a different approach is intended for the model's token IDs versus the tokenizer's token IDs?

@awan-10

syngokhan avatar Sep 05 '23 10:09 syngokhan

@syngokhan - there's no need to worry about the pad token; it can be anything. The pad embedding never affects the output.

EeyoreLee avatar Dec 20 '23 06:12 EeyoreLee
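The claim in the answer above (that the pad embedding cannot leak into the outputs, because padded positions are masked out of attention) can be sketched with a toy single-head attention in NumPy. The identity projections and sizes are illustrative, not LLaMA's actual layer:

```python
import numpy as np

def masked_attention(x, attention_mask):
    # Toy single-head self-attention with identity Q/K/V projections.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    # Padded key positions get a large negative score for every query,
    # so their softmax weight underflows to zero.
    scores = np.where(attention_mask[None, :] == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
real = rng.normal(size=(3, 4))      # three "real" token embeddings
mask = np.array([1, 1, 1, 0])       # last position is padding

# Run twice with two different random pad embeddings.
out_a = masked_attention(np.vstack([real, rng.normal(size=(1, 4))]), mask)
out_b = masked_attention(np.vstack([real, rng.normal(size=(1, 4))]), mask)

# Outputs at the real positions are identical either way.
assert np.allclose(out_a[:3], out_b[:3])
```

This is why the reply says the pad token "can be anything": as long as the attention mask marks the pad positions, the value stored in the pad embedding never reaches the real tokens' representations (and with left-to-right causal masking, real tokens preceding right-side padding could not attend to it anyway).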