
Potential Duplication of BOS Token

Open chrisliu298 opened this issue 1 year ago • 0 comments

I noticed that, for default (sequence classification) models with a chat template defined in the tokenizer, scripts/run_rm.py formats each conversation via tokenizer.apply_chat_template (through the function prepare_dialogue_from_tokenizer) and then uses the text classification pipeline to process the formatted conversations. Given that 1) many models' tokenizers (e.g., the Llama-3 instruct series, the Gemma-2 instruct series, etc.) emit the bos_token in the chat template itself, and 2) the pipeline adds another bos_token during tokenization, does this mean these models see two BOS tokens in the forward pass?
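To make the suspected path concrete, here is a minimal, self-contained sketch using a toy tokenizer that mimics the relevant behavior (a chat template that already prepends bos_token, plus a tokenize step that adds special tokens). All names here are illustrative stand-ins, not the actual reward-bench or transformers code:

```python
# Toy illustration of the suspected double-BOS path.
# BOS/BOS_ID and ToyTokenizer are hypothetical stand-ins for a real
# HF tokenizer whose chat template already emits bos_token.

BOS = "<|begin_of_text|>"  # e.g., Llama-3's BOS string
BOS_ID = 0


class ToyTokenizer:
    """Mimics a tokenizer whose chat template prepends bos_token."""

    def apply_chat_template(self, messages, tokenize=False):
        # Templates like Llama-3's include bos_token in the template text.
        text = BOS + "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
        return self.encode(text, add_special_tokens=False) if tokenize else text

    def encode(self, text, add_special_tokens=True):
        # Crude stand-in tokenization: one id per whitespace-separated piece,
        # with the BOS string mapped to BOS_ID (all other ids are nonzero).
        parts = text.replace(BOS, f" {BOS} ").split()
        ids = [BOS_ID if p == BOS else (hash(p) % 1000) + 1 for p in parts]
        if add_special_tokens:
            # The pipeline tokenizes the already-templated text with special
            # tokens enabled, prepending a second BOS.
            ids = [BOS_ID] + ids
        return ids


tok = ToyTokenizer()
msgs = [{"role": "user", "content": "hi"}]
formatted = tok.apply_chat_template(msgs)  # template already adds BOS
ids = tok.encode(formatted)                # pipeline adds BOS again
print(ids[:2])  # → [0, 0]: two BOS ids at the start
```

Tokenizing directly with `apply_chat_template(..., tokenize=True)` in this sketch yields only a single BOS, which matches the behavior the issue describes.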

I also realized that some models (e.g., ArmoRM) inherently avoid this potential issue via a customized pipeline that tokenizes directly with tokenizer.apply_chat_template (as opposed to first formatting to text and then tokenizing).
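For models that cannot use the direct-tokenization path, one possible mitigation would be a small guard that drops a duplicated leading BOS id before the forward pass. This is a hypothetical helper, not something reward-bench currently provides:

```python
def dedupe_leading_bos(ids, bos_id):
    """Drop one BOS id if the sequence starts with two of them
    (one from the chat template, one from the tokenizer)."""
    if len(ids) >= 2 and ids[0] == bos_id and ids[1] == bos_id:
        return ids[1:]
    return ids


# Example: a sequence that picked up BOS twice (bos_id = 0 here).
print(dedupe_leading_bos([0, 0, 15, 42], bos_id=0))  # → [0, 15, 42]
print(dedupe_leading_bos([0, 15, 42], bos_id=0))     # → unchanged
```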

chrisliu298 · Aug 20 '24 02:08