
[BUG] Tokenizer config has add_bos_token=true while LLM Studio is training with add_special_tokens=False

Open pascal-pfeiffer opened this issue 1 year ago • 1 comment

🐛 Bug

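The generated tokenizer_config.json has add_bos_token=true, while H2O LLM Studio trains with add_special_tokens=False. When the exported tokenizer is loaded with the default AutoTokenizer, inputs can therefore be tokenized differently than they were during training (for example, with a BOS token prepended).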

We should be explicit and correct about this and set add_bos_token=false in the generated tokenizer_config.json.
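Until the export is fixed, an already exported config could be patched by hand. The following is only a minimal sketch using the Python standard library: the helper name `disable_flags` is hypothetical (it is not part of H2O LLM Studio or transformers), and the demo runs on a throwaway file that stands in for a real exported tokenizer_config.json.

```python
import json
import tempfile
from pathlib import Path


def disable_flags(config_path, flags=("add_bos_token",)):
    """Set the given flags to False in a tokenizer_config.json file.

    Hypothetical helper for illustration; not part of H2O LLM Studio.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    for flag in flags:
        # Write the flag explicitly, even if it was missing from the file.
        config[flag] = False
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demo on a temporary file standing in for an exported tokenizer_config.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "tokenizer_config.json"
    demo_path.write_text(
        json.dumps({"add_bos_token": True, "model_max_length": 2048}),
        encoding="utf-8",
    )
    patched = disable_flags(demo_path)
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))
```

Other keys in the file are left untouched. Further flags can be passed through the `flags` argument if the same mismatch applies to them.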

To Reproduce

Fine-tune a model, then download it or push it to HF.

LLM Studio version

<=1.4.1, b70b04f68d16ae73524d7f38f45e571ddb92cfc3

pascal-pfeiffer, Mar 21 '24 07:03

add_eos_token=false as well

psinger, Mar 21 '24 07:03