unitxt
HFSystemFormat Exception
When HFSystemFormat is used to define the format while loading a dataset, the following error is shown:
An error occurred while generating the dataset
As its last step, HFSystemFormat applies a chat template to the messages:
....
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
....
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
If the tokenizer for self.model_name has no chat_template entry in its tokenizer_config.json (e.g. granite 34b code instruct), apply_chat_template raises an exception.
This code reproduces the problem:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-20b-code-base")
chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False))
ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
Is it possible to improve the error handling in this case? I don't know whether it's possible to provide a default chat template, or to make the error more explicit so the user can take action.
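To make the request concrete, here is a rough sketch of the kind of check I have in mind. It is not unitxt code: apply_chat_template_checked and FakeTokenizer are hypothetical names I made up for illustration, and the sketch assumes that an unset template shows up as a missing or None chat_template attribute on the tokenizer (which is what the ValueError above suggests). FakeTokenizer only stands in for a real tokenizer so the example runs without downloading a model.

```python
def apply_chat_template_checked(tokenizer, messages, model_name):
    """Apply the chat template, failing early with an actionable message."""
    # Assumption: an unset template appears as a missing or None `chat_template`.
    if getattr(tokenizer, "chat_template", None) is None:
        raise ValueError(
            f"HFSystemFormat cannot be used with model '{model_name}': its "
            "tokenizer does not define a chat template. Pick a model whose "
            "tokenizer_config.json includes chat_template, or use another format."
        )
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


class FakeTokenizer:
    """Hypothetical stand-in for a real tokenizer (illustration only)."""

    def __init__(self, chat_template=None):
        self.chat_template = chat_template

    def apply_chat_template(self, messages, tokenize=False, add_generation_prompt=True):
        # Trivial rendering; a real tokenizer would render its Jinja template.
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


if __name__ == "__main__":
    chat = [{"role": "user", "content": "Hello, how are you?"}]
    try:
        apply_chat_template_checked(
            FakeTokenizer(), chat, "ibm-granite/granite-20b-code-base"
        )
    except ValueError as err:
        print(err)
```

With a check like this, the message would name the model and the format involved, instead of surfacing only as "An error occurred while generating the dataset".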
Thanks.