HFSystemFormat Exception

Open allyssonf opened this issue 1 year ago • 0 comments

When HFSystemFormat is used to define the format while loading a dataset, the following error is shown:

An error occurred while generating the dataset

At the end, HFSystemFormat applies a chat template to the messages:

        ....

        tokenizer = AutoTokenizer.from_pretrained(self.model_name)

        ....

        tokenized_chat = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

If the tokenizer for self.model_name has no chat_template attribute in its tokenizer_config.json (e.g. granite 34b code instruct), apply_chat_template raises an exception.

This code shows the problem:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-20b-code-base")

chat = [
   {"role": "user", "content": "Hello, how are you?"},
   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
   {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

print(tokenizer.apply_chat_template(chat, tokenize=False))

ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating

Is it possible to improve the error handling in this case? I don't know whether it's possible to provide a default chat template, or to make the error more explicit so the user can take action.

Thanks.

allyssonf, Aug 13 '24 13:08