
apply_chat_template return_assistant_tokens_mask does not work for Qwen2.5

Open DogeWatch opened this issue 1 year ago • 2 comments

System Info

huggingface-hub-0.25.2 tokenizers-0.20.1 transformers-4.45.2

Who can help?

@ArthurZucker @itazap

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained("Qwen2___5-7B-Instruct", trust_remote_code=True)
new_chat_template = '{%- if tools %}\n    {{- \'<|im_start|>system\\n\' }}\n    {%- if messages[0][\'role\'] == \'system\' %}\n        {{- messages[0][\'content\'] }}\n    {%- else %}\n        {{- \'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\' }}\n    {%- endif %}\n    {{- "\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>" }}\n    {%- for tool in tools %}\n        {{- "\\n" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- "\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\"name\\": <function-name>, \\"arguments\\": <args-json-object>}\\n</tool_call><|im_end|>\\n" }}\n{%- else %}\n    {%- if messages[0][\'role\'] == \'system\' %}\n        {{- \'<|im_start|>system\\n\' + messages[0][\'content\'] + \'<|im_end|>\\n\' }}\n    {%- else %}\n        {{- \'<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n\' }}\n    {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\' + message.content + \'<|im_end|>\' + \'\\n\' }}\n    {%- elif message.role == "assistant" %}\n        {% generation %} {{- \'<|im_start|>\' + message.role }}\n        {%- if message.content %}\n            {{- \'\\n\' + message.content }}\n        {%- endif %}\n        {%- for tool_call in message.tool_calls %}\n            {%- if tool_call.function is defined %}\n                {%- set tool_call = tool_call.function %}\n            {%- endif %}\n            {{- \'\\n<tool_call>\\n{"name": "\' }}\n            {{- tool_call.name }}\n            {{- \'", "arguments": \' }}\n            {{- tool_call.arguments | tojson }}\n            {{- \'}\\n</tool_call>\' }}\n        {%- endfor %}\n        {{- \'<|im_end|>\\n\' }} {% endgeneration %}\n    {%- elif message.role == "tool" %}\n        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}\n            {{- \'<|im_start|>user\' }}\n        {%- endif %}\n        {{- \'\\n<tool_response>\\n\' }}\n        {{- message.content }}\n        {{- \'\\n</tool_response>\' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n            {{- \'<|im_end|>\\n\' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- \'<|im_start|>assistant\\n\' }}\n{%- endif %}\n'
msg = [{'role': 'user', 'content': ' \n\n我想起十年前,我在大学里学习生物医学。我很熟悉书本知识,但我发现学术资格并不总是足够的。我需要实际的经验 – 现场工作,才能真正了解行业和工作流程。因此,我决定在学习期间参加志愿工作和实习项目。我参加了医院志愿者活动,这让我有机会与医疗专业人员和病人沟通,也在一定程度上了解了他们的需要。我还在当地的药房实习,这帮助我更深入地了解了医药企业的商业模式和销售策略。最终,我毕业后找到一份医药销售代表的工作,因为我既有学术背景,又有现场工作经验,能够灵活处理各种情况,更好地满足客户需求。这份工作最终让我在职业生涯中获得了巨大成功。\n\n\n基于以上这段文本内容回答: 你参加的志愿者活动和实习项目可以带给你哪些实际的经验和收获? \n\n'}, {'role': 'assistant', 'content': '参加医院志愿者活动和实习项目让我有机会与医疗专业人员和病人沟通,了解他们的需要。在当地的药房实习让我更深入地了解了医药企业的商业模式和销售策略。这些实际的经验和收获帮助我更好地满足客户需求,并在职业生涯中取得了成功。'}]
output = tk.apply_chat_template(
    msg,
    chat_template=new_chat_template,
    tokenize=True,
    add_generation_prompt=True,
    padding=True,
    max_length=2048,
    truncation=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
print(output['assistant_masks'])

I modified the chat template to add the {% generation %} and {% endgeneration %} markers. The new_chat_template looks like this:

{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {% generation %} {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }} {% endgeneration %}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
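
One possible reading of the template above, offered as an assumption rather than a confirmed diagnosis: the first `if` branch inside the message loop already matches assistant messages that have no tool_calls, so such a message may never reach the `elif` branch that carries {% generation %}. The sketch below mirrors only that if/elif chain in plain Python. `branch_for` and its return labels are hypothetical names for illustration, not part of transformers or Jinja.

```python
def branch_for(message: dict, is_first: bool = False) -> str:
    """Name the template branch a message would render through.

    Mirrors the if/elif chain in the `for message in messages` loop.
    `is_first` stands in for Jinja's `loop.first`.
    """
    role = message.get("role")
    # A missing or empty `tool_calls` is falsy, as it is in the template.
    has_tool_calls = bool(message.get("tool_calls"))

    if (
        role == "user"
        or (role == "system" and not is_first)
        or (role == "assistant" and not has_tool_calls)
    ):
        return "plain"       # rendered outside {% generation %}
    elif role == "assistant":
        return "generation"  # rendered inside {% generation %}
    elif role == "tool":
        return "tool"
    return "skipped"         # e.g. a leading system message, handled earlier


# Same shape as `msg` in the reproduction: one user turn, one assistant turn.
messages = [
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": "answer"},
]
for i, m in enumerate(messages):
    print(m["role"], "->", branch_for(m, is_first=(i == 0)))
```

If this reading holds, neither message in the reproduction passes through the generation block, which would be consistent with a mask of all zeros.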

Expected behavior

The output is:

{'input_ids': [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 4710, 104100, 71618, 117476, 3837, 104786, 99562, 69249, 100134, 100206, 104316, 1773, 106922, 102364, 90286, 21894, 100032, 3837, 105984, 99879, 104380, 104303, 100684, 104014, 103170, 1773, 35946, 85106, 99912, 106187, 1365, 10236, 236, 108, 82224, 99257, 3837, 101901, 100690, 99794, 99717, 33108, 99257, 102054, 1773, 101886, 3837, 35946, 103930, 18493, 100134, 101072, 101061, 101411, 99257, 33108, 102774, 73345, 1773, 35946, 106057, 100634, 104907, 99600, 3837, 43288, 104029, 106211, 57218, 100182, 99878, 99653, 33108, 104693, 104063, 3837, 104477, 106931, 17447, 99794, 34187, 104056, 85106, 1773, 35946, 104241, 109233, 99471, 99218, 102774, 3837, 43288, 100364, 35946, 33126, 100403, 29490, 99794, 34187, 101356, 104385, 108555, 33108, 100352, 104238, 1773, 103941, 3837, 35946, 109981, 101958, 104191, 101356, 100352, 99661, 104066, 3837, 106811, 107203, 104380, 102193, 3837, 105320, 100647, 111930, 3837, 100006, 105128, 54542, 100646, 99559, 3837, 105344, 101929, 116932, 1773, 106039, 99257, 103941, 104029, 18493, 111978, 15946, 105067, 102334, 19108, 1773, 1406, 104210, 70589, 107083, 108704, 43815, 102104, 5122, 220, 56568, 101061, 9370, 104907, 99600, 33108, 102774, 73345, 73670, 99278, 104314, 102224, 99912, 106187, 33108, 104619, 11319, 4710, 151645, 198, 151644, 77091, 198, 101061, 100634, 104907, 99600, 33108, 102774, 73345, 104029, 106211, 57218, 100182, 99878, 99653, 33108, 104693, 104063, 3837, 99794, 104056, 85106, 1773, 18493, 109233, 99471, 99218, 102774, 104029, 33126, 100403, 29490, 99794, 34187, 101356, 104385, 108555, 33108, 100352, 104238, 1773, 100001, 99912, 106187, 33108, 104619, 100364, 35946, 105344, 101929, 116932, 90395, 18493, 111978, 15946, 104847, 19108, 1773, 151645, 198, 151644, 77091, 198], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'assistant_masks': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

The assistant_masks is all zeros, which is not correct: the tokens of the assistant reply should be marked with 1.
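
Reading a mask of several hundred entries by eye is error-prone. Here is a small sketch for summarizing it as index ranges. `assistant_spans` is a hypothetical helper written for this issue, not a transformers function, and it assumes the mask is a flat list of 0/1 integers like the one printed above.

```python
from itertools import groupby


def assistant_spans(mask):
    """Return (start, end) index pairs for each contiguous run of 1s.

    `end` is exclusive, so mask[start:end] is all ones.
    """
    spans = []
    pos = 0
    for value, run in groupby(mask):
        length = len(list(run))
        if value == 1:
            spans.append((pos, pos + length))
        pos += length
    return spans


print(assistant_spans([0, 0, 1, 1, 1, 0, 1]))  # [(2, 5), (6, 7)]
print(assistant_spans([0] * 8))                # []
```

Calling it on `output['assistant_masks']` from the reproduction would return an empty list for the output shown here, and each returned range could be passed to `tk.decode(output['input_ids'][start:end])` to see which text the mask covers.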

DogeWatch, Oct 15 '24 09:10