LLaMA-Factory

Template should not be truncated

Open dawnranger opened this issue 2 years ago • 5 comments

In the data preprocessing function,

https://github.com/hiyouga/LLaMA-Factory/blob/main/src/llmtuner/data/preprocess.py#L93

data is truncated after encode_multiturn if it is too long. However, some templates add special tokens after the query, and these should not be truncated. The expected behavior is to truncate only the input query and leave the special tokens that follow it intact.
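
A minimal sketch of the difference, working on plain lists of token ids. The function names `truncate_naive` and `truncate_query_only` and the split into prefix, query, and suffix ids are hypothetical and used only for illustration; they are not the actual code in preprocess.py.

```python
from typing import List


def truncate_naive(
    prefix_ids: List[int], query_ids: List[int], suffix_ids: List[int], max_length: int
) -> List[int]:
    """Cut the fully encoded prompt at max_length; trailing template tokens can be lost."""
    return (prefix_ids + query_ids + suffix_ids)[:max_length]


def truncate_query_only(
    prefix_ids: List[int], query_ids: List[int], suffix_ids: List[int], max_length: int
) -> List[int]:
    """Trim only the query so the template tokens before and after it always survive."""
    budget = max_length - len(prefix_ids) - len(suffix_ids)
    if budget < 0:
        raise ValueError("max_length is too small to hold the template tokens alone")
    return prefix_ids + query_ids[:budget] + suffix_ids


# Toy ids: 1 stands for a user token, 2 and 3 for the tokens that follow the query.
prefix, query, suffix = [1], [10, 11, 12, 13, 14], [2, 3]

naive = truncate_naive(prefix, query, suffix, max_length=5)
safe = truncate_query_only(prefix, query, suffix, max_length=5)
print(naive)  # [1, 10, 11, 12, 13]  (suffix tokens dropped)
print(safe)   # [1, 10, 11, 2, 3]    (suffix tokens kept)
```

With the naive cut, the model never sees the tokens that mark the start of the assistant turn; trimming only the query keeps them at the cost of a slightly shorter query.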

dawnranger, Nov 21 '23 03:11

It seems this would also need an input added to the demo. @hiyouga, is this a valuable feature to add?

puffy310, Nov 21 '23 05:11

@puffy310 will fix soon

hiyouga, Nov 21 '23 05:11

@dawnranger Why might system tokens appear after the query? Could you provide some examples?

Louis-y-nlp, Nov 27 '23 03:11

For example, in the chatglm2 template, \n\n答: comes after {{query}}; in the baichuan template, <reserved_103> comes after {{query}}; and in the qwen template, <|im_end|>\n<|im_start|>assistant\n comes after {{query}}.

  • chatglm2:
register_template(
    name="chatglm2",
    prompt=[
        "[Round {{idx}}]\n\n问:{{query}}\n\n答:"
    ]
)
  • baichuan:
register_template(
    name="baichuan",
    prompt=[
        {"token": "<reserved_102>"}, # user token
        "{{query}}",
        {"token": "<reserved_103>"}  # assistant token
    ]
)
  • qwen:
register_template(
    name="qwen",
    prompt=[
        {"token": "<|im_start|>"},
        "user\n{{query}}",
        {"token": "<|im_end|>"},
        "\n",
        {"token": "<|im_start|>"},
        "assistant\n"
    ]
)
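
One way to find the part of a template that must survive truncation is to split its prompt list at the {{query}} placeholder. The sketch below assumes the prompt format shown in the three templates above (plain strings mixed with {"token": ...} dicts); the helper `split_at_query` is hypothetical and not part of LLaMA-Factory.

```python
from typing import Dict, List, Tuple, Union

Part = Union[str, Dict[str, str]]


def split_at_query(
    prompt: List[Part], placeholder: str = "{{query}}"
) -> Tuple[List[Part], List[Part]]:
    """Return (parts before the query, parts after the query) for a prompt list."""
    for i, part in enumerate(prompt):
        if isinstance(part, str) and placeholder in part:
            before, after = part.split(placeholder, 1)
            prefix = list(prompt[:i]) + ([before] if before else [])
            suffix = ([after] if after else []) + list(prompt[i + 1:])
            return prefix, suffix
    raise ValueError("template has no query placeholder")


qwen_prompt: List[Part] = [
    {"token": "<|im_start|>"},
    "user\n{{query}}",
    {"token": "<|im_end|>"},
    "\n",
    {"token": "<|im_start|>"},
    "assistant\n",
]

qwen_prefix, qwen_suffix = split_at_query(qwen_prompt)
print(qwen_prefix)  # parts that precede the query
print(qwen_suffix)  # parts that follow the query and should never be truncated
```

Encoding the suffix parts separately gives the number of tokens to reserve before the query itself is cut.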

dawnranger, Nov 29 '23 07:11

Thank you, I had forgotten about these special tokens.

Louis-y-nlp, Nov 30 '23 02:11