LLaMA-Factory Template should not be truncated

In the data preprocessing function,

https://github.com/hiyouga/LLaMA-Factory/blob/main/src/llmtuner/data/preprocess.py#L93

data is truncated after encode_multiturn if it's too long. However, some templates add special tokens after the query, and these should not be truncated. The expected behavior is to truncate only the input query and leave the special tokens that follow it intact.
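
To make the expectation concrete, here is a minimal sketch of the difference between truncating the whole encoded prompt and truncating only the query. It is not the code in preprocess.py; the function names, the split into prefix/query/suffix token ID lists, and the max_source_len parameter are all assumptions made for illustration.

```python
from typing import List


def truncate_naive(
    prefix_ids: List[int],
    query_ids: List[int],
    suffix_ids: List[int],
    max_source_len: int,
) -> List[int]:
    """Cut the fully encoded prompt; the trailing template tokens can be lost."""
    return (prefix_ids + query_ids + suffix_ids)[:max_source_len]


def truncate_query_only(
    prefix_ids: List[int],
    query_ids: List[int],
    suffix_ids: List[int],
    max_source_len: int,
) -> List[int]:
    """Shorten only the query so the template tokens around it survive.

    All names here are hypothetical; they do not come from LLaMA-Factory.
    """
    reserved = len(prefix_ids) + len(suffix_ids)
    if reserved > max_source_len:
        raise ValueError("template tokens alone exceed max_source_len")
    budget = max_source_len - reserved  # tokens left over for the query
    return prefix_ids + query_ids[:budget] + suffix_ids


# Toy IDs: 1 = user token, 2 and 3 = tokens the template places after the query.
prefix, query, suffix = [1], [10, 11, 12, 13, 14], [2, 3]
naive = truncate_naive(prefix, query, suffix, max_source_len=5)
kept = truncate_query_only(prefix, query, suffix, max_source_len=5)
```

With these toy inputs, the naive version drops both trailing template tokens, while the query-only version keeps them and shortens the query to two tokens.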

Nov 21 '23 03:11 dawnranger

It seems this would also need an input added to the demo. @hiyouga, is this a valuable feature to add?

Nov 21 '23 05:11 puffy310

@puffy310 will fix soon

Nov 21 '23 05:11 hiyouga

@dawnranger Why might system tokens appear after the query? Could you provide some examples?

Nov 27 '23 03:11 Louis-y-nlp

@dawnranger Why might system tokens appear after the query? Could you provide some examples?

For example, in the chatglm2 template, \n\n答： comes after {{query}}; in the baichuan template, <reserved_103> comes after {{query}}; and in the qwen template, <|im_end|>\n<|im_start|>assistant\n comes after {{query}}.

chatglm2:

register_template(
    name="chatglm2",
    prompt=[
        "[Round {{idx}}]\n\n问：{{query}}\n\n答："
    ]
)

baichuan:

register_template(
    name="baichuan",
    prompt=[
        {"token": "<reserved_102>"}, # user token
        "{{query}}",
        {"token": "<reserved_103>"}  # assistant token
    ]
)

qwen:

register_template(
    name="qwen",
    prompt=[
        {"token": "<|im_start|>"},
        "user\n{{query}}",
        {"token": "<|im_end|>"},
        "\n",
        {"token": "<|im_start|>"},
        "assistant\n"
    ]
)
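
As a rough illustration of how the part that must survive truncation could be identified, the sketch below splits a prompt list like the ones above into the pieces before and after {{query}}. The helper split_around_query and the Piece alias are hypothetical and not part of LLaMA-Factory; the prompt lists are copied from the templates shown above.

```python
from typing import Dict, List, Tuple, Union

Piece = Union[str, Dict[str, str]]  # plain text or {"token": "..."}
QUERY = "{{query}}"


def split_around_query(prompt: List[Piece]) -> Tuple[List[Piece], List[Piece]]:
    """Return (pieces before the query, pieces after the query).

    Hypothetical helper for illustration only.
    """
    before: List[Piece] = []
    after: List[Piece] = []
    seen = False
    for piece in prompt:
        if not seen and isinstance(piece, str) and QUERY in piece:
            # Split the text piece that holds the placeholder itself.
            head, tail = piece.split(QUERY, 1)
            if head:
                before.append(head)
            if tail:
                after.append(tail)
            seen = True
        elif seen:
            after.append(piece)
        else:
            before.append(piece)
    if not seen:
        raise ValueError("prompt has no {{query}} placeholder")
    return before, after


CHATGLM2_PROMPT: List[Piece] = ["[Round {{idx}}]\n\n问：{{query}}\n\n答："]
QWEN_PROMPT: List[Piece] = [
    {"token": "<|im_start|>"},
    "user\n{{query}}",
    {"token": "<|im_end|>"},
    "\n",
    {"token": "<|im_start|>"},
    "assistant\n",
]

chatglm2_before, chatglm2_after = split_around_query(CHATGLM2_PROMPT)
qwen_before, qwen_after = split_around_query(QWEN_PROMPT)
```

The "after" pieces are the ones that would need to be encoded and reserved before the query is shortened.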

Nov 29 '23 07:11 dawnranger

Thank you, I forgot about these special tokens.

Nov 30 '23 02:11 Louis-y-nlp
