LLaMA-Factory
Template should not be truncated
In the data preprocessing function
https://github.com/hiyouga/LLaMA-Factory/blob/main/src/llmtuner/data/preprocess.py#L93
the data is truncated after encode_multiturn if it is too long. However, some templates add special tokens after the query, and these should not be truncated. The expected behavior is to truncate only the input query and leave the special tokens that follow it intact.
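A minimal sketch of the expected behavior, assuming the encoded prompt can be separated into the token IDs before the query, the query itself, and the template tokens after it. The function name `truncate_query_only`, its parameters, and the toy token IDs are hypothetical and are not part of LLaMA-Factory:

```python
from typing import List


def truncate_query_only(
    prefix_ids: List[int],
    query_ids: List[int],
    suffix_ids: List[int],
    max_source_length: int,
) -> List[int]:
    """Trim only the query so the template tokens on both sides survive."""
    # Budget left for the query once the fixed template tokens are counted.
    budget = max_source_length - len(prefix_ids) - len(suffix_ids)
    if budget < 0:
        raise ValueError("max_source_length is smaller than the template itself")
    return prefix_ids + query_ids[:budget] + suffix_ids


# Toy IDs: 1, 2 stand for template tokens before the query; 3, 4 for those after it.
prefix, query, suffix = [1, 2], [10, 11, 12, 13, 14], [3, 4]

# Truncating the concatenated sequence drops the trailing template tokens.
naive = (prefix + query + suffix)[:6]                  # [1, 2, 10, 11, 12, 13]

# Truncating only the query keeps them.
kept = truncate_query_only(prefix, query, suffix, 6)   # [1, 2, 10, 11, 3, 4]
```

With the naive slice, the tokens that tell the model the assistant turn has started are lost; trimming the query first keeps them at the cost of a shorter query.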
It seems this would also require adding an input to the demo. @hiyouga, is this a valuable feature to add?
@puffy310 will fix soon
@dawnranger Why might system tokens appear after the query? Could you provide some examples?
For example, in the chatglm2 template, \n\n答: comes after {{query}}; in the baichuan template, <reserved_103> comes after {{query}}; in the qwen template, <|im_end|>\n<|im_start|>assistant\n comes after {{query}}.
- chatglm2:
register_template(
    name="chatglm2",
    prompt=[
        "[Round {{idx}}]\n\n问:{{query}}\n\n答:"
    ]
)
- baichuan:
register_template(
    name="baichuan",
    prompt=[
        {"token": "<reserved_102>"},  # user token
        "{{query}}",
        {"token": "<reserved_103>"}  # assistant token
    ]
)
- qwen:
register_template(
    name="qwen",
    prompt=[
        {"token": "<|im_start|>"},
        "user\n{{query}}",
        {"token": "<|im_end|>"},
        "\n",
        {"token": "<|im_start|>"},
        "assistant\n"
    ]
)
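To make the "after {{query}}" part of each template explicit, here is a rough sketch that splits a prompt list at the placeholder. It assumes a prompt is a list of plain strings and {"token": ...} dicts, as in the templates above; the helper `split_around_query` is hypothetical and does not exist in LLaMA-Factory:

```python
from typing import Dict, List, Tuple, Union

Part = Union[str, Dict[str, str]]
QUERY = "{{query}}"


def split_around_query(prompt: List[Part]) -> Tuple[List[Part], List[Part]]:
    """Return the template parts that come before and after the query placeholder."""
    for i, part in enumerate(prompt):
        if isinstance(part, str) and QUERY in part:
            # Split the string element itself, since text may surround the placeholder.
            before, after = part.split(QUERY, 1)
            prefix = prompt[:i] + ([before] if before else [])
            suffix = ([after] if after else []) + prompt[i + 1:]
            return prefix, suffix
    raise ValueError("template has no {{query}} placeholder")


qwen_prompt = [
    {"token": "<|im_start|>"},
    "user\n{{query}}",
    {"token": "<|im_end|>"},
    "\n",
    {"token": "<|im_start|>"},
    "assistant\n",
]
qwen_prefix, qwen_suffix = split_around_query(qwen_prompt)
# qwen_suffix holds everything that must survive truncation:
# [{"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"}, "assistant\n"]
```

Encoding the suffix parts separately would give the number of tokens to reserve before cutting the query.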
Thank you, I forgot about these special tokens.