FastChat
New weights need more tokens for prompt
The new version of the weights uses upper-case role names in the prompt:
Example prompt (weights v1.1):
A chat between a user and an assistant.
USER: Hello! ASSISTANT: Hello! USER: How are you? ASSISTANT: I am good.
Compare different prompt styles:
# Assumes `tokenizer` is an already-loaded SentencePiece-based tokenizer that exposes `sp_model`.
print(tokenizer.sp_model.encode('. ### Human: This is a ', out_type=str))
print(tokenizer.sp_model.encode('. ### Assistant: This is a ', out_type=str))
print(tokenizer.sp_model.encode('. USER: This is a ', out_type=str))
print(tokenizer.sp_model.encode('. ASSISTANT: This is a ', out_type=str))
print(tokenizer.sp_model.encode('. User: This is a ', out_type=str))
print(tokenizer.sp_model.encode('. Assistant: This is a ', out_type=str))
print(tokenizer.sp_model.encode('. user: This is a ', out_type=str))
print(tokenizer.sp_model.encode('. assistant: This is a ', out_type=str))
The output is here:
['▁.', '▁###', '▁Human', ':', '▁This', '▁is', '▁a', '▁']
['▁.', '▁###', '▁Ass', 'istant', ':', '▁This', '▁is', '▁a', '▁']
['▁.', '▁US', 'ER', ':', '▁This', '▁is', '▁a', '▁']
['▁.', '▁A', 'SS', 'IST', 'ANT', ':', '▁This', '▁is', '▁a', '▁']
['▁.', '▁User', ':', '▁This', '▁is', '▁a', '▁']
['▁.', '▁Ass', 'istant', ':', '▁This', '▁is', '▁a', '▁']
['▁.', '▁user', ':', '▁This', '▁is', '▁a', '▁']
['▁.', '▁assistant', ':', '▁This', '▁is', '▁a', '▁']
The upper-case "ASSISTANT:" takes 5 tokens, while the lower-case "assistant:" takes only 2. So why not use lower-case role names and save token space for conversation content?
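To put numbers on the comparison, here is a minimal sketch that counts the role-prefix tokens directly from the token lists printed above. It hard-codes those lists instead of loading a tokenizer, and the helper name `role_prefix_tokens` is made up for illustration.

```python
# Token lists copied verbatim from the output above.
outputs = {
    "### Human:": ['▁.', '▁###', '▁Human', ':', '▁This', '▁is', '▁a', '▁'],
    "### Assistant:": ['▁.', '▁###', '▁Ass', 'istant', ':', '▁This', '▁is', '▁a', '▁'],
    "USER:": ['▁.', '▁US', 'ER', ':', '▁This', '▁is', '▁a', '▁'],
    "ASSISTANT:": ['▁.', '▁A', 'SS', 'IST', 'ANT', ':', '▁This', '▁is', '▁a', '▁'],
    "User:": ['▁.', '▁User', ':', '▁This', '▁is', '▁a', '▁'],
    "Assistant:": ['▁.', '▁Ass', 'istant', ':', '▁This', '▁is', '▁a', '▁'],
    "user:": ['▁.', '▁user', ':', '▁This', '▁is', '▁a', '▁'],
    "assistant:": ['▁.', '▁assistant', ':', '▁This', '▁is', '▁a', '▁'],
}


def role_prefix_tokens(tokens):
    """Count the tokens after the leading '▁.' up to and including ':'."""
    colon = tokens.index(':')
    return len(tokens[1:colon + 1])


counts = {label: role_prefix_tokens(toks) for label, toks in outputs.items()}

for label, n in counts.items():
    print(f"{label:<15} {n} tokens")

# Role-label cost of one user/assistant exchange for two of the styles.
upper_pair = counts["USER:"] + counts["ASSISTANT:"]
lower_pair = counts["user:"] + counts["assistant:"]
print("upper-case pair:", upper_pair)
print("lower-case pair:", lower_pair)
```

By these counts, one USER/ASSISTANT exchange spends 8 tokens on role labels versus 4 for user/assistant, a saving of 4 tokens per exchange.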
@78 There is nothing special.
We just happened to use the upper case during our training. I guess in your fine-tuning job you can try to use a different prefix.
CC @merrymercy to comment more.
My question: why does this matter to you? Does your task require a context length longer than 2048 tokens?
Please try our latest Vicuna-13B-v1.3 or LongChat.
The issue is stale, so closing.