FastChat
</s> tokenization via sentencepiece
Hi, I have a question. Here: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py#L42 in the Vicuna / multi-round case, a '</s>' is added at the end of each response. However, I am wondering whether sentencepiece preserves the '</s>' at tokenization. I have the feeling (when doing it manually) that it is tokenized as "</" "s" ">". Am I missing something?
Actually, the behavior seems normal according to this: https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md It says: " Control symbols must be inserted outside of the SentencePiece segmentation."
So then, I don't understand the insertion of sep2 in plain text as mentioned above.
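To make the distinction concrete, here is a minimal toy sketch of the two behaviors being compared. It does not use the sentencepiece API: `toy_segment`, `encode_plain`, and `encode_with_control` are hypothetical names, and the character-level split is only a stand-in for real segmentation. It assumes that some wrapper around the tokenizer splits the text on the separator first; whether FastChat's tokenizer does this is not confirmed in this thread.

```python
# Toy illustration only: hypothetical helpers, NOT the sentencepiece API.
EOS = "</s>"  # the separator discussed above (sep2 for Vicuna)


def toy_segment(text):
    """Stand-in for segmenting plain text: one piece per character.

    Like a control symbol, "</s>" is never matched as a single piece here.
    """
    return list(text.replace(" ", "\u2581"))


def encode_plain(text):
    """Segment the whole string, separator included, as ordinary text."""
    return toy_segment(text)


def encode_with_control(text, sep=EOS):
    """Split on the separator first, segment each chunk, and insert the
    control symbol outside the segmentation."""
    pieces = []
    chunks = text.split(sep)
    for i, chunk in enumerate(chunks):
        pieces.extend(toy_segment(chunk))
        if i < len(chunks) - 1:
            pieces.append(sep)  # a single piece, never broken up
    return pieces


print(encode_plain("Hi</s>"))
print(encode_with_control("Hi</s>"))
```

In the first call the separator is broken into several pieces, which matches what I saw when tokenizing manually; in the second it survives as one piece because it never goes through segmentation.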
@vince62s seems like this is a bit stalled here. Do you still have this question? :-)
@vince62s feel free to open this again if you still have a question.