</s> tokenization via sentencepiece

Open vince62s opened this issue 1 year ago • 1 comments

Hi, I have a question. Here: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py#L42 in the case of Vicuna / multi-round, a '</s>' is added at the end of each response. However, I am wondering whether sentencepiece preserves the '</s>' at tokenization. I have the feeling (when doing it manually) that it is tokenized as "</" "s" ">". Am I missing something?

vince62s avatar Apr 18 '23 18:04 vince62s

Actually, the behavior seems normal according to this: https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md. It says: "Control symbols must be inserted outside of the SentencePiece segmentation."

So then, I don't understand the insertion of sep2 in plain text as mentioned above.
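
To make the concern concrete, here is a minimal toy sketch in Python. It does not call sentencepiece at all: `toy_segment`, `encode_with_control`, `EOS_PIECE`, and `EOS_ID` are hypothetical names, the id value 2 is an assumption, and the regex split is only a stand-in for real subword segmentation. It just contrasts segmenting the separator as plain text with inserting its id outside segmentation, which is how I read the quoted doc.

```python
import re

# Hypothetical control symbol and id; real values depend on the model's vocabulary.
EOS_PIECE = "</s>"
EOS_ID = 2


def toy_segment(text):
    """Stand-in for a plain-text segmenter.

    It has no notion of control symbols, so "</s>" is split
    like any other string of characters.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def encode_with_control(text):
    """Split on the control symbol first, segment only the plain-text
    chunks, and insert the control id between them, i.e. outside
    of segmentation."""
    out = []
    chunks = text.split(EOS_PIECE)
    for i, chunk in enumerate(chunks):
        out.extend(toy_segment(chunk))
        if i < len(chunks) - 1:
            out.append(EOS_ID)
    return out


# Separator passed through as plain text: it is broken into several pieces.
print(toy_segment("Hi</s>"))
# Separator handled outside segmentation: it becomes a single id.
print(encode_with_control("Hi</s>"))
```

If the code that wraps the tokenizer does something like the second function, putting sep2 in the plain text would be fine; if it only does the first, the separator would not survive as one token.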

vince62s avatar Apr 18 '23 18:04 vince62s

@vince62s seems like this is a bit stalled here. Do you still have this question? :-)

surak avatar Oct 21 '23 15:10 surak

@vince62s feel free to open again if you still have questions.

infwinston avatar Oct 21 '23 15:10 infwinston