Shawn Xu

Results 2 comments of Shawn Xu

I checked the WuDao dataset and found some irregular question marks in the text. Could this be the cause of the problem? ![image](https://user-images.githubusercontent.com/37136730/228228196-5c6ecafd-9fa4-41d2-a6d9-bcaea956cf0a.png)

OK. The reason is that the trained tokenizer encounters characters it has never seen, such as "岿", during pretraining. Maybe the vocabulary of GLM10bchinese is not big enough.
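
In case it helps anyone else hitting this, here is a rough sketch of how one might look for such characters before pretraining. It does not use the GLM tokenizer at all: `find_uncovered_chars`, `toy_vocab`, and `toy_corpus` are made-up names and data for illustration, and it assumes a character counts as "seen" if it appears inside at least one vocabulary token, which is only an approximation (a tokenizer with byte fallback would behave differently).

```python
from collections import Counter


def find_uncovered_chars(lines, vocab):
    """Count non-whitespace characters that appear in no vocabulary token."""
    # Collect every character that occurs inside at least one token.
    covered = set()
    for token in vocab:
        covered.update(token)

    # Tally corpus characters that fall outside that set.
    missing = Counter()
    for line in lines:
        for ch in line:
            if not ch.isspace() and ch not in covered:
                missing[ch] += 1
    return missing


# Toy stand-ins (hypothetical): a real check would load the actual
# tokenizer vocabulary and stream the actual corpus files.
toy_vocab = {"山", "高", "大山"}
toy_corpus = ["山高", "岿然 大山", "岿"]

missing = find_uncovered_chars(toy_corpus, toy_vocab)
print(missing.most_common())
```

If the list of missing characters is long or they are frequent, that would support the guess that the vocabulary is too small for this corpus.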