
The T5 tokenize process seems to have a bug?

Open Fu-Dayuan opened this issue 1 year ago • 1 comment

When tokenizing "阅读者", the result contains no pad token (290); there are only three tokens: 阅读, 者, and the end-of-sequence token. I have not seen this bug in any other example.
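For reference, a minimal sketch to reproduce and inspect the result, assuming the tokenizer in question is the IDEA-CCNL/Randeng-T5-784M-QA-Chinese checkpoint referenced in the reply below; the exact ids may differ from the ones quoted here:

>>> from transformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained("IDEA-CCNL/Randeng-T5-784M-QA-Chinese")
>>> tokenizer.tokenize("阅读者")                                 # token strings produced by the vocabulary
>>> tokenizer.encode("阅读者")                                   # corresponding ids, with the eos token appended
>>> tokenizer.convert_ids_to_tokens(tokenizer.encode("阅读者"))  # map the ids back to their token strings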

Fu-Dayuan, Jun 13 '23

Could you share the example code you are using?

On my side, testing with the following code (using max_length and padding) looks normal.

>>> from transformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained("IDEA-CCNL/Randeng-T5-784M-QA-Chinese")
>>> tokenizer.encode("阅读者",max_length=100, padding='max_length')
[11622, 1290, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
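As a quick sanity check on the special-token ids (a sketch assuming the same checkpoint): in the output above the sequence ends with id 1 and is padded with id 0, which matches the usual T5 convention, and the attributes below let you confirm it directly.

>>> tokenizer.pad_token_id                                  # expected to equal the trailing fill value above
>>> tokenizer.eos_token_id                                  # expected to equal the id appended after the text
>>> tokenizer.convert_ids_to_tokens([11622, 1290, 1, 0])    # inspect the token strings behind those ids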

Desein-Yang, Sep 04 '23