SudachiTra
The entry of `\n` in `vocab.txt` is causing token index shifting
It seems `\n` is causing token index shifting after line 10295 in `vocab.txt`.
$ less -N vocab.txt
...
10294 ##錄
10295
10296
10297 する
Fortunately, I did not find any performance degradation in downstream tasks caused by this index shifting, but I got an error message while executing `save_pretrained()`.
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357
Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
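For anyone who wants to check a vocabulary file for this problem, here is a minimal sketch. It assumes the loader strips the trailing newline from each line and maps the resulting token to its line index, so that two blank lines collapse into one empty-string token and leave a hole in the index sequence. That is my understanding of how the BERT-style loader behaves, not something confirmed for SudachiTra's tokenizer. The function names (`load_vocab_lines`, `find_blank_lines`, `find_index_gaps`) and the toy five-line vocabulary are hypothetical and only for illustration.

```python
from collections import OrderedDict


def load_vocab_lines(lines):
    """Map each token to its 0-based line index (assumed loader behavior).

    A token that appears on several lines keeps only the last index,
    which is what leaves a gap in the index sequence.
    """
    vocab = OrderedDict()
    for index, line in enumerate(lines):
        token = line.rstrip("\n")
        vocab[token] = index
    return vocab


def find_blank_lines(lines):
    """Return 1-based line numbers (as shown by `less -N`) of empty lines."""
    return [i + 1 for i, line in enumerate(lines) if line.rstrip("\n") == ""]


def find_index_gaps(vocab):
    """Return (expected, actual) pairs where indices stop being consecutive."""
    gaps = []
    expected = 0
    for _token, index in sorted(vocab.items(), key=lambda kv: kv[1]):
        if index != expected:
            gaps.append((expected, index))
            expected = index
        expected += 1
    return gaps


# Toy stand-in for vocab.txt: a "\n" token written as a line produces
# two consecutive blank lines, like lines 10295 and 10296 above.
lines = ["[PAD]\n", "##錄\n", "\n", "\n", "する\n"]
vocab = load_vocab_lines(lines)

print(find_blank_lines(lines))  # [3, 4]
print(find_index_gaps(vocab))   # [(2, 3)]
```

To run it on the real file, the lines could be read with `open("vocab.txt", encoding="utf-8").readlines()` and passed to the same functions.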
I think line 10295 in `vocab.txt` should be some non-existent word like `!!!DIFECTED!!!`.
Also see #57.