SudachiTra

The entry of `\n` in `vocab.txt` is causing token index shifting

Open hiroshi-matsuda-rit opened this issue 1 year ago • 0 comments

It seems that a `\n` entry is shifting the token indices after line 10295 in `vocab.txt`.

$ less -N vocab.txt
...
  10294 ##錄
  10295 
  10296 
  10297 する

Fortunately, I did not find any performance degradation in downstream tasks caused by this index shift, but I got the following error message while executing `save_pretrained()` (see https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357):

Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
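To make the mechanism concrete, here is a minimal sketch of how two blank lines could produce a gap in the indices. It assumes a loader that strips the trailing newline from each line and lets a later duplicate token overwrite an earlier one. `load_vocab` and `first_index_gap` are simplified stand-ins written for this illustration, not the library's code, and the five-line sample is made up.

```python
import io
from collections import OrderedDict


def load_vocab(reader):
    """Map each line's token to its 0-based line index (assumed last-wins behaviour)."""
    vocab = OrderedDict()
    for index, line in enumerate(reader.readlines()):
        # A blank line becomes the empty-string token.
        vocab[line.rstrip("\n")] = index
    return vocab


def first_index_gap(vocab):
    """Return the first (expected, actual) index pair that differs, or None."""
    expected = 0
    for _token, actual in sorted(vocab.items(), key=lambda kv: kv[1]):
        if actual != expected:
            return expected, actual
        expected += 1
    return None


# Hypothetical miniature vocabulary with two consecutive blank lines.
sample = "[PAD]\n##錄\n\n\nする\n"
vocab = load_vocab(io.StringIO(sample))

# Both blank lines map to the same token "", so one index is never assigned.
print(first_index_gap(vocab))
```

Under these assumptions the two blank lines collapse into a single `""` entry, which leaves one index unused and would trip a consecutive-index check when the vocabulary is written back.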

I think line 10295 in `vocab.txt` should be replaced with some non-existent word such as `!!!DIFECTED!!!`.
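A rough sketch of that kind of patch, assuming the vocabulary is read as one token per line: every earlier duplicate is replaced with a unique placeholder and the last occurrence is kept, so a token keeps the index that a last-wins loader would already have assigned to it. `patch_duplicates` is a hypothetical helper for illustration, not part of SudachiTra or transformers.

```python
def patch_duplicates(tokens, placeholder="!!!DIFECTED!!!"):
    """Replace all but the last occurrence of each repeated token with a unique placeholder."""
    last_index = {token: i for i, token in enumerate(tokens)}
    used = set(tokens)
    patched = []
    for i, token in enumerate(tokens):
        if last_index[token] == i:
            # Last (or only) occurrence: keep the token unchanged.
            patched.append(token)
            continue
        # Earlier duplicate: pick a placeholder that is not already in use.
        candidate, n = placeholder, 0
        while candidate in used:
            n += 1
            candidate = f"{placeholder}{n}"
        used.add(candidate)
        patched.append(candidate)
    return patched


# Hypothetical miniature vocabulary: "" appears twice, as in the listing above.
tokens = ["[PAD]", "##錄", "", "", "する"]
print(patch_duplicates(tokens))
```

For a real file, the token list could be built with `line.rstrip("\n")` over the lines of `vocab.txt` and written back one token per line. The line count stays the same, so no other index moves.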

Also see #57.

hiroshi-matsuda-rit · May 08 '23 08:05