Discrepancy in tokenization results between ALBERT's tokenizer and the sentencepiece library
Hi -
I recently noticed that the tokenized results from ALBERT's tokenizer implementation and the sentencepiece library differ for some inputs. See below:
SentencePiece Implementation
!pip install sentencepiece

import sentencepiece as spm

# Load the SentencePiece model directly and encode the input.
sp = spm.SentencePieceProcessor()
sp.load('<SPM_MODEL>')

print(sp.encode_as_pieces('3.0,'))
print(sp.encode_as_ids('3.0,'))
Output:
['▁3.0,']
[72369]
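As a sanity check (assuming the same model file as above), sentencepiece's id_to_piece maps the id back to the piece, confirming the whole string exists in the vocabulary as a single unit:

print(sp.id_to_piece(72369))  # '▁3.0,' per the output above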
Using ALBERT
pip install sentencepiece
git clone https://github.com/google-research/albert.git
# then, from inside the cloned albert/ directory:
>>> import tokenization
>>> spm_tokenizer = tokenization.FullTokenizer(vocab_file=<VOCAB_FILE>, spm_model_file=<SPM_MODEL_FILE>)
>>> spm_tokenizer.convert_tokens_to_ids(spm_tokenizer.tokenize("3.0,"))
Output:
[16047, 254713]
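To see which pieces those ids correspond to, they can be mapped back, assuming FullTokenizer exposes convert_ids_to_tokens like its BERT counterpart:

>>> spm_tokenizer.convert_ids_to_tokens([16047, 254713])  # presumably something like ['▁3.0', ','], given the splitting described below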
After looking at ALBERT's tokenizer implementation, I see that the if condition here is what leads to the difference in the outputs above: https://github.com/google-research/albert/blob/master/tokenization.py#L67
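For context while waiting for an answer, my reading of the block around that line, simplified to Python 3 (and omitting the unicode handling in the original), is roughly the following: any piece that ends in a digit followed by a comma is re-encoded without the trailing comma, and the comma is re-attached as its own piece.

SPIECE_UNDERLINE = u"▁"

def split_digit_comma_pieces(sp_model, pieces):
    # Simplified paraphrase of the special case in albert/tokenization.py.
    new_pieces = []
    for piece in pieces:
        if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
            # Re-encode e.g. "3.0" on its own instead of keeping "▁3.0," whole.
            cur_pieces = sp_model.encode_as_pieces(
                piece[:-1].replace(SPIECE_UNDERLINE, ""))
            # Drop a spurious leading "▁" if the original piece did not start with one.
            if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
                if len(cur_pieces[0]) == 1:
                    cur_pieces = cur_pieces[1:]
                else:
                    cur_pieces[0] = cur_pieces[0][1:]
            # Re-attach the comma as a standalone piece.
            cur_pieces.append(piece[-1])
            new_pieces.extend(cur_pieces)
        else:
            new_pieces.append(piece)
    return new_pieces

If that reading is right, '▁3.0,' is deliberately broken into the number plus a standalone comma, which would account for the two ids above rather than the single id 72369.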
Could you explain the intuition behind these additional steps in ALBERT's tokenizer and what purpose they serve here?
Thanks!