Discrepancy in tokenization results between ALBERT's tokenizer and the sentencepiece library
Hi -
I recently noticed that the tokenized results from ALBERT's tokenizer implementation and the sentencepiece library differ for some inputs. See below:
SentencePiece Implementation
!pip install sentencepiece

import sentencepiece as spm

# Load the SentencePiece model directly and encode the input.
sp = spm.SentencePieceProcessor()
sp.load('<SPM_MODEL>')

print(sp.encode_as_pieces('3.0,'))
print(sp.encode_as_ids('3.0,'))
Output:
['▁3.0,']
[72369]
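As a sanity check (assuming the same model file as above), sentencepiece's id_to_piece maps the id back to the piece, confirming the whole string exists in the vocabulary as a single unit:

print(sp.id_to_piece(72369))  # '▁3.0,' per the output above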
Using ALBERT
pip install sentencepiece
git clone https://github.com/google-research/albert.git
# then, from inside the cloned albert/ directory:
>>> import tokenization
>>> spm_tokenizer = tokenization.FullTokenizer(vocab_file=<VOCAB_FILE>, spm_model_file=<SPM_MODEL_FILE>)
>>> spm_tokenizer.convert_tokens_to_ids(spm_tokenizer.tokenize("3.0,"))
Output:
[16047, 254713]
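To see which pieces those ids correspond to, they can be mapped back, assuming FullTokenizer exposes convert_ids_to_tokens like its BERT counterpart:

>>> spm_tokenizer.convert_ids_to_tokens([16047, 254713])  # presumably something like ['▁3.0', ','], given the splitting described below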
After looking at ALBERT's tokenizer implementation, I see that the if condition here is what leads to the difference in the outputs above: https://github.com/google-research/albert/blob/master/tokenization.py#L67
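For context while waiting for an answer, my reading of the block around that line, simplified to Python 3 (and omitting the unicode handling in the original), is roughly the following: any piece that ends in a digit followed by a comma is re-encoded without the trailing comma, and the comma is re-attached as its own piece.

SPIECE_UNDERLINE = u"▁"

def split_digit_comma_pieces(sp_model, pieces):
    # Simplified paraphrase of the special case in albert/tokenization.py.
    new_pieces = []
    for piece in pieces:
        if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
            # Re-encode e.g. "3.0" on its own instead of keeping "▁3.0," whole.
            cur_pieces = sp_model.encode_as_pieces(
                piece[:-1].replace(SPIECE_UNDERLINE, ""))
            # Drop a spurious leading "▁" if the original piece did not start with one.
            if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
                if len(cur_pieces[0]) == 1:
                    cur_pieces = cur_pieces[1:]
                else:
                    cur_pieces[0] = cur_pieces[0][1:]
            # Re-attach the comma as a standalone piece.
            cur_pieces.append(piece[-1])
            new_pieces.extend(cur_pieces)
        else:
            new_pieces.append(piece)
    return new_pieces

If that reading is right, '▁3.0,' is deliberately broken into the number plus a standalone comma, which would account for the two ids above rather than the single id 72369.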
Could you explain the intuition behind these additional steps in ALBERT's tokenizer and what purpose they serve here?
Thanks!