
Different Vocab Size Between Tokenizer and Model's Word Embedding Layer

Open · louisowen6 opened this issue 4 years ago · 0 comments

Expected Behavior

The tokenizer's vocabulary size and the number of rows in the BERT model's word embedding layer should be the same.

Actual Behavior

The tokenizer's vocabulary size and the number of rows in the BERT model's word embedding layer are not the same.

Steps to Reproduce the Problem

  1. Load the model: model = AutoModel.from_pretrained('indobenchmark/indobert-base-p1')
  2. Print the model: print(model)

(screenshot: the printed model architecture, including the size of the word embedding layer)

  3. Load the tokenizer: tokenizer = AutoTokenizer.from_pretrained('indobenchmark/indobert-base-p1')
  4. Print the length of the tokenizer: print(len(tokenizer))

(screenshot: the printed tokenizer length)
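For reference, here is a minimal self-contained sketch of the comparison described in the steps above. The checkpoint name is the one from this issue; the embedding lookup uses the standard `transformers` `get_input_embeddings()` API rather than printing the whole model:

```python
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer from the same checkpoint (name taken from the issue).
model = AutoModel.from_pretrained('indobenchmark/indobert-base-p1')
tokenizer = AutoTokenizer.from_pretrained('indobenchmark/indobert-base-p1')

# Vocabulary size as seen by the tokenizer (includes any added special tokens).
tokenizer_vocab = len(tokenizer)

# Number of rows in the model's word embedding matrix.
embedding_rows = model.get_input_embeddings().num_embeddings

print(f"tokenizer vocab size:   {tokenizer_vocab}")
print(f"embedding matrix rows:  {embedding_rows}")
print(f"sizes match:            {tokenizer_vocab == embedding_rows}")
```

If the two numbers differ, a common workaround (not necessarily the fix intended by the maintainers) is to call `model.resize_token_embeddings(len(tokenizer))` so the embedding matrix matches the tokenizer before fine-tuning.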

louisowen6 · Jul 23 '21 09:07