
What's the meaning of TextEncoder.BERT_SPECIAL_COUNT and TextEncoder.BERT_UNUSED_COUNT?

ChiuHsin opened this issue on Jan 14 '19 · 4 comments

When I use BERT-keras, I don't understand this part:

```python
class TextEncoder:
    PAD_OFFSET = 0
    MSK_OFFSET = 1
    BOS_OFFSET = 2
    DEL_OFFSET = 3  # delimiter
    EOS_OFFSET = 4
    SPECIAL_COUNT = 5
    NUM_SEGMENTS = 2
    BERT_UNUSED_COUNT = 99  # bert pretrained models
    BERT_SPECIAL_COUNT = 4  # they don't have DEL
```

Why is it set up like this? And how are `BERT_UNUSED_COUNT = 99` and `BERT_SPECIAL_COUNT = 4` used in `load_google_bert`?

ChiuHsin · Jan 14 '19

Hi, there are some special tokens in the vocabulary (for example, BOS stands for Beginning Of Sentence), and we can put them either at the beginning of the lookup table (the embedding) or at the end. I decided to put them at the end, hence the `vocab_size + *_OFFSET` indexing in `load_google_bert`. As for `BERT_UNUSED_COUNT`, you can check the vocab files in the pretrained BERT models.
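For reference, this is roughly how the top of Google's vocab.txt is laid out for the uncased English base model, which is where both constants come from:

```python
# Layout at the top of Google's vocab.txt (uncased English base model):
bert_special_rows = {
    0: "[PAD]",
    # rows 1-99 are [unused0] ... [unused98]  -> BERT_UNUSED_COUNT = 99
    100: "[UNK]",
    101: "[CLS]",
    102: "[SEP]",   # BERT has no separate delimiter; [SEP] plays DEL's role
    103: "[MASK]",  # hence BERT_SPECIAL_COUNT = 4, vs SPECIAL_COUNT = 5 here
}
```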

Separius · Jan 14 '19

Ah, you might be confused by their usage, right? Say you want to feed a sentence into your network: you should add the BOS and EOS tokens to your sentence, so you need to know their locations in the embedding table.
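Something like this minimal sketch (the helper is made up for illustration, and it assumes the special ids sit right after the regular vocabulary, matching the `vocab_size + OFFSET` indexing in `load_google_bert`):

```python
BOS_OFFSET, EOS_OFFSET = 2, 4  # from the TextEncoder class above

def add_sentence_markers(token_ids, vocab_size):
    # Hypothetical helper: wrap an encoded sentence with BOS/EOS so the
    # model can see where it starts and ends. Assumes the special ids
    # occupy the rows right after the regular vocabulary.
    bos_id = vocab_size + BOS_OFFSET
    eos_id = vocab_size + EOS_OFFSET
    return [bos_id] + token_ids + [eos_id]

print(add_sentence_markers([17, 42, 7], vocab_size=30000))
# -> [30002, 17, 42, 7, 30004]
```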

Separius · Jan 14 '19

I see, but when I load a Google model with `load_google_bert`, the vocabulary size is computed as `vocab_size = vocab_size - TextEncoder.BERT_SPECIAL_COUNT - TextEncoder.BERT_UNUSED_COUNT`, and the indices don't match: when `w_id == 2`, the line `weights[w_id][vocab_size + TextEncoder.EOS_OFFSET] = saved[3 + TextEncoder.BERT_UNUSED_COUNT]` cannot load the weight.
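To make the index arithmetic concrete, here is a rough sketch with the uncased English base model's numbers (not the repo's actual code):

```python
BERT_UNUSED_COUNT, BERT_SPECIAL_COUNT = 99, 4
SPECIAL_COUNT, EOS_OFFSET = 5, 4

saved_vocab = 30522  # embedding rows in Google's uncased base checkpoint
vocab_size = saved_vocab - BERT_SPECIAL_COUNT - BERT_UNUSED_COUNT  # 30419

src = 3 + BERT_UNUSED_COUNT        # 102, the [SEP] row in vocab.txt
dst = vocab_size + EOS_OFFSET      # 30423, the rebuilt EOS row
rows = vocab_size + SPECIAL_COUNT  # 30424 rows in the rebuilt embedding
print(src, dst, rows)  # 102 30423 30424; dst must stay below rows
```

If any of these constants is off for a given checkpoint, that assignment is exactly where the load fails.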

ChiuHsin · Jan 16 '19

@ChiuHsin I guess you are right, and it seems you were able to solve it (based on the other issue you posted). Can you please send a pull request to correct this problem? Thanks!

Separius · Feb 02 '19