
Is BERT powerful enough to learn sentence embeddings and word embeddings?

Open xiaoming-qxm opened this issue 7 years ago • 15 comments

After reading the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, I have a fundamental question that I want to figure out.

Based on my current understanding, I think the main contribution of BERT is learning sentence embeddings, i.e. capturing the internal structure of sentences, in an unsupervised way. Describing the training of the model, the authors say:

We use WordPiece embeddings (Wu et al.,2016) with a 30,000 token vocabulary. We denote split word pieces with ##.

This suggests that the word embeddings were loaded from a pre-trained model. However, in the open-source TensorFlow BERT code, the parameters of the word embedding layer are randomly initialized. This apparent inconsistency confuses me a lot.

So my question is:

Can BERT also learn powerful word embedding representations compared with the state-of-the-art word embedding algorithms?

xiaoming-qxm avatar Dec 13 '18 04:12 xiaoming-qxm

You may use bert-as-service for a quick evaluation yourself. Sentence embeddings and ELMo-like token-level embeddings are fairly easy to obtain with this service.
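For example, a minimal client-side sketch (this assumes a bert-as-service server is already running; the model path below is just a placeholder):

```python
# Start the server separately, e.g. (path is a placeholder):
#   bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12 -num_worker=1
# Use -pooling_strategy NONE on the server to get ELMo-like token-level embeddings instead.
from bert_serving.client import BertClient

bc = BertClient()  # connects to localhost by default
vecs = bc.encode(['First do it', 'then do it right', 'then do it better'])
print(vecs.shape)  # (3, 768) with the default sentence-level pooling
```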

hanxiao avatar Dec 19 '18 06:12 hanxiao

In my own understanding, the word embeddings are just a set of parameters, just like the self-attention parameters. The parameters are most useful when they are used together; if you use only the trained word embeddings on their own, they may perform poorly.

xwzhong avatar Dec 28 '18 12:12 xwzhong

Hello @daoliker ,

In my colleague's work, he replicated many SOTA NLP tasks and tried replacing all the previous word representations with BERT. Most of the tasks got significant performance improvements. He didn't try end-to-end fine-tuning on those tasks, because BERT consumes a lot of resources.

If you want to get word embeddings from BERT, I have implemented a BERT embedding library that lets you get word embeddings in a programmatic way.

https://github.com/imgarylai/bert-embedding

Because I'm working closely with the mxnet & gluonnlp team, my implementation is built on mxnet and gluonnlp. However, I am trying to implement it in other frameworks as well.
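A rough usage sketch (see the repository README for the exact API and return format):

```python
from bert_embedding import BertEmbedding

sentences = ['BERT produces contextual token embeddings.',
             'Each sentence yields one vector per word piece.']

bert_embedding = BertEmbedding()    # defaults to a pre-trained BERT base model
result = bert_embedding(sentences)  # one entry per input sentence
tokens, token_vecs = result[0]      # roughly: (list of tokens, list of per-token vectors)
```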

Hope my work can help you.

imgarylai avatar Feb 10 '19 19:02 imgarylai

Hi, after running the BERT model, I am getting an embedding for each word in a sentence, but I need the sentence embedding. How do I get that?

I tried max-pooling over all the word embeddings, but the result is not good.

abhinandansrivastava avatar Mar 27 '19 10:03 abhinandansrivastava

@abhinandansrivastava then perhaps try different pooling strategies using bert-as-service

hanxiao avatar Mar 27 '19 10:03 hanxiao

@abhinandansrivastava A naive but strong sentence embedding baseline is averaging the word embeddings.
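For example, a minimal mean-pooling sketch using the Hugging Face transformers library (the model name is just an example; padding tokens are masked out of the average):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

batch = tokenizer(['A naive but strong baseline.', 'Just average the token vectors.'],
                  padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    hidden = model(**batch).last_hidden_state         # (batch, seq_len, hidden)

mask = batch['attention_mask'].unsqueeze(-1).float()  # (batch, seq_len, 1)
sentence_vecs = (hidden * mask).sum(1) / mask.sum(1)  # masked mean over tokens
```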

ChenXi1992 avatar Apr 08 '19 10:04 ChenXi1992

@ChenXi1992 I agree about averaging embeddings, but in the case of BERT it doesn't work very well... I have tested this.

singularity014 avatar May 07 '19 08:05 singularity014

Were the WordPiece embeddings for BERT pretrained or randomly initialized when the BERT model was originally trained?

BoPengGit avatar May 24 '19 20:05 BoPengGit

@abhinandansrivastava I think you can use the [CLS] representation provided by BERT for sentence embeddings.

xdwang0726 avatar Aug 08 '19 15:08 xdwang0726

@abhinandansrivastava I think you can use the [CLS] representation provided by BERT for sentence embeddings.

I don't think that's a good idea for non-classification tasks. According to the Transformers documentation:

This output is usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence.

Also, from bert-as-service:

Because a pre-trained model is not fine-tuned on any downstream tasks yet. In this case, the hidden state of [CLS] is not a good sentence representation. If later you fine-tune the model, you may use [CLS] as well.

And I have verified this on a text matching task.
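For reference, a quick sketch of how to pull out the [CLS] hidden state with the transformers library (the model name is just an example; without fine-tuning, the mean-pooled vector is usually the stronger baseline):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

batch = tokenizer('an example sentence for text matching', return_tensors='pt')
with torch.no_grad():
    out = model(**batch)

cls_vec = out.last_hidden_state[:, 0]     # hidden state of the [CLS] token
mean_vec = out.last_hidden_state.mean(1)  # simple mean over tokens (single sentence, no padding)
```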

ksboy avatar Dec 11 '19 07:12 ksboy

Can we use BERT for Marathi text?

kirtikakde avatar Dec 26 '19 11:12 kirtikakde

@abhinandansrivastava I think you can use the [CLS] representation provided by BERT for sentence embeddings.

Can you explain more about what [CLS] captures? Why is the alternative not preferred, for instance taking the embeddings of the other tokens as well and reshaping or pooling them based on the use case? More specifically, it would be great if someone could point me to what exactly [CLS] picks up from the sentence that helps it represent the sentence fully.

rahulkrishnan98 avatar Feb 25 '20 09:02 rahulkrishnan98

Hi, thank you for this work, I think it's great. However, I have trouble getting embeddings for my texts, which are always very long (around 20k characters). I notice that max_seq_len 512 is the maximum; is there any method to get embeddings for such long texts? P.S. Currently I split the text into short pieces (each shorter than 512) and average the embeddings of all the pieces, but the result is not good. How should I do this task? I am confused.
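For reference, here is a simplified sketch of the chunk-and-average approach I am currently using (with the Hugging Face transformers library; the helper name and model are just placeholders):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def embed_long_text(text, window=510):
    """Split a long text into <=512-token chunks and average the chunk embeddings."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    chunk_vecs = []
    for start in range(0, len(ids), window):
        chunk = [cls_id] + ids[start:start + window] + [sep_id]       # re-add special tokens
        with torch.no_grad():
            hidden = model(torch.tensor([chunk])).last_hidden_state   # (1, len, hidden)
        chunk_vecs.append(hidden.mean(1))   # mean over tokens in this chunk
    return torch.cat(chunk_vecs).mean(0)    # average over all chunks
```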

Chandler-Bing avatar Jun 24 '20 09:06 Chandler-Bing

Hi, thank you for this work, I think it's great. However, I have trouble getting embeddings for my texts, which are always very long (around 20k characters). I notice that max_seq_len 512 is the maximum; is there any method to get embeddings for such long texts? P.S. Currently I split the text into short pieces (each shorter than 512) and average the embeddings of all the pieces, but the result is not good. How should I do this task? I am confused.

You could write PositionalEncoding yourself in order to customize the sequence length.
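For example, a standard sinusoidal positional encoding can be built for an arbitrary maximum length (a generic sketch following Vaswani et al., 2017; note that BERT itself uses learned position embeddings, so the pre-trained checkpoints are still limited to 512 positions):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding with a configurable max_len."""
    def __init__(self, d_model, max_len=4096):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]
```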

xdwang0726 avatar Jun 24 '20 20:06 xdwang0726

Hi, after running the BERT model, I am getting an embedding for each word in a sentence, but I need the sentence embedding. How do I get that?

I tried max-pooling over all the word embeddings, but the result is not good.

Excuse me, how did you get the embedding for a word (not a sentence) using BERT, please?

mathshangw avatar Jan 09 '22 08:01 mathshangw