
Question: What does "pooler layer" mean? Why is it called a pooler?

miyamonz opened this issue 4 years ago · 6 comments

This question is just about the term "pooler", and maybe more of an English question than a question about BERT.

By reading this repository and its issues, I found that the "pooler layer" is placed after the stack of Transformer encoders, and that it changes depending on the training task. But I can't understand why it is called "pooler".

I googled about the word "pooler" and "pooler layer", and it seems that this is not ML terminology.

BTW, the pooling layer that appears in CNNs is a similar term, but it seems to be a different thing.

miyamonz avatar Jun 11 '20 08:06 miyamonz

I agree that the name pooler might be a little confusing. The BERT model can be divided into three parts to understand it more easily:

  1. Embedding layer: Gets the embeddings from one-hot encodings of the words
  2. Encoder: This is the Transformer with self-attention heads
  3. Pooler: It takes the output representation corresponding to the first token and uses it for downstream tasks

In the BERT paper, after passing a sentence through the model, the output representation corresponding to the first token is used for fine-tuning on tasks like SQuAD and GLUE. So the pooler layer does precisely that: it applies a linear transformation to the representation of the first token. The linear transformation is trained using the Next Sentence Prediction (NSP) objective.
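For concreteness, here is a minimal NumPy sketch of what the pooler computes; the weight and bias arrays are stand-ins for the learned dense-layer parameters (in the repo's code, this is a dense layer with a tanh activation applied to the first token's hidden state):

```python
import numpy as np

def pooler(sequence_output, weights, bias):
    """Sketch of the pooler: pick the first token, apply a dense layer + tanh.

    sequence_output: [batch_size, seq_length, hidden_size] encoder output
    weights:         [hidden_size, hidden_size] stand-in dense-layer weights
    bias:            [hidden_size] stand-in dense-layer bias
    """
    first_token = sequence_output[:, 0, :]        # [batch_size, hidden_size]
    return np.tanh(first_token @ weights + bias)  # [batch_size, hidden_size]

batch_size, seq_length, hidden_size = 2, 8, 768
sequence_output = np.random.randn(batch_size, seq_length, hidden_size)
W = np.random.randn(hidden_size, hidden_size) * 0.02  # stand-in learned weights
b = np.zeros(hidden_size)                              # stand-in learned bias
pooled = pooler(sequence_output, W, b)
print(pooled.shape)  # (2, 768)
```

During fine-tuning, a task-specific classifier sits on top of this pooled [batch_size, hidden_size] vector.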

ameet-1997 avatar Jun 14 '20 14:06 ameet-1997

I think it's OK to call it a "pooler" layer.

This layer transforms the output of the Transformer from shape [batch_size, seq_length, hidden_size] to [batch_size, hidden_size]. This is similar to GlobalMaxPool1D, except that instead of max-pooling it simply takes the first token's vector directly.

So, functionally speaking, it is a form of pooling.
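To make the comparison concrete, here is a small NumPy sketch (random data, shapes only) contrasting GlobalMaxPool1D-style pooling with what the BERT pooler's slicing does:

```python
import numpy as np

# Encoder output: [batch_size, seq_length, hidden_size]
hidden = np.random.randn(4, 16, 768)

# GlobalMaxPool1D-style pooling: elementwise max over the sequence axis.
max_pooled = hidden.max(axis=1)       # [4, 768]

# BERT-style "pooling": just take the first token's vector.
first_token_pooled = hidden[:, 0, :]  # [4, 768]

print(max_pooled.shape, first_token_pooled.shape)  # (4, 768) (4, 768)
```

Both reduce the sequence axis, which is why calling it a pooler still makes sense even though no max or average is involved.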

secsilm avatar Jun 16 '20 03:06 secsilm

> The linear transformation is trained using the Next Sentence Prediction (NSP) objective.

Hi, I have a question about this NSP purpose. Since the pooler is used for downstream tasks like sentence classification, is it helpful to use a pooler that was trained for predicting the next sentence? The task now is to predict a label...

Thanks

guoxuxu avatar Sep 18 '20 02:09 guoxuxu

@secsilm I understand it might be doing some kind of GlobalMaxPool1D. However, do you know what exact algorithm they use to reduce that dimension? I am afraid they are using "max", as in GlobalMaxPool1D.

Thanks

amandalmia14 avatar Mar 28 '21 07:03 amandalmia14

> @secsilm I understand it might be doing some kind of GlobalMaxPool1D. However, do you know what exact algorithm they use to reduce that dimension? I am afraid they are using "max", as in GlobalMaxPool1D.
>
> Thanks

Not max. They just use the vector of the first token to represent the whole sequence.

secsilm avatar Mar 29 '21 00:03 secsilm

Correct. For most tasks, the first token is a special token (such as [CLS] for classification tasks). This is why tokens like [CLS] are a thing.
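As a small illustration (a made-up single-sentence classification input), the vector at position 0, which the pooler slices out, is exactly the one computed for [CLS]:

```python
# Hypothetical tokenized input for a single-sentence classification task.
# BERT's input always begins with the special [CLS] token.
tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]

# After the encoder, sequence_output[:, 0, :] is the hidden state for
# tokens[0], i.e. [CLS] -- the vector the pooler transforms.
print(tokens.index("[CLS]"))  # 0
```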

MonliH avatar Nov 23 '22 00:11 MonliH