
Question: What does "pooler layer" mean? Why is it called a pooler?

miyamonz opened this issue 4 years ago · 6 comments

This question is just about the term "pooler", and maybe more of an English question than a question about BERT.

By reading this repository and its issues, I found that the "pooler layer" is placed after the stack of Transformer encoders, and that it changes depending on the training task. But I can't understand why it is called "pooler".

I googled about the word "pooler" and "pooler layer", and it seems that this is not ML terminology.

BTW, the pooling layer that appears in CNNs is a similar term, but it seems to be a different thing.

miyamonz avatar Jun 11 '20 08:06 miyamonz

I agree that the name pooler might be a little confusing. The BERT model can be divided into three parts to understand it more easily:

  1. Embedding layer: Gets the embeddings from one-hot encodings of the words
  2. Encoder: This is the Transformer with self-attention heads
  3. Pooler: It takes the output representation corresponding to the first token and uses it for downstream tasks

In the BERT paper, after passing a sentence through the model, the output representation corresponding to the first token is used for fine-tuning on tasks like SQuAD and GLUE. So the pooler layer does precisely that: it applies a linear transformation to the representation of the first token. The linear transformation is trained using the Next Sentence Prediction (NSP) objective.
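For concreteness, here is a minimal NumPy sketch of what the pooler computes; the weight and bias arrays are stand-ins for the learned dense-layer parameters (in the repo's code, this is a dense layer with a tanh activation applied to the first token's hidden state):

```python
import numpy as np

def pooler(sequence_output, weights, bias):
    """Sketch of the pooler: pick the first token, apply a dense layer + tanh.

    sequence_output: [batch_size, seq_length, hidden_size] encoder output
    weights:         [hidden_size, hidden_size] stand-in dense-layer weights
    bias:            [hidden_size] stand-in dense-layer bias
    """
    first_token = sequence_output[:, 0, :]        # [batch_size, hidden_size]
    return np.tanh(first_token @ weights + bias)  # [batch_size, hidden_size]

batch_size, seq_length, hidden_size = 2, 8, 768
sequence_output = np.random.randn(batch_size, seq_length, hidden_size)
W = np.random.randn(hidden_size, hidden_size) * 0.02  # stand-in learned weights
b = np.zeros(hidden_size)                              # stand-in learned bias
pooled = pooler(sequence_output, W, b)
print(pooled.shape)  # (2, 768)
```

During fine-tuning, a task-specific classifier sits on top of this pooled [batch_size, hidden_size] vector.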

ameet-1997 avatar Jun 14 '20 14:06 ameet-1997

I think it's OK to call it a "pooler" layer.

This layer transforms the output of the Transformer from shape [batch_size, seq_length, hidden_size] to [batch_size, hidden_size]. This is similar to GlobalMaxPool1D, except that instead of max-pooling it simply takes the first token's vector directly.

So, functionally speaking, it is a form of pooling.
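To make the comparison concrete, here is a small NumPy sketch (random data, shapes only) contrasting GlobalMaxPool1D-style pooling with what the BERT pooler's slicing does:

```python
import numpy as np

# Encoder output: [batch_size, seq_length, hidden_size]
hidden = np.random.randn(4, 16, 768)

# GlobalMaxPool1D-style pooling: elementwise max over the sequence axis.
max_pooled = hidden.max(axis=1)       # [4, 768]

# BERT-style "pooling": just take the first token's vector.
first_token_pooled = hidden[:, 0, :]  # [4, 768]

print(max_pooled.shape, first_token_pooled.shape)  # (4, 768) (4, 768)
```

Both reduce the sequence axis, which is why calling it a pooler still makes sense even though no max or average is involved.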

secsilm avatar Jun 16 '20 03:06 secsilm

> The linear transformation is trained using the Next Sentence Prediction (NSP) objective.

Hi, I have a question about this NSP purpose. Since the pooler is used for downstream tasks like sentence classification, is it helpful to use a pooler that was trained for predicting the next sentence? The task now is to predict a label...

Thanks

guoxuxu avatar Sep 18 '20 02:09 guoxuxu

@secsilm I understand it might be doing some kind of GlobalMaxPool1D. However, do you know what exact algorithm they use to reduce that dimension? I am afraid they are using "max", as in GlobalMaxPool1D.

Thanks

amandalmia14 avatar Mar 28 '21 07:03 amandalmia14

> @secsilm I understand it might be doing some kind of GlobalMaxPool1D. However, do you know what exact algorithm they use to reduce that dimension? I am afraid they are using "max", as in GlobalMaxPool1D.
>
> Thanks

Not max. They just use the vector of the first token to represent the whole sequence.

secsilm avatar Mar 29 '21 00:03 secsilm

Correct. For most tasks, the first token is a special token (such as [CLS] for classification tasks). This is why tokens like [CLS] are a thing.
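As a small illustration (a made-up single-sentence classification input), the vector at position 0, which the pooler slices out, is exactly the one computed for [CLS]:

```python
# Hypothetical tokenized input for a single-sentence classification task.
# BERT's input always begins with the special [CLS] token.
tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]

# After the encoder, sequence_output[:, 0, :] is the hidden state for
# tokens[0], i.e. [CLS] -- the vector the pooler transforms.
print(tokens.index("[CLS]"))  # 0
```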

MonliH avatar Nov 23 '22 00:11 MonliH