
[Question]: How does .embed(Sentence) work under the hood?

Open teoML opened this issue 6 months ago • 3 comments

Question

Hi, can someone explain how the following code works under the hood:

 from flair.data import Sentence
 from flair.embeddings import TransformerDocumentEmbeddings

 # load a BERT model
 embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased")

 # create a sentence
 sentence = Sentence('The grass is green. The roses are red.')

 # embed the sentence as a single document vector
 embedding.embed(sentence)

 print(sentence.embedding)

When we call embed on a Sentence object, does it transform each token into a vector and then average those vectors to produce the final document embedding, or does it use another strategy? For example, in my sentence (or document) above, would the resulting vector be (embedding of the token "the" + embedding of the token "grass" + ... + embedding of the token "red") / 8, computed dimension-wise?
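
In code, the averaging strategy I have in mind would look roughly like this (just a sketch using TransformerWordEmbeddings to get per-token vectors; I don't know whether flair actually does this internally):

import torch
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# embed each token individually
word_embedding = TransformerWordEmbeddings("dbmdz/bert-base-german-uncased")
sentence = Sentence('The grass is green. The roses are red.')
word_embedding.embed(sentence)

# stack the per-token vectors and average them dimension-wise
token_vectors = torch.stack([token.embedding for token in sentence])
mean_vector = token_vectors.mean(dim=0)
print(mean_vector.shape)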

Maybe @alanakbik could answer? Thank you!

teoML avatar Feb 21 '24 21:02 teoML

Hi @teoML According to the BERT paper, each sentence has a [CLS] token whose embedding represents the sentence for classification tasks. Flair's default behavior uses this as well. However, if you want to change that, have a look at the cls_pooling parameter in the docs. The other pooling strategies apply the respective function (e.g. mean or max) to the embeddings of the individual tokens.
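
For example, switching the pooling strategy looks like this (parameter name and values as described in the docs; the model is just the one from your snippet):

from flair.embeddings import TransformerDocumentEmbeddings

# default: the [CLS] token embedding represents the document
cls_embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", cls_pooling="cls")

# alternative: mean-pool the individual token embeddings instead
mean_embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", cls_pooling="mean")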

helpmefindaname avatar Mar 01 '24 09:03 helpmefindaname

Hi @helpmefindaname , thank you for answering my question! I just tried changing the pooling type, but I get the same embedding for the same sentence:

embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_polling="mean")

embedding_2 = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_polling="cls")

# create a sentence
sentence1 = Sentence('The grass is green. The roses are red.')
sentence2 = Sentence('The grass is green. The roses are red.')

embedding.embed(sentence1)
embedding_2.embed(sentence2)

a = sentence1.embedding
b = sentence2.embedding

print(a == b)

Each element of the resulting 768-dimensional vector is True, which means the embeddings of both sentences are identical (although the pooling type is different). Is that considered normal behaviour? I also tried "max" and again got equal vectors. It would be nice if you could help me out; my goal is to create an embedding for a document that consists of around 100 sentences.

teoML avatar Mar 04 '24 16:03 teoML

The parameter is cls_pooling, not cls_polling, hence both embeddings are using "cls" by default.
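
With the corrected parameter name, the two pooling strategies should produce different vectors, e.g.:

import torch
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# note: cls_pooling, not cls_polling
embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_pooling="mean")
embedding_2 = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_pooling="cls")

sentence1 = Sentence('The grass is green. The roses are red.')
sentence2 = Sentence('The grass is green. The roses are red.')

embedding.embed(sentence1)
embedding_2.embed(sentence2)

# mean pooling vs. [CLS] pooling: the vectors should no longer match
print(torch.equal(sentence1.embedding, sentence2.embedding))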

helpmefindaname avatar Mar 29 '24 13:03 helpmefindaname