
How to predict the probability of an empty string using BERT

Open · brienna opened this issue on Dec 28, 2021 · 1 comment

Suppose we have a template sentence like this:

  • "The ____ house is our meeting place."

and we have a list of adjectives to fill in the blank, e.g.:

  • "yellow"
  • "large"
  • ""

Note that one of these is an empty string.

The goal is to compare the probabilities to select the most likely word to describe "house" given the context of the sentence. If it's more likely to have nothing, this should also be taken into consideration.

We can predict the probability of each word filling in the blank, but how would we predict the probability that an empty string fills in the blank, i.e. the probability of there being no adjective to describe "house"?

To predict the probability of a word:

from transformers import BertTokenizer, BertForMaskedLM
import torch
from torch.nn import functional as F

# Load BERT tokenizer and pre-trained model
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertForMaskedLM.from_pretrained('bert-large-uncased', return_dict=True)

targets = ["yellow", "large"]
sentence = "The [MASK] house is our meeting place."

# Run the masked sentence through BERT to get logits over its entire vocabulary
inputs = tokenizer.encode_plus(sentence, return_tensors="pt")
mask_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]
with torch.no_grad():
    output = model(**inputs)

# Run softmax over the logits to get the probabilities
softmax = F.softmax(output.logits[0], dim=-1)

# Find the words' probabilities in this probability distribution
target_probabilities = {t: softmax[mask_index, tokenizer.vocab[t]].numpy()[0] for t in targets}
target_probabilities

This outputs a dictionary mapping each word to its probability:

{'yellow': 0.0061520976, 'large': 0.00071377633}

If I try to add an empty string to the list, I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-62-6f726220a108> in <module>
     18 
     19 # Find the words' probabilities in this probability distribution
---> 20 target_probabilities = {t: softmax[mask_index, tokenizer.vocab[t]].numpy()[0] for t in targets}
     21 target_probabilities

<ipython-input-62-6f726220a108> in <dictcomp>(.0)
     18 
     19 # Find the words' probabilities in this probability distribution
---> 20 target_probabilities = {t: softmax[mask_index, tokenizer.vocab[t]].numpy()[0] for t in targets}
     21 target_probabilities

KeyError: ''

This is because BERT's vocabulary contains no empty string, so we can't look up the probability of a token that doesn't exist in the model's vocabulary.
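
At minimum, the crash itself can be avoided (a sketch, not an answer to the underlying question): tokenizer.convert_tokens_to_ids falls back to the [UNK] id for out-of-vocabulary tokens instead of raising, so unknown targets can be detected and handled explicitly:

# Defensive lookup: convert_tokens_to_ids maps out-of-vocabulary tokens
# (including the empty string) to the [UNK] id rather than raising a KeyError
target_probabilities = {}
for t in targets + [""]:
    token_id = tokenizer.convert_tokens_to_ids(t)
    if token_id == tokenizer.unk_token_id:
        print(f"'{t}' is not in BERT's vocabulary")
        continue
    target_probabilities[t] = softmax[mask_index, token_id].numpy()[0]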

How should we get the probability of there being no word to fill in the blank? Is this possible with the model? Does it make sense to use the padding token [PAD] instead of an empty string? (I've only seen [PAD] used at the ends of sentences, to pad a batch of sentences to the same length.)
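
For instance, a minimal sketch of that idea, reusing softmax, mask_index, and tokenizer from the snippet above (whether the resulting number is comparable to the probabilities of real words is exactly what's in question):

# Score [PAD] at the masked position as a stand-in for "no adjective".
# Caveat: BERT is not trained to predict [PAD] in the middle of a sentence,
# so this probability may not be meaningful.
pad_probability = softmax[mask_index, tokenizer.pad_token_id].numpy()[0]
print(pad_probability)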

— brienna, Dec 28, 2021

Have you tried adding the padding token '[PAD]' to the list?

— seths10, Jun 27, 2022