
BERTScore can match contextualized embeddings of `[SEP]`/`[CLS]` tokens


During the IDF dict calculation, the weight associated with special tokens is zeroed:

https://github.com/Tiiiger/bert_score/blob/dbcf6db37e8bd6ff68446f06b0ba5d0763b62d20/bert_score/score.py#L243-L246
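
For context, the linked lines amount to roughly the following (paraphrased; variable names may differ slightly from the actual score.py):

from collections import defaultdict

idf_dict = defaultdict(lambda: 1.0)
# the IDF weight of the special tokens is zeroed here...
idf_dict[tokenizer.sep_token_id] = 0
idf_dict[tokenizer.cls_token_id] = 0
# ...but nothing masks the special-token rows/columns of the similarity matrix itself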

However, to my understanding of the code, this weight never actually prevents a non-special token embedding from being matched with a `[SEP]` or `[CLS]` token embedding.

I noticed this because I was obtaining different recall/precision values for certain pairs with a custom implementation. The difference disappears if I stop masking pairs involving a special token in the cosine similarity matrix.

That code looks something like:

# token_masks marks which tokens are not special; select the masks for the ref/hyp token ids
ref_mask = self._select_by_tokens(token_masks, ref_tokens)
hyp_mask = self._select_by_tokens(token_masks, hyp_tokens)

# similarity_matrix is (batch, ref_len, hyp_len):
# zero out rows for special ref tokens and columns for special hyp tokens,
# so [SEP]/[CLS] embeddings can never win the greedy matching
similarity_matrix[~ref_mask, :] = 0.0
similarity_matrix.transpose(1, 2)[~hyp_mask, :] = 0.0
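
The greedy matching that then consumes this matrix is, roughly (a minimal sketch for a single ref/hyp pair, ignoring IDF weighting):

# recall: each reference token takes its best match among hypothesis tokens;
# precision: the same with the roles swapped
recall = similarity_matrix.max(dim=2).values[ref_mask].mean()
precision = similarity_matrix.max(dim=1).values[hyp_mask].mean()
# without the masking above, the max over a row can land on the [SEP]/[CLS] column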

Testing with no IDF, using google-bert/bert-base-uncased at layer 12 (not a particularly thought-out choice; it's just for the repro), the following pair of sentences reproduces the issue (a repro sketch follows the pair):

  • ref: "WE'LL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HAVE A REGULAR HOUSE CLEANING"
  • hyp: "WILL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HALF A REGULAR HOUSE CLEANING"
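
Something like this should reproduce the unmasked bert-score numbers through the public API (untested sketch; arguments as I understand them):

from bert_score import score

ref = ["WE'LL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HAVE A REGULAR HOUSE CLEANING"]
hyp = ["WILL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HALF A REGULAR HOUSE CLEANING"]

# no IDF, bert-base-uncased at layer 12, as described above
P, R, F1 = score(hyp, ref, model_type="google-bert/bert-base-uncased", num_layers=12, idf=False)
print(R.item())  # ~0.823, i.e. the recall obtained when special tokens are not masked out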

With my implementation, greedy selection through the matrix shows a difference at the 2nd (non-special) token:

  • with masking disabled: 0.70251393, 0.95448172, 0.45837021, ..., resulting in a recall of 0.82332665 (matches bert-score)
  • with masking enabled: 0.70251393, 0.18742326, 0.45837021, ..., resulting in a recall of 0.78071225

Inspecting the cosine similarity matrix indicates that 0.95448172 is the similarity between the 2nd token and the last token ([SEP]).

I don't know if this is intended, but since those special tokens are weighted down to 0 in the IDF dict, I'm assuming the intent is to never actually consider them. I haven't checked whether matching against special tokens degrades the quality of the metric, so it may not matter in practice. In any case, I felt this was worth documenting as an issue.

asumagic · Mar 19, 2024