bert_score

Padding token ID

Open BramVanroy opened this issue 2 years ago • 2 comments

I was looking at the code and came across this line

https://github.com/Tiiiger/bert_score/blob/dbcf6db37e8bd6ff68446f06b0ba5d0763b62d20/bert_score/utils.py#L634

I might be wrong, but shouldn't the padding value depend on the tokenizer and model, e.g. padding_value=tokenizer.pad_token_id? There is no guarantee that every model uses 2 as its padding index.

Here as well:

https://github.com/Tiiiger/bert_score/blob/dbcf6db37e8bd6ff68446f06b0ba5d0763b62d20/bert_score/utils.py#L548-L549
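To illustrate the concern: the padding index genuinely varies across models (e.g. BERT uses 0, RoBERTa uses 1), so when padding *token ids* the value would normally come from the tokenizer. A minimal sketch, with a hypothetical pad_token_id and made-up token ids rather than bert_score's actual code:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical pad id; in practice this would be tokenizer.pad_token_id
# (0 for BERT, 1 for RoBERTa, etc.), so hard-coding a value is unsafe.
pad_token_id = 1

# Two made-up token-id sequences of different lengths.
seqs = [torch.tensor([101, 7592, 102]), torch.tensor([101, 102])]

# pad_sequence fills the shorter sequence with the model-specific pad id.
padded = pad_sequence(seqs, batch_first=True, padding_value=pad_token_id)
# tensor([[101, 7592,  102],
#         [101,  102,    1]])
```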

BramVanroy avatar Apr 03 '23 20:04 BramVanroy

I tried replacing it with tokenizer.pad_token_id, but that yielded NaNs, so I guess I was wrong. What does the 2 indicate? Isn't it supposed to be the padding token?

BramVanroy avatar Apr 03 '23 21:04 BramVanroy

@BramVanroy I am also interested in this.

I'm taking a guess here, but I think they use 2 because the mask array is boolean ([0, 1]), so padding with 2 keeps padded positions distinguishable from real mask values. Then, when summing for the zero_mask, it's a clear signal that the entire sentence is empty?
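If that guess is right, the trick would look something like the sketch below: since a 0/1 attention mask can never legitimately contain a 2, padding the *mask* with 2 marks exactly which positions are padding, which a plain 0 or the tokenizer's pad_token_id could not do. This is only a hedged reconstruction of the idea, not bert_score's actual code:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical attention masks for two sentences of different lengths:
# 1 = real token, 0 = masked-out token.
masks = [torch.tensor([1, 1, 1]), torch.tensor([1, 1])]

# Pad with a sentinel (2) that cannot occur in a 0/1 mask, so padded
# positions stay distinguishable from genuine mask values.
padded = pad_sequence(masks, batch_first=True, padding_value=2)
# tensor([[1, 1, 1],
#         [1, 1, 2]])

# Positions equal to 2 are padding; recover a usable 0/1 mask by
# zeroing them out.
pad_positions = padded == 2
real_mask = padded.masked_fill(pad_positions, 0)
# tensor([[1, 1, 1],
#         [1, 1, 0]])
```

Padding with the tokenizer's pad_token_id here (often 0 or 1) would collapse that distinction, since padded entries would be indistinguishable from real mask values, which might explain the NaNs seen above.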

@Tiiiger or @felixgwu - care to weigh in?

ahgraber avatar Sep 05 '23 20:09 ahgraber