bert_score
Padding token ID
I was looking at the code and came across this line:
https://github.com/Tiiiger/bert_score/blob/dbcf6db37e8bd6ff68446f06b0ba5d0763b62d20/bert_score/utils.py#L634
I might be wrong, but shouldn't the padding value depend on the tokenizer and model, e.g. `padding_value=tokenizer.pad_token_id`? There is no guarantee that every model uses 2 as its padding index.
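For illustration, here is a minimal sketch of the pattern I had in mind (not the actual bert_score code), assuming a Hugging Face tokenizer (`roberta-base` is just an example), where the padding value comes from the tokenizer rather than being hard-coded:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import AutoTokenizer

# Hypothetical sketch: pad with the tokenizer's own pad id instead of a
# hard-coded constant, so the padded batch stays consistent with the model.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

sents = ["a short sentence", "a slightly longer example sentence"]
ids = [torch.tensor(tokenizer.encode(s)) for s in sents]

padded = pad_sequence(ids, batch_first=True, padding_value=tokenizer.pad_token_id)
attention_mask = pad_sequence(
    [torch.ones(len(x), dtype=torch.long) for x in ids],
    batch_first=True,
    padding_value=0,
)
print(padded)
print(attention_mask)
```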
Here as well:
https://github.com/Tiiiger/bert_score/blob/dbcf6db37e8bd6ff68446f06b0ba5d0763b62d20/bert_score/utils.py#L548-L549
I tried replacing it with `tokenizer.pad_token_id`, but that yielded NaNs, so I suppose I was wrong. What does the 2 indicate? Is it not supposed to be the padding token?
@BramVanroy I am also interested in this.
I'm only taking a guess here, but I think they use 2 because a mask array is boolean (0/1), so a value of 2 can never be mistaken for a real mask entry. Then, when summing for the zero_mask, it gives a clear signal that the entire sentence is empty?
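To make that guess concrete, here is a toy example (not the actual utils.py logic): if the real mask entries are only 0/1, padding with 2 keeps padded positions distinguishable, and a row consisting entirely of the pad value could flag an empty sentence.

```python
import torch

# Toy illustration of the guess above, not bert_score's actual code:
# real mask entries are 0/1, so a pad value of 2 can never collide with them.
padded_masks = torch.tensor([
    [1, 1, 1],  # normal sentence: three real tokens
    [2, 2, 2],  # "empty" sentence: every position is padding
])

# A row made up entirely of the pad value has no real tokens at all.
zero_mask = padded_masks.eq(2).all(dim=1)
print(zero_mask)  # tensor([False,  True])
```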
@Tiiiger or @felixgwu - care to weigh in?