german-gpt2

Surprisal calculation?

Open justeuer opened this issue 1 year ago • 0 comments

Hi @stefan-it,

I'm trying to calculate surprisal scores from the model outputs. The scores come out on the high to very high side, as you can see from this minimal example:

import torch
import torch.nn.functional as F
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

inputs = tokenizer("Ich bin der Geist, der stets verneint! Und das mit Recht; denn alles, was entsteht, Ist wert, daß es zugrunde geht; Drum besser wär's, daß nichts entstünde.")
# Prepend 50256 as a BOS token so the first real token also gets a score.
inputs = transformers.BatchEncoding(
    {"input_ids": torch.tensor([50256] + inputs["input_ids"]),
     "attention_mask": torch.tensor([1] + inputs["attention_mask"])}
)
output_ids = inputs["input_ids"][1:]
with torch.no_grad():
    outputs = model(**inputs)
    # Surprisal in bits: -log2 p(token | context).
    neglogprobs = -torch.log2(F.softmax(outputs.logits, dim=-1))
    # The logits at position i predict the token at position i + 1, so gather
    # each actual next token's score from the preceding position.
    surprisal = neglogprobs[0, torch.arange(len(output_ids)), output_ids]
    print(surprisal.numpy())
    print(tokenizer.decode(output_ids))

The posts above pointed me to use 50256 as the BOS token. However, as you already mentioned, 50256 decodes to riegel, which makes me wonder whether I'm doing this right. It would probably be better to use the <|endoftext|> token for this, as the original GPT-2 does.
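
Just to double-check the special-token setup, this is roughly how I inspected what 50256 maps to and which special tokens the tokenizer defines (a quick sketch using the standard tokenizer attributes, nothing model-specific assumed):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
print(tokenizer.decode([50256]))       # the id suggested as BOS in the posts above
print(tokenizer.special_tokens_map)    # which special tokens are actually configured
print(tokenizer.bos_token_id, tokenizer.eos_token_id)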

I also tried minicons, which throws an OOV error because it prefixes <|endoftext|> to all sequences.
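
If it helps to reproduce: the OOV seems to come from <|endoftext|> not being in this model's vocabulary, which can be checked like this (a minimal sketch; I'm assuming minicons prepends the literal string <|endoftext|>):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
print("<|endoftext|>" in tokenizer.get_vocab())  # False would explain the OOV error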

justeuer · May 18 '23 10:05