german-gpt2
Surprisal calculation?
Hi @stefan-it,
I'm trying to calculate surprisal scores from the model outputs. The scores come out on the high side (sometimes very high), as you can see from this minimal example:
import torch
import torch.nn.functional as F
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

inputs = tokenizer("Ich bin der Geist, der stets verneint! Und das mit Recht; denn alles, was entsteht, Ist wert, daß es zugrunde geht; Drum besser wär's, daß nichts entstünde.")
# Prepend 50256 as a makeshift BOS token so that the first real token also gets a score.
inputs = transformers.BatchEncoding(
    {"input_ids": torch.tensor([50256] + inputs["input_ids"]),
     "attention_mask": torch.tensor([1] + inputs["attention_mask"])}
)
output_ids = inputs["input_ids"][1:]

with torch.no_grad():
    outputs = model(**inputs)

# Surprisal in bits: -log2 p(token | context); the logits at position i predict token i+1.
neglogprobs = -torch.log2(F.softmax(outputs.logits, dim=-1))
surprisal = neglogprobs[0, torch.arange(len(output_ids)), output_ids]
print(surprisal.numpy())
print(tokenizer.decode(output_ids))
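As a sanity check on the indexing, the same per-token surprisal can also be written as a cross-entropy over the shifted logits, converted from nats to bits. This is just an equivalent formulation, reusing outputs and output_ids from the snippet above:

import math
import torch.nn.functional as F

# Positions 0..n-2 of the logits predict tokens 1..n-1 (i.e. output_ids).
shift_logits = outputs.logits[0, :-1, :]
xent_nats = F.cross_entropy(shift_logits, output_ids, reduction="none")
surprisal_bits = xent_nats / math.log(2)  # should match `surprisal` above
print(surprisal_bits.numpy())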
The posts above pointed me to using 50256 as the BOS token. However, as you already mentioned, 50256 decodes to "riegel", which makes me wonder whether I'm doing this right. It would probably be better to use an <|endoftext|> token for this, like the original GPT-2 does.
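In case it helps, this is roughly how I would inspect what the tokenizer actually defines before hard-coding an ID (just a sketch, reusing the tokenizer from above; I don't know whether dbmdz/german-gpt2 ships a BOS or <|endoftext|> entry at all):

# Inspect the configured special tokens instead of hard-coding 50256.
print(tokenizer.special_tokens_map)
print(tokenizer.bos_token, tokenizer.bos_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)

eot_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")
print("<|endoftext|> maps to:", eot_id)  # unk id (or None) if it is not in the vocab
print("id 50256 decodes to:", tokenizer.decode([50256]))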
I also tried minicons, which throws an OOV error because it prefixes <|endoftext|> to all sequences.
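For reference, this is roughly the minicons call I tried (from memory, so the exact signature may differ; the failure presumably comes from <|endoftext|> not being in this model's vocabulary):

from minicons import scorer

lm = scorer.IncrementalLMScorer("dbmdz/german-gpt2", "cpu")
# Per-token surprisal in bits; this is where the OOV error is raised for me.
print(lm.token_score(["Ich bin der Geist, der stets verneint!"],
                     surprisal=True, base_two=True))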