german-gpt2
german-gpt2 copied to clipboard
Incorrect vocab size
The vocab_size
in the config.json
is set to 52000, however, the eos_token_id
is also set to 52000, which causes an IndexError
in the forward pass if you have an eos token in your input_ids
.
File "scripts/train.py", line 52, in main
ai.model(input_ids=batch, labels=batch, return_dict=False)
File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 941, in forward
transformer_outputs = self.transformer(
File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 731, in forward
inputs_embeds = self.wte(input_ids)
File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 156, in forward
return F.embedding(
File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/functional.py", line 1916, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
This is because if you look at the size of tokenizer.vocab
(len(tokenizer.vocab)
), it is actually 52001. This isn't an issue with the standard gpt2
model which you can download from huggingface. Its config has the eos_token_id
set to 50256 and vocab_size
set to 50257 (len(tokenizer.vocab)
is also 50257).
Hi @djwessel ,
thanks for reporting that issue!
I will have a closer look at this bug soon :) h
Hopefully it wouldn't require any re-training of already fine-tuned models...
Hey @stefan-it any updates here? Are you even using the eos token in your training of the models?
Hi @djwessel , I don't think it was used during training. However, I will re-train the model, because this is the second tokenizer issue and the last pretraining was only done using a batch size of 1 (because of strange XLA TPU memory consumption).
Thanks to the recent Hugging Face Community Week, training of GPT-2 works way better than 1 year ago 😅
Update: Training is working and will be finished in ~80 hours. It's a normal GPT-2 model trained on the full 16GB corpus, that I've used for our German DBMDZ BERT.
Hi @djwessel ,
the re-trained version of German GPT-2 is now available on the model hub!
You can just use the "old" identifiert dbmdz/german-gpt2
and the re-trained model will be downloaded/updated :hugs:
I still have this problem! model.config.vocab_size i set to 50265, but the tokenizer vocabulary has a length of 50266, with the EOS Token on index 50265. Can someone tell me how I could work around this?
Hey @LFruth ,
unfortunately, there is (or was) something wrong with the tokenizer training example code in the HF readme, resulting in this error:
In [8]: tokenizer.encode("Testsatz zu Ende <|endoftext|>")
Out[8]: [15538, 1029, 362, 1341, 225, 50265]
In [12]: tokenizer.encode("Testsatz zu Ende vriegel")
Out[12]: [15538, 1029, 362, 1341, 289, 50256]
According to config.json
the eos id should be 50256
but it "also" points to riegel
(as subword). However, generation results are looking good if you use the pipeline example from readme.
I will check the tokenizer training example again and will fill a bug report if the error still exists. Thanks for reporting!
Hi @stefan-it,
I'm trying to calculate surprisal scores from the outputs. The scores seem to be on the higher to very high side, as you can see from this minimal example:
import torch.nn.functional as F
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")
inputs = tokenizer("Ich bin der Geist, der stets verneint! Und das mit Recht; denn alles, was entsteht, Ist wert, daß es zugrunde geht; Drum besser wär's, daß nichts entstünde.")
inputs = transformers.BatchEncoding(
{"input_ids": torch.tensor([50256] + inputs["input_ids"]),
"attention_mask": torch.tensor([1] + inputs["attention_mask"])}
)
output_ids = inputs["input_ids"][1:]
with torch.no_grad():
outputs = model(**inputs)
neglogprobs = -1*torch.log2(F.softmax(outputs.logits[:-1], dim=-1))
surprisal = neglogprobs[0, output_ids]
print(surprisal.numpy())
print(tokenizer.decode(output_ids))
The posts above pointed me to use 50256
as the bos token. However, and as you already mentioned, 50256
is decoded to riegel
, which makes me wonder if I'm doing this right. It would probably be better to use the <|endoftext|>
token for this, like the original GPT2.