
Incorrect vocab size


The vocab_size in config.json is set to 52000, but the eos_token_id is also set to 52000, which causes an IndexError in the forward pass if your input_ids contain an eos token.

File "scripts/train.py", line 52, in main
    ai.model(input_ids=batch, labels=batch, return_dict=False)
  File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 941, in forward
    transformer_outputs = self.transformer(
  File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 731, in forward
    inputs_embeds = self.wte(input_ids)
  File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 156, in forward
    return F.embedding(
  File "/home/ai/projects/discovery/schlager-ai-2/.venv/lib/python3.8/site-packages/torch/nn/functional.py", line 1916, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

This is because the tokenizer vocabulary (len(tokenizer.vocab)) actually has 52001 entries. This isn't an issue with the standard gpt2 model you can download from Hugging Face: its config sets eos_token_id to 50256 and vocab_size to 50257, and len(tokenizer.vocab) is also 50257.
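
A quick way to see the mismatch locally is something along these lines (just a sketch; the ids in the comments are the ones reported above, not re-checked here):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

print(model.config.vocab_size)   # rows in the embedding matrix (reported as 52000)
print(len(tokenizer))            # ids the tokenizer can actually emit (reported as 52001)
print(tokenizer.eos_token_id)    # must be strictly smaller than vocab_size, or the wte lookup fails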

djwessel avatar Jun 11 '21 08:06 djwessel

Hi @djwessel ,

thanks for reporting that issue!

I will have a closer look at this bug soon :)

Hopefully it won't require any re-training of already fine-tuned models...

stefan-it avatar Jun 24 '21 19:06 stefan-it

Hey @stefan-it any updates here? Are you even using the eos token in your training of the models?

djwessel avatar Jul 23 '21 09:07 djwessel

Hi @djwessel , I don't think it was used during training. However, I will re-train the model, because this is the second tokenizer issue and the last pretraining was done with a batch size of only 1 (due to strange XLA TPU memory consumption).

Thanks to the recent Hugging Face Community Week, training of GPT-2 works way better than 1 year ago 😅

stefan-it avatar Jul 27 '21 07:07 stefan-it

Update: Training is working and will be finished in ~80 hours. It's a normal GPT-2 model trained on the full 16GB corpus that I used for our German DBMDZ BERT.

stefan-it avatar Jul 31 '21 20:07 stefan-it

Hi @djwessel ,

the re-trained version of German GPT-2 is now available on the model hub!

You can just use the "old" identifier dbmdz/german-gpt2 and the re-trained model will be downloaded/updated :hugs:

stefan-it avatar Aug 17 '21 08:08 stefan-it

I still have this problem! model.config.vocab_size is set to 50265, but the tokenizer vocabulary has a length of 50266, with the EOS token at index 50265. Can someone tell me how I could work around this?
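
For reference, a sketch of a possible local workaround (assuming the crash is the only problem: the added embedding row would be untrained, so this avoids the IndexError rather than fixing the checkpoint):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

# If the tokenizer knows more ids than the model has embedding rows, grow the embedding matrix.
if len(tokenizer) > model.config.vocab_size:
    model.resize_token_embeddings(len(tokenizer))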

LFruth avatar Jan 07 '22 15:01 LFruth

Hey @LFruth ,

unfortunately, there is (or was) something wrong with the tokenizer training example code in the HF readme, resulting in this error:

In [8]: tokenizer.encode("Testsatz zu Ende <|endoftext|>")
Out[8]: [15538, 1029, 362, 1341, 225, 50265]

In [12]: tokenizer.encode("Testsatz zu Ende vriegel")
Out[12]: [15538, 1029, 362, 1341, 289, 50256]

According to config.json the eos id should be 50256, but that id "also" points to riegel (as a subword). However, generation results look good if you use the pipeline example from the readme.
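
For reference, a sketch of how to surface that mismatch directly (the ids and the riegel subword in the comments are the ones quoted above):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

print(model.config.eos_token_id)                         # what config.json claims (50256)
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))  # the id the tokenizer really assigns (50265 above)
print(tokenizer.decode([50256]))                         # shows the subword that id 50256 actually maps to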

I will check the tokenizer training example again and will file a bug report if the error still exists. Thanks for reporting!

stefan-it avatar Jan 07 '22 16:01 stefan-it

Hi @stefan-it,

I'm trying to calculate surprisal scores from the outputs. The scores seem to be on the higher to very high side, as you can see from this minimal example:

import torch
import torch.nn.functional as F
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

inputs = tokenizer("Ich bin der Geist, der stets verneint! Und das mit Recht; denn alles, was entsteht, Ist wert, daß es zugrunde geht; Drum besser wär's, daß nichts entstünde.")
# Prepend id 50256 as a BOS token and keep the encoding as plain tensors.
inputs = transformers.BatchEncoding(
    {"input_ids": torch.tensor([50256] + inputs["input_ids"]),
     "attention_mask": torch.tensor([1] + inputs["attention_mask"])}
)
output_ids = inputs["input_ids"][1:]
with torch.no_grad():
    outputs = model(**inputs)
    # Logits come back with a batch dimension of 1; drop it and the final position,
    # which only predicts the token after the end of the input.
    neglogprobs = -torch.log2(F.softmax(outputs.logits[0, :-1], dim=-1))
    # Surprisal of each actual token given its left context.
    surprisal = neglogprobs[torch.arange(output_ids.size(0)), output_ids]
    print(surprisal.numpy())
    print(tokenizer.decode(output_ids))

The posts above pointed me to use 50256 as the bos token. However, as you already mentioned, 50256 decodes to riegel, which makes me wonder if I'm doing this right. It would probably be better to use the <|endoftext|> token for this, like the original GPT-2.
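
A small sketch of what I mean, reading the id from the tokenizer instead of hard-coding 50256 (just an idea, not verified against the current checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")

# Ask the tokenizer which id it uses for <|endoftext|> rather than assuming 50256.
bos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")
print(bos_id, tokenizer.eos_token_id, tokenizer.decode([bos_id]))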

justeuer avatar May 17 '23 13:05 justeuer