german-gpt2
Small documentation example of how to fine-tune the model on German texts for text generation
It is good to see German GPT models, thanks for this.
At https://huggingface.co/dbmdz/german-gpt2 you write that "The model is meant to be an entry point for fine-tuning on other texts..."
Would it be possible for you to provide a small, complete example of how to do this with German texts for text generation? For example, if extra_texts is a list of domain-specific texts I would like to fine-tune the model on, how could your provided example be extended to do this:
from transformers import pipeline
pipe = pipeline('text-generation', model="dbmdz/german-gpt2",
                tokenizer="dbmdz/german-gpt2")
extra_texts = ... # a list of texts to fine tune the model on
... # fine tune model on extra_texts
text = pipe("Der Sinn des Lebens ist es", max_length=100)[0]["generated_text"]
print(text)
Hi @toliwa ,
for fine-tuning I used the official script from Transformers:
https://github.com/huggingface/transformers/tree/master/examples/language-modeling#gpt-2gpt-and-causal-language-modeling
A good example of fine-tuning can be found here:
https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
(Yes, it trains a model from scratch, but you can skip that part and jump directly to the fine-tuning section.)
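If you would rather stay in plain Python instead of running the script, here is a rough, untested sketch of how the extra_texts idea from the question could look with the Trainer API. The file name, block size, output directory and hyperparameters below are only placeholders, not recommended values:

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
    pipeline,
)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

# The domain-specific German texts from the question; placeholder content here.
extra_texts = ["Erster Beispieltext.", "Zweiter Beispieltext."]

# Write the texts to a plain-text file so TextDataset can chunk them
# ("train.txt" is just a placeholder name).
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(extra_texts))

# TextDataset splits the file into fixed-size token windows; block_size=128
# is a small placeholder (GPT-2 supports up to 1024 tokens per block).
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)

# mlm=False gives the causal language-modeling objective used by GPT-2.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./german-gpt2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("./german-gpt2-finetuned")

# Generate with the fine-tuned checkpoint, as in the original pipeline example.
pipe = pipeline("text-generation", model="./german-gpt2-finetuned",
                tokenizer="./german-gpt2-finetuned")
print(pipe("Der Sinn des Lebens ist es", max_length=100)[0]["generated_text"])

For anything larger than a handful of documents, the linked run script (or the datasets library) is the more flexible route, since it handles tokenization, chunking and evaluation for you.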
Hope that helps :hugs:
Hi @stefan-it. Happy to see the surprisingly cool work of the "Bayerische Staatsbibliothek" :)
I'd like to fine-tune the german-gpt2 model on some German text. Any indication of how much data is needed to get useful results? I have around 900 documents with an average sequence length of around 1500. I need to rent a GPU for training, so I wanted to check before paying :)
Cheers!