
Small documentation example of how to fine-tune the model on German texts for text-generation purposes

Open toltoxgh opened this issue 4 years ago • 2 comments

It is good to see German GPT models, thanks for this.

At https://huggingface.co/dbmdz/german-gpt2 you write that "The model is meant to be an entry point for fine-tuning on other texts..."

Would it be possible for you to provide a small, complete example of how to do this with German texts for text-generation purposes? For example, if extra_texts is a list of domain-specific texts I would like to fine-tune the model on, how could your provided example be extended to do this:

    from transformers import pipeline
    
    pipe = pipeline('text-generation', model="dbmdz/german-gpt2",
                     tokenizer="dbmdz/german-gpt2")
    
    extra_texts = ...  # a list of texts to fine-tune the model on
    
    ...  # fine-tune the model on extra_texts
    
    text = pipe("Der Sinn des Lebens ist es", max_length=100)[0]["generated_text"]
    
    print(text)

toltoxgh avatar Jan 28 '21 02:01 toltoxgh

Hi @toltoxgh,

For fine-tuning, I used the official script from Transformers:

https://github.com/huggingface/transformers/tree/master/examples/language-modeling#gpt-2gpt-and-causal-language-modeling
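
A rough invocation of that script could look like this (just a sketch: the flags come from the run_clm.py script in that directory, while the file name, hyperparameters, and output directory are placeholders you would adapt):

    python run_clm.py \
        --model_name_or_path dbmdz/german-gpt2 \
        --train_file extra_texts.txt \
        --do_train \
        --per_device_train_batch_size 2 \
        --num_train_epochs 3 \
        --output_dir ./german-gpt2-finetuned

where extra_texts.txt is a plain-text file containing your domain-specific texts.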

A good example of fine-tuning can be found here:

https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

(Yes, it trains a model from scratch, but you can skip that part and have a look directly at the relevant section.)
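
If you prefer to stay in Python instead of calling the script, a minimal sketch with the Trainer API could look like this (untested; max_length=512, 3 epochs, batch size 2, and the output directory are placeholder values, and extra_texts is your own list of strings):

    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )
    
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
    model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")
    
    # GPT-2 has no padding token; reuse the end-of-sequence token for padding.
    tokenizer.pad_token = tokenizer.eos_token
    
    extra_texts = [...]  # your list of domain-specific German texts
    
    class TextDataset(torch.utils.data.Dataset):
        """Wraps the tokenized texts so the Trainer can iterate over them."""
        def __init__(self, texts):
            self.encodings = tokenizer(texts, truncation=True, max_length=512)
        def __len__(self):
            return len(self.encodings["input_ids"])
        def __getitem__(self, idx):
            return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
    
    # mlm=False makes the collator produce causal-LM labels (a shifted copy
    # of the input ids) instead of masked-LM labels.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="./german-gpt2-finetuned",
            num_train_epochs=3,
            per_device_train_batch_size=2,
        ),
        train_dataset=TextDataset(extra_texts),
        data_collator=collator,
    )
    trainer.train()
    trainer.save_model("./german-gpt2-finetuned")

Afterwards you can point the pipeline from your snippet at the saved directory, e.g. pipeline('text-generation', model="./german-gpt2-finetuned", tokenizer="./german-gpt2-finetuned").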

Hope that helps :hugs:

stefan-it avatar Feb 02 '21 15:02 stefan-it

Hi @stefan-it. Happy to see the surprisingly cool work of the "Bayerische Staatsbibliothek" :)

I'd like to fine-tune the german-gpt2 model on some German text. Any indication of how much data is needed to get useful results? I have around 900 documents with an average sequence length of around 1500. I need to rent a GPU for training, so I wanted to check before paying :)

Cheers!

jonas-nothnagel avatar Mar 10 '21 14:03 jonas-nothnagel