gpt-2-Pytorch

training

Open armoreal opened this issue 5 years ago • 10 comments

Is there any way to train GPT-2 using my own text corpus?

armoreal avatar Feb 24 '19 08:02 armoreal

@armoreal Which language do you want? Is it English?

graykode avatar Feb 24 '19 08:02 graykode

In Russian.

armoreal avatar Feb 24 '19 08:02 armoreal

@armoreal First, the existing GPT-2 models only support English: https://github.com/openai/gpt-2/issues/31 If you want to train on your own language, I recommend reading the original GPT and GPT-2 papers. See Improving Language Understanding by Generative Pre-Training, sections 3.1 (Unsupervised pre-training) and 3.2 (Supervised fine-tuning)! You can also find the GPT-2 WebText dataset at https://github.com/eukaryote31/openwebtext
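For reference, the two objectives from those sections (in the paper's notation, with U an unlabeled corpus and C a labeled dataset) are:

L1(U) = Σ_i log P(u_i | u_{i−k}, …, u_{i−1}; Θ)
L2(C) = Σ_{(x,y)} log P(y | x^1, …, x^m)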

graykode avatar Feb 24 '19 08:02 graykode

Thanks for your reply. As far as I understand, GPT-2 was trained on English, which is why it doesn't support other languages, but I'd like to try training it on other languages using my own dataset. OpenAI's reply about training: https://github.com/openai/gpt-2/issues/19 So it's possible, but they aren't planning to release the training code yet.

armoreal avatar Feb 24 '19 09:02 armoreal

@armoreal I think this repository can be used for training: https://github.com/openai/finetune-transformer-lm but I think there is no dataset for your language, and compute resources will be a problem. In the GPT-2 paper they explain what differs between GPT and GPT-2; the hard parts for training will be the dataset (including how they preprocess it) and compute power. [screenshot from the paper]

graykode avatar Feb 24 '19 10:02 graykode

@armoreal See the code and the paper for more detail:

  1. Text-prediction (language-modeling) loss: https://github.com/openai/finetune-transformer-lm/blob/master/train.py#L176 (3.1 Unsupervised pre-training)
  2. Task classification loss: https://github.com/openai/finetune-transformer-lm/blob/master/train.py#L193 (3.2 Supervised fine-tuning)

The combined objective is L3(C) = L2(C) + λ * L1(C): https://github.com/openai/finetune-transformer-lm/blob/master/train.py#L205
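In PyTorch, that combined objective looks roughly like the sketch below. This is an illustration rather than the repo's actual code; the tensor names and the lm_coef parameter (standing in for λ, which the paper sets to 0.5) are assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(lm_logits, lm_targets, clf_logits, clf_targets, lm_coef=0.5):
    # L1: next-token language-modeling loss (3.1 Unsupervised pre-training)
    l1 = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                         lm_targets.view(-1))
    # L2: task classification loss (3.2 Supervised fine-tuning)
    l2 = F.cross_entropy(clf_logits, clf_targets)
    # L3(C) = L2(C) + lambda * L1(C), expressed as losses to minimize;
    # lm_coef plays the role of lambda
    return l2 + lm_coef * l1
```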

graykode avatar Feb 24 '19 10:02 graykode

Overall, there is code related to training, so you can train. But the dataset and compute power may be a problem :(
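If your goal is just fine-tuning a pretrained GPT-2 on your own English corpus, a minimal PyTorch sketch using the Hugging Face transformers package could look like this. It is not this repo's code, and corpus.txt, the block size, and the learning rate are placeholder assumptions:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# corpus.txt is a placeholder for your own plain-text training file
ids = tokenizer.encode(open("corpus.txt", encoding="utf-8").read())

block = 512  # tokens per example; GPT-2's context limit is 1024
for start in range(0, len(ids) - block, block):
    batch = torch.tensor([ids[start:start + block]])
    # Passing labels=input_ids makes the model compute the shifted
    # next-token cross-entropy internally; outputs[0] is that loss.
    loss = model(batch, labels=batch)[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```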

Please keep this issue open for everyone!

graykode avatar Feb 24 '19 10:02 graykode

Same question. Thank you.

guotong1988 avatar Feb 25 '19 22:02 guotong1988

Is there a way to fine-tune this GPT-2 implementation on my own English corpus?

robertmacyiii avatar Apr 25 '19 17:04 robertmacyiii

I would like to fine-tune the PyTorch GPT-2 on an English corpus. Is the OpenAI code PyTorch or TF? Are there examples online in PyTorch?

radiodee1 avatar Jun 01 '19 19:06 radiodee1