
How many articles (Wikipedia + BookCorpus) does BERT use in pretraining?

Qinzhun opened this issue 5 years ago · 6 comments

In the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", it is mentioned that the Wikipedia and BookCorpus datasets are used for pre-training. When I generate my own data from Wikipedia, I get about 5.5 million articles, and about 15 million examples of token length 512 using the script create_pretraining_data.py.

The paper also mentions 1,000,000 steps for 40 epochs with batch size 256, which implies about 6.4 million training examples per epoch (1,000,000 × 256 / 40) for Wikipedia + BookCorpus. That is very different from my result. So I am confused: were other preprocessing measures taken, such as filtering out articles shorter than some length? And if I use my 15 million examples for pre-training, will that significantly affect my results?
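
For reference, here is the arithmetic behind my 6.4 million figure (just my own back-of-the-envelope calculation from the numbers reported in the paper, not something the paper states directly):

```python
# Back-of-the-envelope check: numbers reported in the BERT paper vs. my own run.
steps = 1_000_000      # pre-training steps (paper)
batch_size = 256       # sequences per batch (paper)
epochs = 40            # epochs (paper)

sequences_processed = steps * batch_size            # 256,000,000 sequence passes
examples_per_epoch = sequences_processed / epochs   # examples implied per epoch

my_examples = 15_000_000                            # from create_pretraining_data.py on my Wikipedia dump
print(examples_per_epoch)                           # 6400000.0
print(my_examples / examples_per_epoch)             # ~2.34x more than the paper seems to imply
```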

Thanks in advance to anyone who can help!

Qinzhun avatar Apr 11 '19 02:04 Qinzhun

Hi, I have run into the same problem. I processed the corpus with the PyTorch implementation from Hugging Face (https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning), which they say follows create_pretraining_data.py from the TensorFlow repo.

I get 18M training examples with maximum length 512, and I am also confused by the preprocessing. The number of training examples should be larger than the number of documents, yet the number of documents in Wikipedia + BookCorpus is already larger than the number of training examples implied by the BERT paper.

Have you solved this problem? If you have, could you please share how? Thank you very much. @Qinzhun

DecstionBack avatar Apr 23 '19 08:04 DecstionBack

Sorry, I haven't solved this problem yet. And when I tried a new framework, MXNet, which claims it can finish pre-training on 8 GPUs in 6.5 days, I found the same issue we discussed here: they also use fewer training examples than we generate, possibly a size similar to the one in the BERT paper. @DecstionBack

Qinzhun avatar Apr 30 '19 09:04 Qinzhun

Hi, @Qinzhun. I'm not sure, but I think dupe_factor, one of the hyperparameters in create_pretraining_data.py, causes this. It is the number of times the input data is duplicated (with different masks).

https://github.com/google-research/bert/blob/d66a146741588fb208450bde15aa7db143baaa69/create_pretraining_data.py#L53
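
For anyone else who lands here: below is a minimal sketch of how dupe_factor multiplies the number of output instances. This is not the real implementation (which also builds next-sentence pairs and re-samples masked positions on each pass); it only illustrates the counting, and the flag's default is 10 if I remember correctly.

```python
import random

# Minimal sketch, NOT the actual create_pretraining_data.py logic: the script
# iterates over the whole corpus dupe_factor times, producing differently
# masked copies, so the instance count scales roughly linearly with dupe_factor.
def create_instances(documents, dupe_factor, rng):
    instances = []
    for _ in range(dupe_factor):
        for doc in documents:
            # The real script re-samples masked positions and sentence pairs
            # on every pass; here we just attach a fresh random "mask seed".
            instances.append((doc, rng.random()))
    return instances

docs = [f"doc_{i}" for i in range(1000)]
rng = random.Random(12345)
print(len(create_instances(docs, dupe_factor=1, rng=rng)))   # 1000
print(len(create_instances(docs, dupe_factor=10, rng=rng)))  # 10000
```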

roomylee avatar May 09 '19 12:05 roomylee

I also ran the create_pretraining_data.py script. My input is the Wikipedia data (12 GB), 5,684,250 documents in total. First I split the dataset into 10 smaller files using the split command. Then, for each file, I ran the script with dupe_factor = 1 and max_seq_length = 128. In the end, I got a training dataset with 33,236,250 instances.

I also checked the total word count of my Wikipedia data using the wc command: the dataset contains 2,010,692,529 words across 110,819,655 lines. This is less than the number reported in the BERT paper (2,500M words for Wikipedia).

I was also quite confused by the definition of one epoch in the pre-training procedure. In my understanding, dupe_factor = 1 gives one epoch's worth of training data, and dupe_factor = 5 gives five epochs' worth (each duplicate masked differently). Is this understanding correct?
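
To make that concrete, here is how I would convert instance counts into optimizer steps, assuming (as I do above) that one pass over the dupe_factor = 1 output counts as one epoch; this is only my own arithmetic:

```python
# My own arithmetic, assuming one pass over the dupe_factor = 1 output is one epoch.
instances_per_epoch = 33_236_250   # my dupe_factor = 1, max_seq_length = 128 output
batch_size = 256                   # batch size used in the BERT paper

steps_per_epoch = instances_per_epoch / batch_size
print(round(steps_per_epoch))      # ~129,829 steps for one epoch

# With dupe_factor = 5 the script writes ~5x as many instances up front, each
# copy masked differently, so one pass over that file would cover ~5 "epochs".
print(round(5 * steps_per_epoch))  # ~649,146 steps
```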

songsuoyuan avatar Mar 06 '20 03:03 songsuoyuan

I have a similar problem...

JF-D avatar Apr 15 '20 13:04 JF-D

@Qinzhun Did you solve the problem? For one of my projects, I am trying to replicate the BERT pre-training data.

akanyaani avatar Sep 26 '22 11:09 akanyaani