
Getting huge number of training steps

Open 008karan opened this issue 4 years ago • 5 comments

I have generated pretraining data using https://github.com/kamalkraj/ALBERT-TF2.0 because it supports training with multiple GPUs. I am doing this for the Hindi language with 22 GB of data. Generating the pretraining data itself took 1 month! Each tf.record file has an associated meta_data file, and I have summed the train_data_size values from all the meta_data files into a single meta_data file because run_pretraining.py requires it. My final meta_data file looks like this:

{
    "task_type": "albert_pretraining",
    "train_data_size": 596972848,
    "max_seq_length": 512,
    "max_predictions_per_seq": 20
}

Here the number of training steps is calculated as below:

num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs

So total_train_examples is 596972848, hence I get num_train_steps of 9327700 with a batch size of 64 and only 1 epoch. I saw that in the readme here num_train_steps=125000. I don't understand what went wrong here.

With such a huge number of training steps, it will take forever to train ALBERT. Even if I increase the batch size to 512 with only 1 epoch, the number of training steps is 1165962, which is still huge! Since ALBERT was trained on a very large dataset, why are there only 125000 steps? I also want to know how many epochs were used when training ALBERT for English.
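For reference, the step counts above follow directly from the formula in run_pretraining.py; a quick check:

# Sanity check of the step counts quoted above.
total_train_examples = 596_972_848
num_train_epochs = 1

for train_batch_size in (64, 512):
    num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs
    print(train_batch_size, num_train_steps)
# 64  -> 9327700
# 512 -> 1165962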

Can anyone suggest what I should do now?

008karan avatar Mar 05 '20 15:03 008karan

"To keep the comparison as meaningful as possible, we follow the BERT (Devlin et al., 2019) setup inusing the BookCorpus(Zhu et al., 2015) and English Wikipedia (Devlin et al., 2019) for pretraining baseline models." Albert

So I guess they also follow BERT's A.2 Pretraining Procedure (Devlin et al.):

"We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus."
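A rough check of the "approximately 40 epochs" figure, just from the numbers in that quote (treating tokens and words as roughly interchangeable):

# BERT pretraining coverage from the quoted figures.
tokens_per_batch = 256 * 512          # 131,072, rounded to ~128,000 in the paper
total_steps = 1_000_000
corpus_words = 3.3e9                  # BookCorpus + English Wikipedia

print(tokens_per_batch * total_steps / corpus_words)   # ~39.7, i.e. roughly 40 epochs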

Hope it helps.

arrrrrmin avatar Mar 05 '20 18:03 arrrrrmin

@arrrrrmin thanks for pointing this out. I think in my case I need to regenerate the training data with a smaller max_sequence_length, which will lower total_train_examples. Is there any method to speed up the data preparation process? Will reducing max_predictions_per_seq also help? Do you have any suggestions?
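One thing I'm considering for the data preparation is running the generation script once per input shard in parallel, roughly like the sketch below. The flag names are assumptions based on the usual create_pretraining_data.py and may differ in the script you actually use.

# Hypothetical sketch: run create_pretraining_data.py over input shards in
# parallel, one process per shard. Flag names are assumptions and may need
# adjusting to match your copy of the script.
import glob
import subprocess
from multiprocessing import Pool

def build_shard(input_file):
    output_file = input_file.replace(".txt", ".tfrecord")
    subprocess.run(
        [
            "python", "create_pretraining_data.py",
            "--input_file", input_file,
            "--output_file", output_file,
            "--max_seq_length", "128",          # shorter sequences -> faster generation
            "--max_predictions_per_seq", "20",
        ],
        check=True,
    )

if __name__ == "__main__":
    shards = glob.glob("corpus_shard_*.txt")    # placeholder file pattern
    with Pool(processes=8) as pool:             # one worker per spare CPU core
        pool.map(build_shard, shards)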

008karan avatar Mar 05 '20 18:03 008karan

I think your calculation is correct. The original ALBERT was trained with a batch size of 4096 (as specified in the paper); that is why LAMB was used and why it only took 125000 steps to train. That said, since you are updating far more frequently with a smaller batch size, it shouldn't take a full epoch to converge.
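Putting the two numbers side by side (125000 steps and batch 4096 from the paper/readme, the example count from your meta_data) shows the "small" step count is just the large batch:

# Examples seen by the original ALBERT run vs. one epoch over the Hindi corpus.
albert_steps, albert_batch = 125_000, 4096
print(albert_steps * albert_batch)        # 512,000,000 examples seen in total

hindi_examples = 596_972_848
print(hindi_examples // albert_batch)     # ~145,745 steps for one epoch at batch 4096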

Also, if all you care about is multi-GPU training, there's a script in the XLNet repository that does exactly that; you just need to change some graph definitions and optimizer-related things, and everything related to ALBERT's input pipeline can be used as-is.
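If you'd rather stay in TF2, the general pattern looks roughly like this with MirroredStrategy. This is a sketch, not working code: build_albert_pretraining_model, pretraining_loss and make_pretraining_dataset are hypothetical placeholders for whatever your training repo already provides.

# Sketch only: the existing TFRecord input pipeline stays as-is; only the
# model/optimizer creation moves under a distribution strategy scope.
# The build_*/make_*/loss names below are hypothetical placeholders.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # one replica per visible GPU
print("Num replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_albert_pretraining_model()       # placeholder model fn
    optimizer = tf.keras.optimizers.Adam(1e-4)     # or LAMB, if available
    model.compile(optimizer=optimizer, loss=pretraining_loss)  # placeholder loss

train_ds = make_pretraining_dataset("*.tfrecord", batch_size=64)  # placeholder pipeline
model.fit(train_ds, steps_per_epoch=100_000, epochs=1)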

illuminascent avatar Mar 06 '20 01:03 illuminascent

Still, I need to complete at least 1 epoch to pass the whole dataset through the model, don't I?

008karan avatar Mar 06 '20 07:03 008karan

@008karan If you haven't done a full shuffle on your data -> yes. Otherwise any subset of the training dataset will represent the whole set well enough, and it's perfectly fine to stop short of a complete epoch. Google did a similar thing when training T5, because the C4 dataset is too big to cover entirely.
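In input-pipeline terms, a "full shuffle" basically means random shard order plus a large record-level shuffle buffer, something like the sketch below (the file pattern, cycle length and buffer sizes are just placeholders):

# Sketch of a "full shuffle" input pipeline: shards are listed in random order,
# interleaved, and records are shuffled with a large buffer, so any prefix of
# the stream is roughly a random sample of the corpus.
import tensorflow as tf

files = tf.data.Dataset.list_files("hindi_pretrain_*.tfrecord", shuffle=True)
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=16,                               # read 16 shards at once
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)
dataset = (dataset
           .shuffle(buffer_size=100_000)           # record-level shuffle
           .repeat()
           .batch(64)
           .prefetch(tf.data.experimental.AUTOTUNE))
# Then train for fewer steps than a full epoch would need.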

illuminascent avatar Mar 06 '20 08:03 illuminascent