ALBERT-TF2.0

Getting huge number training steps

Open 008karan opened this issue 4 years ago • 0 comments

I have generated pretraining data using the steps given in this repo. I am doing this for the Hindi language with 22 GB of data. Generating the pretraining data itself took 1 month! So I have a meta_data file associated with each TFRecord file. I summed the train_data_size values from all the meta_data files to make one combined meta_data file, because run_pretraining.py requires it. My final meta_data file looks something like this:

{
    "task_type": "albert_pretraining",
    "train_data_size": 596972848,
    "max_seq_length": 512,
    "max_predictions_per_seq": 20
}
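
For reference, here is roughly how I combined them (a minimal sketch; the glob pattern and file paths are just examples from my setup):

import glob
import json

# Sum train_data_size across all per-shard meta_data files
# (the path pattern is an example; adjust to your output directory).
total = 0
for path in glob.glob("pretraining_output/*_meta_data"):
    with open(path) as f:
        meta = json.load(f)
    total += meta["train_data_size"]

# Write one combined meta_data file for run_pretraining.py,
# keeping the other fields from my generation config.
combined = {
    "task_type": "albert_pretraining",
    "train_data_size": total,
    "max_seq_length": 512,
    "max_predictions_per_seq": 20,
}
with open("meta_data", "w") as f:
    json.dump(combined, f, indent=4)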

Here the number of training steps is calculated as below:

num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs

So total_train_examples is 596972848, and hence I get num_train_steps = 9327700 with a batch size of 64 and 1 epoch only. I saw that in the README here num_train_steps=125000. I don't understand what went wrong here.
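Just to double-check the arithmetic with my numbers:

total_train_examples = 596972848
train_batch_size = 64
num_train_epochs = 1

# Same formula as in run_pretraining.py above.
num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs
print(num_train_steps)  # 9327700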

With such a huge number of training steps, it will take forever to train ALBERT. Even if I increase the batch size to 512 with 1 epoch only, the number of training steps is 1165962, which is still huge! Since ALBERT was trained on very large data, why are there only 125000 steps? I also want to know how many epochs were used in the ALBERT training for English.
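Inverting the same formula shows how little of my data 125000 steps would actually cover (the 4096 batch size is just an assumption for comparison, since very large batches are common for these models):

total_train_examples = 596972848
num_train_steps = 125000

# Effective epochs for a given batch size:
# epochs = steps * batch_size / dataset_size
for batch_size in (64, 512, 4096):
    epochs = num_train_steps * batch_size / total_train_examples
    print(batch_size, round(epochs, 3))
# 64    0.013
# 512   0.107
# 4096  0.858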

Can anyone suggest what went wrong and what I should do now?

008karan · Mar 05 '20 16:03