Getting a huge number of training steps
I have generated pretraining data using the steps given in this repo.
I am doing this for the Hindi language with 22 GB of data. Generating the pretraining data itself took 1 month!
So I have a meta_data file associated with each tf.record file. I have added up the train_data_size values from all the meta_data files to make a single meta_data file, because run_pretraining.py requires it. My final meta_data file looks something like this:
{
  "task_type": "albert_pretraining",
  "train_data_size": 596972848,
  "max_seq_length": 512,
  "max_predictions_per_seq": 20
}
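
For reference, this is roughly how I merged them; a minimal sketch, assuming the per-shard files follow a *_meta_data naming pattern (the paths here are placeholders, not my actual file names):

import glob
import json

# Sum train_data_size over all per-shard meta_data files.
total = 0
for path in glob.glob("pretrain_data/*_meta_data"):  # hypothetical naming
    with open(path) as f:
        total += json.load(f)["train_data_size"]

# Write the single merged meta_data file that run_pretraining.py reads.
merged = {
    "task_type": "albert_pretraining",
    "train_data_size": total,  # 596972848 in my case
    "max_seq_length": 512,
    "max_predictions_per_seq": 20,
}
with open("meta_data", "w") as f:
    json.dump(merged, f, indent=2)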
Here the number of training steps is calculated as below:
num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs
Since total_train_examples is 596972848, I am getting num_train_steps of 9327700 with a batch size of 64 and only 1 epoch. I saw that in the README here num_train_steps=125000. I don't understand what went wrong here.
With such a huge number of training steps, it will take forever to train ALBERT. Even if I increase the batch size to 512 with only 1 epoch, the number of training steps will be 1165962, which is still huge! As ALBERT was trained on very large data, why are there only 125000 steps? I also want to know how many epochs were used in the ALBERT training for English.
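
Just to double-check the arithmetic, here is a quick sanity check with the formula above (nothing here beyond the numbers already quoted):

total_train_examples = 596972848
num_train_epochs = 1

# num_train_steps for the two batch sizes I tried
for train_batch_size in (64, 512):
    num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs
    print(train_batch_size, num_train_steps)
# prints: 64 9327700
#         512 1165962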
Can anyone suggest what went wrong and what should I do now?