transformer-models
How to train a BERT model from scratch
How can I train a BERT model from scratch?
We don't have any example code for this, but it is possible. You'll need to do a few things:
- Get a dataset to train on. The original BERT uses Wikipedia and BookCorpus.
- Set up the pre-training tasks. See section 3.1 of the paper or create_training_data.py for how this was done in Python; a minimal masking sketch follows this list.
- Initialize learnable parameters. To use our bert.model implementation you need a struct of parameters in the same format as the Parameters field of the struct that bert() returns. The original weight initialization scheme is detailed in modeling.py.
- Write the pretraining script. This is similar to FineTuneBERT.m, but you'll need to tweak the configuration (mini-batch size, number of epochs, learn rate), and to replicate run_pretraining.py you need additional things in the training loop, such as learn rate warmup/decay and gradient clipping (as in optimization.py); a skeleton of that loop is sketched below. If you're attempting to train at scale you'll probably want to adapt the training loop to use multiple GPUs as in this example.
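As a concrete illustration of step 2, here is a minimal sketch of the masked-LM corruption described in section 3.1 (the 80/10/10 rule implemented in create_training_data.py). The conventions here are assumptions for illustration only: maskID and specialIDs are whatever token IDs your tokenizer uses, and NaN marks positions the MLM loss should ignore.

function [inputIDs, mlmLabels] = maskTokensForMLM(inputIDs, maskID, vocabSize, specialIDs)
% Apply the BERT masked-LM corruption to a row vector of token IDs.
% 15% of the non-special tokens are selected; of those, 80% are replaced
% by the mask token, 10% by a random token, and 10% are left unchanged.
% mlmLabels holds the original ID at selected positions and NaN elsewhere.
mlmLabels = nan(size(inputIDs));
candidates = find(~ismember(inputIDs, specialIDs));
selected = candidates(rand(size(candidates)) < 0.15);
mlmLabels(selected) = inputIDs(selected);
r = rand(size(selected));
inputIDs(selected(r < 0.8)) = maskID;
randomPositions = selected(r >= 0.8 & r < 0.9);
inputIDs(randomPositions) = randi(vocabSize, size(randomPositions));
% The remaining ~10% of selected positions keep their original token.
end

In practice you would apply something like this per sequence while assembling each mini-batch, and only compute the masked-LM loss at the positions where mlmLabels is not NaN.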
It is worth being aware of the conclusions in section 5 of RoBERTa, particularly for setting up the pre-training tasks in step 2.
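To make the last step a bit more concrete, here is a rough skeleton of such a training loop. Everything in it is illustrative: modelGradients is a hypothetical function you would have to write to evaluate the pre-training losses and return gradients as a struct matching mdl.Parameters.Weights, the hyperparameters are placeholders, and the clipping shown is per-parameter rather than the global-norm clipping used in optimization.py.

% Hypothetical hyperparameters for illustration only.
numSteps = 10000;
warmupSteps = 1000;
peakLearnRate = 1e-4;
gradientThreshold = 1;
averageGrad = [];
averageSqGrad = [];

for step = 1:numSteps
    % ... assemble miniBatch from your datastore here ...

    % modelGradients (hypothetical) runs bert.model plus the pre-training
    % heads on the mini-batch and returns the loss and a struct of
    % gradients with the same fields as mdl.Parameters.Weights.
    [loss, gradients] = dlfeval(@modelGradients, miniBatch, mdl.Parameters.Weights);

    % Linear warmup followed by linear decay, roughly as in optimization.py.
    if step < warmupSteps
        learnRate = peakLearnRate * step/warmupSteps;
    else
        learnRate = peakLearnRate * (numSteps - step)/(numSteps - warmupSteps);
    end

    % Clip each parameter's gradient by its L2 norm (optimization.py clips
    % the global norm, which needs an extra pass over the gradient struct).
    clipFcn = @(g) g .* min(1, gradientThreshold./max(sqrt(sum(g.^2,"all")), eps));
    gradients = dlupdate(clipFcn, gradients);

    % Adam update of the whole parameter struct.
    [mdl.Parameters.Weights, averageGrad, averageSqGrad] = adamupdate( ...
        mdl.Parameters.Weights, gradients, averageGrad, averageSqGrad, step, learnRate);
end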
There's definitely a lot of work here, so I think we should keep this issue open as an enhancement to add a pre-training script.
The given example uses the BERT model from a pretrained struct. How can I create a new BERT model without loading the pretrained weights, as in step 3? The struct that bert() returns still comes from the loaded pretrained model, so how do I get a struct of parameters in the same format as its Parameters field? Do I define the struct first? Any appropriate scripts would be welcome.
If you can use the same initializer for every parameter then the quickest thing you can do is something like:
mdl = bert;
% write an initializer function that
% takes an existing dlarray parameter as input
% and returns a dlarray parameter of the same size.
initializer = @(w) 0.1*randn(size(w),"like",w);
mdl.Parameters.Weights = dlupdate(initializer,mdl.Parameters.Weights);
This is a little limited if you need to do something like use different initializers for the embeddings, linear layers, layer norms, etc. For that case I would write a suite of functions to initialize the struct, that might start like:
function weights = initializeBert()
weights = struct(...
"embeddings",initializeEmbeddings(),...
"encoder_layers",initializeEncoderLayers());
end
function weights = initializeEmbeddings()
% The numbers here are sizes from bert-base
weights = struct(...
"LayerNorm", initializeLayerNorm(768),...
"position_embeddings", initializeEmbedding(768,512),...
"token_type_embeddings", initializeEmbedding(768,2),...
"word_embeddings", initializeEmbedding(768,30522));
end
% etc.
You have to implement initializeEmbedding, initializeLayerNorm, and initializeEncoderLayers. It takes some time, but luckily each encoder layer has the same structure, so you can just write a loop to initialize them all with a single initializeEncoderLayer implementation.
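For reference, a sketch of what those helpers might look like is below. The 0.02 standard deviation mirrors the truncated-normal initializer in modeling.py (plain randn is used here for brevity), but the field names and nesting of each encoder layer are guesses for illustration: before relying on this, compare them against the fields of the pretrained struct that bert() returns.

function weights = initializeEmbedding(hiddenSize, numEmbeddings)
% One embedding matrix, hiddenSize-by-numEmbeddings.
weights = dlarray(0.02*randn(hiddenSize, numEmbeddings, "single"));
end

function weights = initializeLayerNorm(hiddenSize)
% Scale starts at one, offset at zero; field names are assumptions.
weights = struct(...
    "gamma", dlarray(ones(hiddenSize, 1, "single")),...
    "beta", dlarray(zeros(hiddenSize, 1, "single")));
end

function weights = initializeEncoderLayers()
% bert-base has 12 identical encoder layers, so loop over one initializer.
numLayers = 12;
for i = numLayers:-1:1
    weights(i) = initializeEncoderLayer();
end
end

function weights = initializeEncoderLayer()
% The layout below is a guess for illustration only; mirror whatever the
% pretrained struct from bert() actually contains.
hiddenSize = 768; intermediateSize = 3072; % bert-base sizes
weights = struct(...
    "attention", struct(...
        "query", initializeLinear(hiddenSize, hiddenSize),...
        "key", initializeLinear(hiddenSize, hiddenSize),...
        "value", initializeLinear(hiddenSize, hiddenSize),...
        "output", initializeLinear(hiddenSize, hiddenSize),...
        "LayerNorm", initializeLayerNorm(hiddenSize)),...
    "intermediate", initializeLinear(intermediateSize, hiddenSize),...
    "output", initializeLinear(hiddenSize, intermediateSize),...
    "LayerNorm", initializeLayerNorm(hiddenSize));
end

function weights = initializeLinear(outputSize, inputSize)
% A fully connected layer's weight matrix and bias vector.
weights = struct(...
    "kernel", dlarray(0.02*randn(outputSize, inputSize, "single")),...
    "bias", dlarray(zeros(outputSize, 1, "single")));
end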
Thanks, that is clear. Since the BERT model has been trained, there must be model-creation scripts somewhere; I wonder why the demo does not provide a function for creating the model. More generally, if we create a different transformer model, do we need to implement something like createParameterStruct() ourselves? Demos that include functions for creating general transformer models would help popularize this.
For the record, we didn't train this ourselves, we imported the pre-trained weights for the original BERT models. That's why we didn't need to initialize the model ourselves, and don't have a nice pre-training demo.
I agree it would be nice for us to add initializer functions for the parameters that the layers in transformer.layer need - typically we would rely on built-in layers and dlnetwork to handle this for us.
Could you describe what you mean by a general transformer? I know of the BERT encoder-only type, GPT-2 decoder-only type, and encoder-decoder type like the original. Is there something else beyond those?
Sorry for being unclear: by "general transformer" I mean creating custom transformer models from the basic modules, just like building different kinds of CNN networks.
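For what it's worth, the building blocks for that kind of composition are quite small. As a toy illustration only (this is not the implementation in transformer.layer), a single-head scaled dot-product self-attention for one sequence can be written with plain dlarray arithmetic; a real block would add multiple heads, a mask, residual connections, layer normalization, and a feed-forward network on top.

function Z = toySelfAttention(X, Wq, Wk, Wv)
% X is a (hiddenSize x seqLen) dlarray holding one sequence;
% Wq, Wk, Wv are (headSize x hiddenSize) dlarray weight matrices.
Q = Wq * X; % queries, headSize x seqLen
K = Wk * X; % keys, headSize x seqLen
V = Wv * X; % values, headSize x seqLen
scores = (K' * Q) ./ sqrt(size(K, 1)); % seqLen x seqLen
A = softmax(scores, "DataFormat", "CB"); % normalize over the key dimension
Z = V * A; % headSize x seqLen attention output
end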