transformer-models
How to train a BERT model from scratch
How can I train a BERT model from scratch?
We don't have any example code for this, but it is possible. You'll need to do a few things:
- Get a dataset to train on. The original BERT uses Wikipedia and BookCorpus.
- Set up the pre-training tasks. See section 3.1 of the paper or create_training_data.py for how this was done in Python; a minimal masking sketch follows this list.
- Initialize learnable parameters. To use our bert.model implementation you need a struct of parameters in the same format as the Parameters field of the struct that bert() returns. The original weight initialization scheme is detailed in modeling.py.
- Write the pretraining script. This is similar to FineTuneBERT.m, but you'll need to tweak the configuration (mini-batch size, number of epochs, learn rate), and to replicate run_pretraining.py you need additional things in the training loop, such as learn rate warmup/decay and gradient clipping (as in optimization.py); a skeleton of that loop is sketched below. If you're attempting to train at scale you'll probably want to adapt the training loop to use multiple GPUs as in this example.
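As a concrete illustration of step 2, here is a minimal sketch of the masked-LM corruption described in section 3.1 (the 80/10/10 rule implemented in create_training_data.py). The conventions here are assumptions for illustration only: maskID and specialIDs are whatever token IDs your tokenizer uses, and NaN marks positions the MLM loss should ignore.

function [inputIDs, mlmLabels] = maskTokensForMLM(inputIDs, maskID, vocabSize, specialIDs)
% Apply the BERT masked-LM corruption to a row vector of token IDs.
% 15% of the non-special tokens are selected; of those, 80% are replaced
% by the mask token, 10% by a random token, and 10% are left unchanged.
% mlmLabels holds the original ID at selected positions and NaN elsewhere.
mlmLabels = nan(size(inputIDs));
candidates = find(~ismember(inputIDs, specialIDs));
selected = candidates(rand(size(candidates)) < 0.15);
mlmLabels(selected) = inputIDs(selected);
r = rand(size(selected));
inputIDs(selected(r < 0.8)) = maskID;
randomPositions = selected(r >= 0.8 & r < 0.9);
inputIDs(randomPositions) = randi(vocabSize, size(randomPositions));
% The remaining ~10% of selected positions keep their original token.
end

In practice you would apply something like this per sequence while assembling each mini-batch, and only compute the masked-LM loss at the positions where mlmLabels is not NaN.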
It is worth being aware of the conclusions in section 5 of RoBERTa, particularly for setting up the pre-training tasks in step 2.
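To make the last step a bit more concrete, here is a rough skeleton of such a training loop. Everything in it is illustrative: modelGradients is a hypothetical function you would have to write to evaluate the pre-training losses and return gradients as a struct matching mdl.Parameters.Weights, the hyperparameters are placeholders, and the clipping shown is per-parameter rather than the global-norm clipping used in optimization.py.

% Hypothetical hyperparameters for illustration only.
numSteps = 10000;
warmupSteps = 1000;
peakLearnRate = 1e-4;
gradientThreshold = 1;
averageGrad = [];
averageSqGrad = [];

for step = 1:numSteps
    % ... assemble miniBatch from your datastore here ...

    % modelGradients (hypothetical) runs bert.model plus the pre-training
    % heads on the mini-batch and returns the loss and a struct of
    % gradients with the same fields as mdl.Parameters.Weights.
    [loss, gradients] = dlfeval(@modelGradients, miniBatch, mdl.Parameters.Weights);

    % Linear warmup followed by linear decay, roughly as in optimization.py.
    if step < warmupSteps
        learnRate = peakLearnRate * step/warmupSteps;
    else
        learnRate = peakLearnRate * (numSteps - step)/(numSteps - warmupSteps);
    end

    % Clip each parameter's gradient by its L2 norm (optimization.py clips
    % the global norm, which needs an extra pass over the gradient struct).
    clipFcn = @(g) g .* min(1, gradientThreshold./max(sqrt(sum(g.^2,"all")), eps));
    gradients = dlupdate(clipFcn, gradients);

    % Adam update of the whole parameter struct.
    [mdl.Parameters.Weights, averageGrad, averageSqGrad] = adamupdate( ...
        mdl.Parameters.Weights, gradients, averageGrad, averageSqGrad, step, learnRate);
end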
There's definitely a lot of work here, so I think we should keep this issue open as an enhancement to add a pre-training script.
The given example uses the BERT model from a pretrained struct. How can I create a new BERT model without loading the pretrained weights, as in step 3? The struct that bert() returns still comes from the loaded pretrained model, so how do I get a struct of parameters in the same format as its Parameters field? Do I define the struct first? Any appropriate scripts would be welcome.
If you can use the same initializer for every parameter then the quickest thing you can do is something like:
mdl = bert;
% write an initializer function that
% takes an existing dlarray parameter as input
% and returns a dlarray parameter of the same size.
initializer = @(w) 0.1*randn(size(w),"like",w);
mdl.Parameters.Weights = dlupdate(initializer,mdl.Parameters.Weights);
This is a little limited if you need to do something like use different initializers for the embeddings, linear layers, layer norms, etc. For that case I would write a suite of functions to initialize the struct, that might start like:
function weights = initializeBert()
weights = struct(...
"embeddings",initializeEmbeddings(),...
"encoder_layers",initializeEncoderLayers());
end
function weights = initializeEmbeddings()
% The numbers here are sizes from bert-base
weights = struct(...
"LayerNorm", initializeLayerNorm(768),...
"position_embeddings", initializeEmbedding(768,512),...
"token_type_embeddings", initializeEmbedding(768,2),...
"word_embeddings", initializeEmbedding(768,30522));
end
% etc.
You have to implement initializeEmbedding, initializeLayerNorm, and initializeEncoderLayers. It takes some time, but luckily each encoder layer has the same structure, so you can just write a loop to initialize them all with a single initializeEncoderLayer implementation.
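For reference, a sketch of what those helpers might look like is below. The 0.02 standard deviation mirrors the truncated-normal initializer in modeling.py (plain randn is used here for brevity), but the field names and nesting of each encoder layer are guesses for illustration: before relying on this, compare them against the fields of the pretrained struct that bert() returns.

function weights = initializeEmbedding(hiddenSize, numEmbeddings)
% One embedding matrix, hiddenSize-by-numEmbeddings.
weights = dlarray(0.02*randn(hiddenSize, numEmbeddings, "single"));
end

function weights = initializeLayerNorm(hiddenSize)
% Scale starts at one, offset at zero; field names are assumptions.
weights = struct(...
    "gamma", dlarray(ones(hiddenSize, 1, "single")),...
    "beta", dlarray(zeros(hiddenSize, 1, "single")));
end

function weights = initializeEncoderLayers()
% bert-base has 12 identical encoder layers, so loop over one initializer.
numLayers = 12;
for i = numLayers:-1:1
    weights(i) = initializeEncoderLayer();
end
end

function weights = initializeEncoderLayer()
% The layout below is a guess for illustration only; mirror whatever the
% pretrained struct from bert() actually contains.
hiddenSize = 768; intermediateSize = 3072; % bert-base sizes
weights = struct(...
    "attention", struct(...
        "query", initializeLinear(hiddenSize, hiddenSize),...
        "key", initializeLinear(hiddenSize, hiddenSize),...
        "value", initializeLinear(hiddenSize, hiddenSize),...
        "output", initializeLinear(hiddenSize, hiddenSize),...
        "LayerNorm", initializeLayerNorm(hiddenSize)),...
    "intermediate", initializeLinear(intermediateSize, hiddenSize),...
    "output", initializeLinear(hiddenSize, intermediateSize),...
    "LayerNorm", initializeLayerNorm(hiddenSize));
end

function weights = initializeLinear(outputSize, inputSize)
% A fully connected layer's weight matrix and bias vector.
weights = struct(...
    "kernel", dlarray(0.02*randn(outputSize, inputSize, "single")),...
    "bias", dlarray(zeros(outputSize, 1, "single")));
end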
Thanks, that is clear. Since the BERT model has been trained, there must be model-creation scripts somewhere; I wonder why the demo does not provide a function for creating the model. More generally, if we create a different transformer model, do we need to implement something like createParameterStruct() ourselves? Demos that include functions for creating general transformer models would help popularize this.
For the record, we didn't train this ourselves, we imported the pre-trained weights for the original BERT models. That's why we didn't need to initialize the model ourselves, and don't have a nice pre-training demo.
I agree it would be nice for us to add initializer functions for the parameters that the layers in transformer.layer need - typically we would rely on built-in layers and dlnetwork to handle this for us.
Could you describe what you mean by a general transformer? I know of the BERT encoder-only type, GPT-2 decoder-only type, and encoder-decoder type like the original. Is there something else beyond those?
Sorry for being unclear: by "general transformer" I mean creating custom transformer models from the basic modules, just like building different kinds of CNN networks.
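For what it's worth, the building blocks for that kind of composition are quite small. As a toy illustration only (this is not the implementation in transformer.layer), a single-head scaled dot-product self-attention for one sequence can be written with plain dlarray arithmetic; a real block would add multiple heads, a mask, residual connections, layer normalization, and a feed-forward network on top.

function Z = toySelfAttention(X, Wq, Wk, Wv)
% X is a (hiddenSize x seqLen) dlarray holding one sequence;
% Wq, Wk, Wv are (headSize x hiddenSize) dlarray weight matrices.
Q = Wq * X; % queries, headSize x seqLen
K = Wk * X; % keys, headSize x seqLen
V = Wv * X; % values, headSize x seqLen
scores = (K' * Q) ./ sqrt(size(K, 1)); % seqLen x seqLen
A = softmax(scores, "DataFormat", "CB"); % normalize over the key dimension
Z = V * A; % headSize x seqLen attention output
end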