s3prl
How to extend the number of layers in AALBERT?
I have another question: AALBERT only has 3 Transformer layers, and I want to extend it to 7 or more layers in my experiment. In VGG and other CNN models this is easy, but in AALBERT I found it hard. I tried several approaches by debugging the code to add layers, but they all failed: when I extract the outputs of the first few layers, I cannot get the right `pos_enc` and `attn_mask` that the later Transformer layers need. Do you have an easy way or a suggestion to solve this problem? Thanks a lot! @leo19941227
Hi @myhrbeu,
I'm not sure if I understand your question correctly, but you can change the number of layers by changing this line: https://github.com/s3prl/s3prl/blob/e52439edaeb1a443e82960e6401ae6ab4241def6/s3prl/pretrain/audio_albert/config_model.yaml#L4
For example:
num_hidden_layers: 7
FYI, the share layer mechanism is implemented here: https://github.com/s3prl/s3prl/blob/e52439edaeb1a443e82960e6401ae6ab4241def6/s3prl/upstream/mockingjay/model.py#L317-L320
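The share-layer idea can be sketched as follows. This is a standalone illustration, not the actual s3prl implementation; the class and argument names (`SharedLayerEncoder`, `share`) are made up for this example:

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Minimal sketch of layer sharing: one Transformer layer reused N times.

    Hypothetical example, not the AALBERT code itself.
    """

    def __init__(self, hidden=64, num_layers=7, share=True):
        super().__init__()
        if share:
            # Every entry in the ModuleList is the SAME module object,
            # so all "layers" share one set of weights.
            layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
            self.layers = nn.ModuleList([layer] * num_layers)
        else:
            # Independent weights per layer.
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
                for _ in range(num_layers)
            )

    def forward(self, x):
        # Note: sharing or not, we still run N sequential forward passes.
        for layer in self.layers:
            x = layer(x)
        return x

shared = SharedLayerEncoder(share=True)
unshared = SharedLayerEncoder(share=False)

# parameters() deduplicates shared tensors, so the shared model reports
# the parameter count of a single layer.
n_shared = sum(p.numel() for p in shared.parameters())
n_unshared = sum(p.numel() for p in unshared.parameters())
```

Running this shows `n_unshared == 7 * n_shared`: sharing shrinks the parameter count, not the forward cost.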
Aha! That's an easy way to do my experiments! Thanks a lot! I had spent several hours debugging code to achieve this with a transfer-learning-style approach, and I found it very slow with 7 layers. I have two questions:
1. If I use AALBERT with `config.share_layers=False`, does that mean every layer begins with the same weights and ends up different after training?
2. Is the Transformer layer in AALBERT the same as the one used in NLP tasks? I want to use the pre-trained weights for downstream tasks, so I think loading the weights into my own model would make my experiments much easier. Thanks!
> I found it is very slow when having 7 layers
This is expected and normal: sharing layers (sharing weights across layers) does not make the forward pass faster. You still need to forward through all 7 layers and compute the gradients of each one.
If you read the original ALBERT paper, you will find that it is NOT layer sharing that makes ALBERT faster than BERT; other factors are at play.
To answer your two questions:
- Yes.
- They are the same architecture, but the implementations may differ slightly; with some engineering you should be able to copy the weights into your own model.
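Copying weights into your own model typically comes down to remapping `state_dict` keys. A minimal sketch, where the models, the key names, and the mapping are all hypothetical; inspect both state dicts to build the real mapping for your case:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained encoder (hypothetical, not AALBERT itself).
pretrained = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

class MyModel(nn.Module):
    """Your own model, with differently named parameters."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 8)
        self.fc2 = nn.Linear(8, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

mine = MyModel()

# Map pretrained key names -> your key names (made up for this example).
mapping = {
    "0.weight": "fc1.weight", "0.bias": "fc1.bias",
    "2.weight": "fc2.weight", "2.bias": "fc2.bias",
}
remapped = {mapping[k]: v for k, v in pretrained.state_dict().items() if k in mapping}

# strict=False tolerates keys you chose not to transfer and reports them.
missing, unexpected = mine.load_state_dict(remapped, strict=False)
```

Printing `missing` and `unexpected` after loading is a quick sanity check that every weight you intended to transfer actually landed.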