
How to extend the number of layers in AALBERT?

myhrbeu opened this issue 2 years ago

I have another question. AALBERT has only 3 Transformer layers, and I want to extend it to 7 or more layers in my experiment. In VGG and other CNN models this is easy, but in AALBERT I found it hard. I tried several methods by debugging the code to extend the layers, but all failed, because when I try to take the output of each of the first several layers, I cannot get the right pos_enc and attn_mask that the Transformer layers need. Do you have an easy way or suggestion to solve this problem? Thanks a lot! @leo19941227

myhrbeu avatar May 10 '22 04:05 myhrbeu

Hi @myhrbeu,

I'm not sure if I understand your question correctly, but you can change the number of layers by changing this line: https://github.com/s3prl/s3prl/blob/e52439edaeb1a443e82960e6401ae6ab4241def6/s3prl/pretrain/audio_albert/config_model.yaml#L4

For example:

num_hidden_layers: 7

FYI, the layer-sharing mechanism is implemented here: https://github.com/s3prl/s3prl/blob/e52439edaeb1a443e82960e6401ae6ab4241def6/s3prl/upstream/mockingjay/model.py#L317-L320
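
Roughly speaking, layer sharing just reuses one layer's weights at every depth. Below is a minimal sketch with standard PyTorch modules, not the s3prl code itself; the class and argument names are made up for illustration:

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Hypothetical sketch of ALBERT-style layer sharing (not the s3prl implementation)."""

    def __init__(self, d_model=768, nhead=12, num_hidden_layers=7, share_layers=True):
        super().__init__()
        self.num_hidden_layers = num_hidden_layers
        self.share_layers = share_layers
        # With sharing, only one unique layer is instantiated; otherwise one per depth.
        n_unique = 1 if share_layers else num_hidden_layers
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(n_unique)]
        )

    def forward(self, x, padding_mask=None):
        for i in range(self.num_hidden_layers):
            # Shared: index 0 is reused at every depth. Unshared: each depth has its own weights.
            layer = self.layers[0] if self.share_layers else self.layers[i]
            x = layer(x, src_key_padding_mask=padding_mask)
        return x
```

So with sharing enabled, increasing num_hidden_layers does not add parameters; it only makes the loop (and therefore the forward pass) longer.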

andi611 avatar May 10 '22 11:05 andi611

Aha! That's an easy way to do my experiments, thanks a lot! I had spent several hours debugging the code to achieve this in a transfer-learning-like way, and I found it very slow with 7 layers. I have two questions: 1. If I use AALBERT with config.share_layers=False, does that mean every layer begins with the same weights but ends up different? 2. Is the Transformer layer in AALBERT the same as the one used in NLP tasks? I want to use the pre-trained weights for downstream tasks, so I think loading the weights into my own model will make my experiments much easier. Thanks!

myhrbeu avatar May 10 '22 13:05 myhrbeu

I found it is very slow when having 7 layers

This is expected and normal, because sharing layers (sharing weights across layers) does not make the forward pass faster: you still need to forward through 7 layers and compute the gradients for each of them.

If you read the original ALBERT paper, you will find that it is NOT layer sharing that makes ALBERT faster than BERT; other factors are at play.
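
As a rough illustration (standard PyTorch layers and hypothetical sizes, not the AALBERT implementation): sharing shrinks the parameter count, but a forward pass still goes through all the layers, so the compute per step is about the same.

```python
import torch.nn as nn

d_model, nhead, n_layers = 768, 12, 7
layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

shared_params = sum(p.numel() for p in layer.parameters())  # one set of weights, reused
unshared_params = shared_params * n_layers                  # 7 independent sets

print(f"shared:   {shared_params:,} parameters")
print(f"unshared: {unshared_params:,} parameters")
# Either way, a forward pass still applies 7 Transformer layers,
# so the FLOPs (and most of the training time) are roughly the same.
```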

To answer your two questions:

  1. Yes.
  2. They are conceptually the same, but the implementations might differ slightly; with some engineering you should be able to copy the weights to your own model (see the sketch after this list).
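
For question 2, copying weights usually comes down to matching state_dict keys. A rough sketch is below; the checkpoint path, key names, and target model are placeholders, not the actual AALBERT checkpoint layout, so you will need to adapt the renaming:

```python
import torch
import torch.nn as nn

# Placeholder checkpoint path and layout -- inspect the real AALBERT checkpoint first.
ckpt = torch.load("aalbert_checkpoint.ckpt", map_location="cpu")
pretrained = ckpt.get("model", ckpt)  # some checkpoints nest the weights under a key

# Stand-in for "your own model"; replace with your actual encoder.
my_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=3,
)
own_state = my_model.state_dict()

copied, skipped = [], []
for name, tensor in pretrained.items():
    # Rename keys so they match your module names,
    # e.g. "encoder.layer.0.attention..." -> "layers.0.self_attn..." (illustrative only).
    new_name = name.replace("encoder.", "")
    if new_name in own_state and own_state[new_name].shape == tensor.shape:
        own_state[new_name] = tensor
        copied.append(new_name)
    else:
        skipped.append(name)

my_model.load_state_dict(own_state)
print(f"copied {len(copied)} tensors, skipped {len(skipped)} (check the skipped ones manually)")
```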

andi611 avatar May 10 '22 13:05 andi611