
Does the multi-label classifier optimize the entire model, or does it freeze BERT?

Open fabrahman opened this issue 5 years ago • 5 comments

I wonder whether the default code optimizes the entire model end-to-end, or whether only the additional classifier-layer parameters get updated?

Thanks

fabrahman avatar Mar 26 '20 21:03 fabrahman

I noticed the freeze/requires_grad logic is commented out in learner_cls. @kaushaltrivedi how does one freeze the BERT layers and train only the added custom layer?

aaronbriel avatar Apr 03 '20 13:04 aaronbriel

I tried adding a freeze_transformers_layer conditional in BertLearner's init function that set requires_grad to False for any parameter in named_parameters whose name contains the model_type, but it didn't seem to have any effect. This approach worked for me in another implementation, dramatically reducing training times.
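For reference, here is a minimal standalone sketch of that approach (not fast-bert's actual code; it assumes a Hugging Face BertForSequenceClassification model and "bert" as the model_type prefix in named_parameters):

```python
import torch
from transformers import BertForSequenceClassification

# Hypothetical standalone example of the freezing approach described above.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4
)

model_type = "bert"  # assumption: matches the prefix seen in named_parameters
for name, param in model.named_parameters():
    if model_type in name:
        # Freeze everything except the classifier head.
        param.requires_grad = False

# Build the optimizer after freezing, over the trainable parameters only.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```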

aaronbriel avatar Apr 03 '20 15:04 aaronbriel

https://github.com/kaushaltrivedi/fast-bert/pull/195

aaronbriel avatar Apr 14 '20 16:04 aaronbriel

Just for clarification: freezing the layers means that only a single linear layer (i.e. the classifier head?) is trained for classification, rather than all the layers having their weights updated during training?

lingdoc avatar Apr 15 '20 03:04 lingdoc

That is correct. Note that I did confirm this by looping through all parameters and printing each name along with its requires_grad setting (summarized for brevity):

```
bert.embeddings.word_embeddings.weight, requires_grad:False
bert.embeddings.position_embeddings.weight, requires_grad:False
bert.embeddings.token_type_embeddings.weight, requires_grad:False
...
bert.encoder.layer.11.output.LayerNorm.bias, requires_grad:False
bert.pooler.dense.weight, requires_grad:False
bert.pooler.dense.bias, requires_grad:False
classifier.weight, requires_grad:True
classifier.bias, requires_grad:True
```
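For anyone wanting to reproduce this check, a loop along these lines (with model being the learner's underlying PyTorch model) produces the listing above:

```python
# Minimal verification sketch: print every parameter name with its
# requires_grad flag; only the classifier head should report True.
for name, param in model.named_parameters():
    print(f"{name}, requires_grad:{param.requires_grad}")
```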

aaronbriel avatar Apr 15 '20 14:04 aaronbriel