
[LayerSkip] Per-Layer Dropout Rate Configuration

[Open] mostafaelhoushi opened this issue 7 months ago · 0 comments

Describe the solution you would like: Enable configuring a different layer dropout rate for each layer, rather than a single rate shared by all layers.

Describe the alternatives you have considered: Currently, layer dropout is implemented in fairseq2 as a single scalar probability applied to all layers (check here). We could follow an implementation similar to this PR in torchtune to support linear, exponential, or step schedules that increase the dropout rate across layers.
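
A minimal sketch of what such schedules could compute, assuming a hypothetical helper `layer_drop_probs` (the function name, the `scale` knob, and the exact schedule semantics are illustrative, not fairseq2 or torchtune APIs):

```python
import math
from typing import List

def layer_drop_probs(
    num_layers: int,
    max_p: float,
    schedule: str = "uniform",
    scale: float = 1.0,  # hypothetical steepness knob for "exp"
) -> List[float]:
    """Return one drop probability per layer, growing with depth."""
    if num_layers == 1:
        return [max_p]
    probs = []
    for layer in range(num_layers):
        frac = layer / (num_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        if schedule == "uniform":
            p = max_p
        elif schedule == "linear":
            p = max_p * frac
        elif schedule == "exp":
            # Grows exponentially from 0 at the first layer to max_p at the last.
            p = max_p * (math.exp(scale * frac) - 1) / (math.exp(scale) - 1)
        elif schedule == "step":
            # Drop only the deeper half of the layers, all at max_p.
            p = max_p if frac >= 0.5 else 0.0
        else:
            raise ValueError(f"unknown schedule: {schedule}")
        probs.append(p)
    return probs
```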

Additional Context: This will enable implementing:

  • Progressive Layer Dropping: which claims faster and more accurate training when the dropout rate increases linearly across layers
  • LayerSkip: which claims higher accuracy for early-exit layers when the dropout rate increases linearly or exponentially across layers (see the sketch after this list)
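
For context, a hedged sketch of how per-layer probabilities might be consumed at training time via stochastic depth; `DroppableStack` is a hypothetical wrapper, not a fairseq2 class, and the actual fairseq2 implementation may differ:

```python
import torch
import torch.nn as nn

class DroppableStack(nn.Module):
    """Applies a stack of layers, skipping each one with its own probability."""

    def __init__(self, layers: nn.ModuleList, drop_probs: list):
        super().__init__()
        assert len(layers) == len(drop_probs)
        self.layers = layers
        self.drop_probs = drop_probs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer, p in zip(self.layers, self.drop_probs):
            # Skip this layer with probability p, at training time only.
            if self.training and torch.rand(()) < p:
                continue  # identity: the residual stream passes through unchanged
            x = layer(x)
        return x
```

With the earlier helper, `layer_drop_probs(len(layers), max_p=0.2, schedule="linear")` would then yield the per-layer rates that Progressive Layer Dropping prescribes.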

mostafaelhoushi · Jul 08 '24 14:07