DecisionTransformerInterpretability
Investigate the effect of Dropout / Stochastic Depth on Model training/interpretability
From the Gato paper: "Regularization: We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1."
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382, 2016.
Stochastic depth seems plausibly super valuable to me based on intuition. I should read that paper at some point. - Joseph
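For concreteness, here is a minimal PyTorch sketch of what the mechanism described in the Gato quote could look like: each sub-layer (attention, MLP) of a pre-LN transformer block is independently skipped with probability 0.1 during training. The class and parameter names are illustrative, not this repo's actual model code.

```python
import torch
import torch.nn as nn


class StochasticDepthBlock(nn.Module):
    """Pre-LN transformer block where each sub-layer (attention, MLP)
    is independently skipped with probability p_drop during training.
    Illustrative sketch, not the repo's actual architecture."""

    def __init__(self, d_model: int, n_heads: int, p_drop: float = 0.1):
        super().__init__()
        self.p_drop = p_drop
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def _keep_sublayer(self) -> bool:
        # Sample once per forward pass (i.e. per mini-batch, as in
        # Huang et al. 2016). At eval time every sub-layer runs.
        return (not self.training) or bool(torch.rand(()) >= self.p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self._keep_sublayer():
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out  # residual stream passes through unchanged when skipped
        if self._keep_sublayer():
            x = x + self.mlp(self.ln2(x))
        return x


# Quick smoke test of the train-time behaviour.
block = StochasticDepthBlock(d_model=64, n_heads=4, p_drop=0.1)
x = torch.randn(2, 10, 64)  # (batch, seq, d_model)
block.train()
print(block(x).shape)  # torch.Size([2, 10, 64])
```

Note that Huang et al. rescale the surviving residual branch by the survival probability at test time (or equivalently, implementations like timm's DropPath rescale during training); this bare sketch omits that scaling for simplicity.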