no lr_layer_decay_rate for embedding
Thanks for your work. I found that there is no lr_layer_decay_rate for the embedding layer, which is odd because the embedding layer actually sits below the transformer layers.
Here is a PR, FYI. https://github.com/zihangdai/xlnet/pull/93
Does this mean the learning rate decay was only applied to the 24 transformer layers, and not to the embedding layer or the dense layers for the start and end logits? I'm trying to reproduce the paper's results in PyTorch. @ymcui @zihangdai
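For context, here is roughly what I'm setting up on the PyTorch side. This is a minimal sketch only: the model class, the attribute names (word_embedding, layers, qa_head), and the hyperparameters are placeholders of mine, not names from the repo. It treats the embedding as sitting one level below transformer layer 0, which is what the linked PR proposes.

```python
import torch
import torch.nn as nn

class TinyXLNet(nn.Module):
    """Stand-in for the real model; attribute names are illustrative only."""
    def __init__(self, n_layer=4, d_model=32, vocab=100):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_layer)])
        self.qa_head = nn.Linear(d_model, 2)  # start/end logits

def layerwise_lr_groups(model, base_lr, decay_rate):
    """Layer l (0 = bottom) gets lr = base_lr * decay_rate**(n_layer - 1 - l).
    The embedding is placed one step below layer 0, so it gets one extra
    factor of decay_rate; the task head on top gets the full base_lr."""
    n_layer = len(model.layers)
    groups = [{"params": model.word_embedding.parameters(),
               "lr": base_lr * decay_rate ** n_layer}]
    for l, layer in enumerate(model.layers):
        groups.append({"params": layer.parameters(),
                       "lr": base_lr * decay_rate ** (n_layer - 1 - l)})
    groups.append({"params": model.qa_head.parameters(), "lr": base_lr})
    return groups

model = TinyXLNet()
opt = torch.optim.AdamW(layerwise_lr_groups(model, base_lr=3e-5, decay_rate=0.75))
```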
@hlums
see: https://github.com/zihangdai/xlnet/blob/master/model_utils.py#L149
As the code shows, lr_layer_decay_rate is currently applied only to the transformer layers, not to the other parts (embedding, etc.).
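Concretely, the multiplier there is applied by matching each variable's name against the transformer layer pattern, so any variable that doesn't match (the embedding table, the start/end logit dense layers) keeps a gradient multiplier of 1.0. A rough sketch of that selection logic (the regex and variable names are approximated from the linked code, not copied verbatim):

```python
import re

def layer_decay_multiplier(var_name, n_layer, decay_rate):
    """Return the gradient multiplier applied during training.
    Only variables under model/transformer/layer_<l>/ match; everything
    else (embedding, start/end logit dense layers) falls through to 1.0."""
    m = re.search(r"model/transformer/layer_(\d+)/", var_name)
    if m is None:
        return 1.0  # embedding and head variables are left undecayed
    l = int(m.group(1))
    return decay_rate ** (n_layer - 1 - l)

names = [
    "model/transformer/word_embedding/lookup_table",  # -> 1.0 (the issue here)
    "model/transformer/layer_0/ff/kernel",            # -> 0.75 ** 23
    "model/transformer/layer_23/ff/kernel",           # -> 1.0 (top layer)
    "start_logits/dense/kernel",                      # -> 1.0
]
for n in names:
    print(n, layer_decay_multiplier(n, n_layer=24, decay_rate=0.75))
```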