Yutao Sun
1. --share-decoder-input-output-embed saves model parameters, especially when the model is small, and the performance is almost the same. We enable it in our experiments (see the weight-tying sketch below).
2. Don't activate --subln or --deepnorm....
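A weight-tying sketch in plain PyTorch, showing roughly what sharing the input and output embeddings amounts to; the module and dimension names are illustrative, not the actual torchscale/fairseq code behind --share-decoder-input-output-embed.

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size: int = 32000, embed_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out_proj = nn.Linear(embed_dim, vocab_size, bias=False)
        # Tie the output projection to the embedding, so the vocab-sized
        # weight matrix is stored (and learned) only once.
        self.out_proj.weight = self.embed.weight

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(tokens)  # (batch, seq, embed_dim)
        # ... decoder layers would run here ...
        return self.out_proj(hidden)  # logits over the vocabulary
```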
@simran-arora It's better to set bias=False in both the layer norm and nn.Linear. Besides, would you mind sharing the training details with us, e.g. the corpus, model size, and hyper-parameters? We'd like...
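A minimal sketch of the bias-free setup, assuming PyTorch >= 2.1 (where nn.LayerNorm accepts bias=False); the dimensions are placeholders.

```python
import torch
import torch.nn as nn

embed_dim = 512
proj = nn.Linear(embed_dim, embed_dim, bias=False)  # no additive bias
norm = nn.LayerNorm(embed_dim, bias=False)          # scale only, no shift (PyTorch >= 2.1)

x = torch.randn(2, 16, embed_dim)
y = norm(proj(x))
```

On older PyTorch versions, a bias-free RMSNorm (as RetNet uses) is a drop-in alternative to a LayerNorm without bias.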
@Kratos-Wen ```decoder_retention_heads``` affects ```key_dim```, which we recommend setting to 256.
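A rough arithmetic sketch of that recommendation, assuming the per-head key dimension equals the decoder embedding dimension divided by ```decoder_retention_heads``` (the helper below is hypothetical; check the torchscale config for the exact relation).

```python
def retention_heads_for(embed_dim: int, target_key_dim: int = 256) -> int:
    # Pick the number of retention heads so that each head's key dim is ~256.
    assert embed_dim % target_key_dim == 0, "embed_dim should be divisible by the target key_dim"
    return embed_dim // target_key_dim

print(retention_heads_for(1024))  # 4 heads, each with key_dim 256
print(retention_heads_for(2048))  # 8 heads, each with key_dim 256
```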
In ```parallel_forward```, you can try setting the padding positions to 0 after ```mask = mask / mask.sum(dim=-1, keepdim=True).sqrt()```. Your implementation also looks fine to me. In inference, the padding token doesn't...
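A hedged sketch of that suggestion: build the decay mask, apply the normalization quoted above, then zero out the key positions that correspond to padding. The function name, tensor shapes, and ```padding_mask``` argument are illustrative assumptions, not the exact torchscale signature.

```python
from typing import Optional

import torch

def normalized_decay_mask(decay: torch.Tensor, seq_len: int,
                          padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    # decay: (num_heads,) per-head decay rates in (0, 1)
    # padding_mask: optional (batch, seq_len) bool tensor, True at padding positions
    idx = torch.arange(seq_len, dtype=decay.dtype)
    diff = (idx[:, None] - idx[None, :]).clamp(min=0)    # i - j, clamped above the diagonal
    causal = idx[:, None] >= idx[None, :]                # lower-triangular causal mask
    mask = (decay[:, None, None] ** diff) * causal       # (heads, seq, seq): decay ** (i - j)
    mask = mask / mask.sum(dim=-1, keepdim=True).sqrt()  # normalization from the quoted line
    mask = torch.nan_to_num(mask)
    if padding_mask is not None:
        # set padded key positions to 0 so they contribute nothing to retention
        mask = mask.unsqueeze(0).masked_fill(padding_mask[:, None, None, :], 0.0)
    return mask
```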
It seems to be a bug caused by an apex conflict. You can use the official NVIDIA Docker image. For simplicity, you can also replace the apex RMSNorm with a plain PyTorch implementation.
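A minimal pure-PyTorch RMSNorm that can stand in for the apex fused version; it is slower than the fused CUDA kernel but removes the apex dependency. This is a generic sketch, not necessarily the exact module used in the repo.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the last dim (no mean subtraction, no bias).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```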