Yutao Sun

Results 5 comments of Yutao Sun

1. --share-decoder-input-output-embed saves model parameters, especially when the model is small. The performance is almost the same. We enable it in our experiments.
2. Don't activate --subln or --deepnorm....

@simran-arora It's better to set bias=False both in layer norm and nn.Linear. Besides, would you mind sharing the training details with us? e.g. corpus, model size, and hyper-parameters. We'd like...

@Kratos-Wen ```decoder_retention_heads``` affects ```key_dim```, which is recommended to be set to 256.
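As a rough illustration of how the head count determines the per-head key dimension, the sketch below assumes `key_dim = embed_dim // num_heads`. That relation is an assumption about how the retention layer derives it, and the helper `heads_for_key_dim` and the example widths are hypothetical:

```python
def heads_for_key_dim(embed_dim: int, key_dim: int = 256) -> int:
    """Return the head count that gives the requested per-head key dimension.

    Assumes key_dim = embed_dim // num_heads (an assumption, not a confirmed API).
    """
    if embed_dim % key_dim != 0:
        raise ValueError(
            f"embed_dim={embed_dim} is not divisible by key_dim={key_dim}"
        )
    return embed_dim // key_dim


for embed_dim in (512, 1024, 2048):
    heads = heads_for_key_dim(embed_dim)
    print(f"embed_dim={embed_dim}: decoder_retention_heads={heads}, "
          f"key_dim={embed_dim // heads}")
```

Under this assumption, a wider model needs proportionally more heads to keep the key dimension at 256.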

In ```parallel_forward```, you can try setting the padding positions to 0 after ```mask = mask / mask.sum(dim=-1, keepdim=True).sqrt()```. Your implementation also looks fine to me. In inference, the padding token doesn't...
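One way this could look is sketched below. The sketch rebuilds a causal decay mask from scratch instead of reusing the library's code, and it assumes "padding" means the key columns of padded tokens. The function names, the `padding_mask` layout (`True` marks padding), and the decay values are all hypothetical:

```python
import torch


def decay_mask(seq_len: int, log_decay: torch.Tensor) -> torch.Tensor:
    """Causal decay mask of shape (heads, seq_len, seq_len).

    log_decay holds one negative value per head, so entry (n, m) is
    exp(log_decay * (n - m)) for m <= n and 0 above the diagonal.
    """
    index = torch.arange(seq_len, dtype=torch.float32)
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
    distance = (index[:, None] - index[None, :]).masked_fill(~causal, float("inf"))
    # inf * negative = -inf, and exp(-inf) = 0, which removes future positions.
    return torch.exp(distance * log_decay[:, None, None])


def normalized_mask_with_padding(
    seq_len: int, log_decay: torch.Tensor, padding_mask: torch.Tensor
) -> torch.Tensor:
    """Normalize the decay mask, then zero the columns of padded keys.

    padding_mask: (batch, seq_len) bool tensor, True where the token is padding.
    Returns a tensor of shape (batch, heads, seq_len, seq_len).
    """
    mask = decay_mask(seq_len, log_decay)
    # Normalization step quoted in the comment above.
    mask = mask / mask.sum(dim=-1, keepdim=True).sqrt()
    # Zero padded key positions AFTER normalization (assumed interpretation).
    keep = (~padding_mask)[:, None, None, :].to(mask.dtype)
    return mask.unsqueeze(0) * keep


log_decay = torch.log(torch.tensor([0.9, 0.95]))      # two heads
padding = torch.tensor([[True, False, False],          # left-padded sample
                        [False, False, False]])        # sample without padding
out = normalized_mask_with_padding(3, log_decay, padding)
```

Because the zeroing happens after the normalization, the row sums are still computed over the full causal window, padding included.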

This looks like a bug caused by an apex conflict. You can use the official NVIDIA Docker image. For simplicity, you can also replace the apex version of RMSNorm with a PyTorch version.
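A plain PyTorch RMSNorm could be sketched as follows. This is a generic implementation rather than the repository's own code, and the class layout, the `eps` default, and the `elementwise_affine` switch are assumptions:

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """Root-mean-square layer norm without any apex dependency."""

    def __init__(self, dim: int, eps: float = 1e-6, elementwise_affine: bool = True):
        super().__init__()
        self.eps = eps
        if elementwise_affine:
            self.weight = nn.Parameter(torch.ones(dim))
        else:
            self.register_parameter("weight", None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute the statistics in float32 for stability, then cast back.
        x32 = x.float()
        inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        out = (x32 * inv_rms).type_as(x)
        if self.weight is not None:
            out = out * self.weight
        return out


norm = RMSNorm(8)
x = torch.randn(2, 5, 8)
y = norm(x)  # same shape as x, unit RMS along the last dimension
```

Swapping this in for the fused apex module trades some speed for fewer build and version dependencies.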