attention-is-all-you-need-pytorch
Convergence / Overfit issue
This repo is really well-written, so I decided to use it for my task: question generation.
In my architecture, I'm using only DecoderTransformer (not the whole Transformer). But I have a convergence issue, similar to #101, where the model can't even overfit 10 samples.
As mentioned there, I tried decreasing the learning rate as well as changing the optimizer, but nothing works; my model simply never converges.
I'm wondering if anyone else has hit this convergence issue, and how they resolved it. Thanks for the help!
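For reference, here is roughly the sanity check I'm running (a minimal self-contained sketch; a toy LSTM language model stands in for my actual decoder, and all sizes are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Overfit-a-tiny-batch sanity check: a healthy model and training loop
# should drive the training loss close to 0 on 10 fixed samples.
torch.manual_seed(0)
vocab, seq_len, batch = 50, 12, 10

class ToyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, 64)
        self.rnn = nn.LSTM(64, 128, batch_first=True)
        self.out = nn.Linear(128, vocab)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = ToyLM()
tgt = torch.randint(1, vocab, (batch, seq_len))  # 10 fixed samples
tgt[:, 0] = torch.arange(batch)  # unique first token so targets never conflict
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(500):
    optimizer.zero_grad()
    logits = model(tgt[:, :-1])                   # teacher forcing
    loss = F.cross_entropy(logits.reshape(-1, vocab),
                           tgt[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, round(loss.item(), 4))
# If the loss doesn't approach 0 here, the problem is in the model or
# the training loop (masking, LR, optimizer), not in the data.
```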
I could make it converge:

- By using `BertAdam` from pytorch-pretrained-BERT
- By not sharing the weights of the final target embeddings (`tgt_emb_prj_weight_sharing = False`)
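Roughly how I wired it up, in case it helps someone. The `Transformer` argument names are from the version of this repo I'm using, the `BertAdam` signature is from pytorch-pretrained-BERT, and all hyperparameter values are placeholders; double-check both against your installed versions:

```python
from pytorch_pretrained_bert import BertAdam
from transformer.Models import Transformer

# Untie the target embedding from the final output projection.
model = Transformer(
    n_src_vocab=32000, n_tgt_vocab=32000, len_max_seq=100,
    d_word_vec=512, d_model=512, d_inner=2048,
    n_layers=6, n_head=8, d_k=64, d_v=64, dropout=0.1,
    tgt_emb_prj_weight_sharing=False,    # the change that helped
    emb_src_tgt_weight_sharing=False)

# BertAdam handles linear warmup + decay internally, so no external
# LR scheduler is needed; t_total is the planned number of updates.
n_train_steps = 100000
optimizer = BertAdam(model.parameters(), lr=5e-5,
                     warmup=0.1, t_total=n_train_steps)
```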
Now I have another problem: my architecture is overfitting...
I tried scaling down some parameters, but it still overfits, and that hurts performance.
If anyone has some insights to share, I'd be glad to hear them!
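The levers I'm looking at are the usual regularization ones: higher dropout, weight decay, early stopping on validation loss, and label smoothing. For the last one, here is a minimal self-contained sketch of what I mean (my own version, not the repo's built-in loss):

```python
import torch

def label_smoothed_loss(logits, target, eps=0.1, pad_idx=0):
    """Cross-entropy with label smoothing; PAD positions are ignored."""
    n_class = logits.size(-1)
    log_prob = logits.log_softmax(dim=-1)
    # Smoothed target distribution: 1 - eps on the gold class,
    # eps spread uniformly over the remaining classes.
    smooth = torch.full_like(log_prob, eps / (n_class - 1))
    smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)
    loss = -(smooth * log_prob).sum(dim=-1)
    mask = target.ne(pad_idx)
    return loss.masked_select(mask).mean()

# Demo on random tensors: (batch, seq_len, vocab) logits vs. targets.
logits = torch.randn(4, 7, 100)
target = torch.randint(1, 100, (4, 7))
print(label_smoothed_loss(logits, target).item())
```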
So, could I know how to solve the overfitting problem? Thanks!! ^_^