attention-is-all-you-need-pytorch
Convergence / Overfit issue
This repo is really well-written, so I decided to use it for my task: question generation.
In my architecture, I'm using only DecoderTransformer (not the whole Transformer). But I have a convergence issue, similar to #101, where the model can't even overfit 10 samples.
As mentioned there, I tried decreasing the learning rate as well as changing the optimizer, but nothing works; my model simply never converges.
I'm wondering if anyone else has hit this convergence issue, and how they resolved it. Thanks for the help!
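For reference, here is roughly the sanity check I'm running (a minimal self-contained sketch; a toy LSTM language model stands in for my actual decoder, and all sizes are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Overfit-a-tiny-batch sanity check: a healthy model and training loop
# should drive the training loss close to 0 on 10 fixed samples.
torch.manual_seed(0)
vocab, seq_len, batch = 50, 12, 10

class ToyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, 64)
        self.rnn = nn.LSTM(64, 128, batch_first=True)
        self.out = nn.Linear(128, vocab)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = ToyLM()
tgt = torch.randint(1, vocab, (batch, seq_len))  # 10 fixed samples
tgt[:, 0] = torch.arange(batch)  # unique first token so targets never conflict
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(500):
    optimizer.zero_grad()
    logits = model(tgt[:, :-1])                   # teacher forcing
    loss = F.cross_entropy(logits.reshape(-1, vocab),
                           tgt[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, round(loss.item(), 4))
# If the loss doesn't approach 0 here, the problem is in the model or
# the training loop (masking, LR, optimizer), not in the data.
```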
I could make it converge:

- By using `BertAdam` from pytorch-pretrained-BERT
- By not sharing the weights of the final target embeddings (`tgt_emb_prj_weight_sharing = False`)
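Roughly how I wired it up, in case it helps someone. The `Transformer` argument names are from the version of this repo I'm using, the `BertAdam` signature is from pytorch-pretrained-BERT, and all hyperparameter values are placeholders; double-check both against your installed versions:

```python
from pytorch_pretrained_bert import BertAdam
from transformer.Models import Transformer

# Untie the target embedding from the final output projection.
model = Transformer(
    n_src_vocab=32000, n_tgt_vocab=32000, len_max_seq=100,
    d_word_vec=512, d_model=512, d_inner=2048,
    n_layers=6, n_head=8, d_k=64, d_v=64, dropout=0.1,
    tgt_emb_prj_weight_sharing=False,    # the change that helped
    emb_src_tgt_weight_sharing=False)

# BertAdam handles linear warmup + decay internally, so no external
# LR scheduler is needed; t_total is the planned number of updates.
n_train_steps = 100000
optimizer = BertAdam(model.parameters(), lr=5e-5,
                     warmup=0.1, t_total=n_train_steps)
```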
Now I have another problem: my architecture is overfitting...
I tried scaling down some parameters, but it still overfits, and that hurts performance.
If anyone has some insights to share, I'd be glad to hear them!
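The levers I'm looking at are the usual regularization ones: higher dropout, weight decay, early stopping on validation loss, and label smoothing. For the last one, here is a minimal self-contained sketch of what I mean (my own version, not the repo's built-in loss):

```python
import torch

def label_smoothed_loss(logits, target, eps=0.1, pad_idx=0):
    """Cross-entropy with label smoothing; PAD positions are ignored."""
    n_class = logits.size(-1)
    log_prob = logits.log_softmax(dim=-1)
    # Smoothed target distribution: 1 - eps on the gold class,
    # eps spread uniformly over the remaining classes.
    smooth = torch.full_like(log_prob, eps / (n_class - 1))
    smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)
    loss = -(smooth * log_prob).sum(dim=-1)
    mask = target.ne(pad_idx)
    return loss.masked_select(mask).mean()

# Demo on random tensors: (batch, seq_len, vocab) logits vs. targets.
logits = torch.randn(4, 7, 100)
target = torch.randint(1, 100, (4, 7))
print(label_smoothed_loss(logits, target).item())
```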
So, could I know how to solve the overfitting problem? Thanks!! ^_^