
Quirks that hold the model back

Open · murpen opened this issue on Jun 15, 2019 · 4 comments

In "Addendum: Evaluation of My Model" you mention:

Although I used the same amount of hardware (or more), the differences in my training setup and hyperparameters made a significant difference. Which is an unfortunate reality to anyone familiar with reproducing deep learning papers. I don’t think my model in its current state is even as dangerous as 117M in its text generating abilities. But I believe to have found the quirks in my setup that have held the model back, and they are easy to fix.

Are you willing to elaborate on this, and describe or fix the quirks? I think it would be really interesting/informative/useful for students of deep learning as a case study, showing how small non-obvious changes can make a big difference. Please consider doing so :) Thank you.

murpen commented on Jun 15, 2019

I'm currently investigating these quirks, in fact! I'll talk about this more if my hunches are confirmed, but it might take a while.

ConnorJL commented on Jun 16, 2019

Any updates regarding the quirks? I'm really interested in this topic.

Lerbytech commented on Jun 27, 2019

Unfortunately, not much of interest to report so far. I've tried several tweaks, to no avail. I'll continue experimenting for a while before I compile my results.

ConnorJL commented on Jun 28, 2019

One of the main suspects for my model's worse performance is weight initialization. I just pushed some new code that supports different kinds of weight initialization and should bring it closer to the original work (though we can't know for sure, since there is no public code showing how the original weight initialization was done).
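
To give a rough idea of what "different kinds of weight initialization" means here, below is a minimal numpy sketch of a configurable initializer; the function and argument names are purely illustrative and not the actual code I pushed. The 0.02-stddev normal draw matches what OpenAI's released model code uses, and the 1/sqrt(N) rescaling is the residual-layer scaling described in the GPT-2 paper:

```python
import numpy as np

def init_weight(shape, scheme="normal", stddev=0.02, n_layers=None, seed=None):
    """Sample an initial weight tensor under a chosen scheme (illustrative only).

    "normal": N(0, stddev^2); the 0.02 stddev matches OpenAI's released model code.
    "scaled": N(0, stddev^2) rescaled by 1/sqrt(n_layers), as the GPT-2 paper
              describes for residual-projection weights.
    "glorot": uniform Glorot/Xavier initialization, for comparison.
    """
    rng = np.random.default_rng(seed)
    if scheme == "normal":
        return rng.normal(0.0, stddev, size=shape)
    if scheme == "scaled":
        if n_layers is None:
            raise ValueError("scaled init needs n_layers")
        return rng.normal(0.0, stddev, size=shape) / np.sqrt(n_layers)
    if scheme == "glorot":
        fan_in, fan_out = shape[0], shape[-1]
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=shape)
    raise ValueError(f"unknown scheme: {scheme}")

# Example: the same 768x768 projection initialized under three different schemes.
w_normal = init_weight((768, 768), scheme="normal")
w_scaled = init_weight((768, 768), scheme="scaled", n_layers=12)
w_glorot = init_weight((768, 768), scheme="glorot")
```

The per-layer differences between schemes like these are small, but they can compound across a deep stack, which is why initialization is high on my suspect list.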

ConnorJL commented on Jul 12, 2019