llm.c
WikiText-103 eval, attempt to reproduce Alec's table posted on Reddit
I get:
gpt2-124M nll: 3.058462142944336, ppl: 21.294784545898438
And we're supposed to get the number cited in https://www.reddit.com/r/MachineLearning/comments/oye64h/comment/h7ucco2/, i.e. 1.17 (which I assume is the nll), so we're not even close to the right order of magnitude.
My own failed attempt is at https://github.com/karpathy/llm.c/issues/246.
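(For context on how the two numbers above relate: perplexity is just exp of the mean per-token NLL, so the reported pair is at least self-consistent. A tiny check, nothing specific to the eval script:)

```python
import math

nll = 3.058462142944336  # mean negative log-likelihood per token, in nats
ppl = math.exp(nll)      # perplexity = exp(mean per-token NLL)
print(ppl)               # ≈ 21.2948, matching the reported ppl
```

By the same relation, 1.17 read as an nll would imply ppl ≈ exp(1.17) ≈ 3.2, which seems implausibly low for a 124M model on WikiText, so the 1.17 likely isn't the number we should be comparing against at all.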
The formatting of the table is broken: the leftmost column is labeled WikiText-2 but actually shows the number of parameters for each model. I believe the column names should be shifted one to the right, which would put the WikiText scores in the 20-40 range.
@joeshmoe0112358 ohhh that makes sense RE: column names 🤦‍♂️. But ok, I am getting ppl ~21 here, and Alec is citing 37.5 for this model
Yes, I am highly skeptical of the reliability of Alec's table, because 37.50 is exactly what the GPT-2 paper reports even though the two used very different methods (in fact, the first 3 rows of that column are identical to the paper's). Also, if Alec really did evaluate with stride length 32, we would expect his numbers to be significantly lower than the paper's, since lower is better for perplexity (e.g. see the highlighted text).
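For reference on the stride point: a strided/sliding-window eval scores only the tokens that are new in each window, so with stride 32 and a 1024-token context nearly every scored token sees close to a full context, which should give a strictly lower (better) perplexity than the paper's chunked eval. A rough sketch of that procedure, assuming the HuggingFace `gpt2` checkpoint and a raw WikiText-103 test string (this is not the exact script Alec used, nor the one from issue #246):

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumptions: HuggingFace "gpt2" (124M) as the model, and `text` holding the
# raw WikiText-103 test split; neither is necessarily what Alec's table used.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def strided_perplexity(text: str, context_len: int = 1024, stride: int = 32):
    ids = tokenizer(text, return_tensors="pt").input_ids   # shape (1, T)
    seq_len = ids.size(1)
    nll_sum, n_scored = 0.0, 0
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + context_len, seq_len)
        trg_len = end - prev_end                   # tokens newly scored in this window
        window = ids[:, begin:end]
        targets = window.clone()
        targets[:, :-trg_len] = -100               # mask pure-context tokens out of the loss
        loss = model(window, labels=targets).loss  # mean NLL over the scored (shifted) labels
        n = (targets[:, 1:] != -100).sum().item()  # labels actually counted after HF's internal shift
        nll_sum += loss.item() * n
        n_scored += n
        prev_end = end
        if end == seq_len:
            break
    mean_nll = nll_sum / n_scored
    return mean_nll, math.exp(mean_nll)

# Smaller stride -> more context per scored token -> lower ppl, so a stride-32 run
# should beat the GPT-2 paper's numbers rather than match them.
# nll, ppl = strided_perplexity(open("wiki.test.raw").read(), stride=32)
```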
We are abandoning WikiText-103 because it's a total mess. We'll instead look at one or a few of ARC Easy/Challenge, SQuAD, HellaSwag, TriviaQA, LAMBADA. Closing.