llm.c
WikiText-103 eval, attempt to reproduce Alec's table posted on Reddit
I get:
gpt2-124M nll: 3.058462142944336, ppl: 21.294784545898438
And we're supposed to get the number cited in https://www.reddit.com/r/MachineLearning/comments/oye64h/comment/h7ucco2/, i.e. 1.17 (which I assume is the nll), so we're not even close to the right order of magnitude.
My own failed attempt is at https://github.com/karpathy/llm.c/issues/246.
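(For context on how the two numbers above relate: perplexity is just exp of the mean per-token NLL, so the reported pair is at least self-consistent. A tiny check, nothing specific to the eval script:)

```python
import math

nll = 3.058462142944336  # mean negative log-likelihood per token, in nats
ppl = math.exp(nll)      # perplexity = exp(mean per-token NLL)
print(ppl)               # ≈ 21.2948, matching the reported ppl
```

By the same relation, 1.17 read as an nll would imply ppl ≈ exp(1.17) ≈ 3.2, which seems implausibly low for a 124M model on WikiText, so the 1.17 likely isn't the number we should be comparing against at all.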
The formatting of the table is broken: the leftmost column is labeled WikiText-2 but actually shows the number of parameters for each model. I believe the column names should be shifted one to the right, which would put the WikiText scores in the 20-40 range.
@joeshmoe0112358 ohhh that makes sense RE: column names 🤦‍♂️. But ok, I am getting ppl ~21 here, and Alec is citing 37.5 for this model
Yes, I am highly skeptical of the reliability of Alec's table, because 37.50 is exactly what the GPT-2 paper reports even though the two used very different methods (in fact, the first 3 rows of that column are identical to the paper's). Also, if Alec really did evaluate with stride length 32, we would expect his numbers to be significantly lower than the paper's, since lower is better for perplexity (e.g. see the highlighted text).
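For reference on the stride point: a strided/sliding-window eval scores only the tokens that are new in each window, so with stride 32 and a 1024-token context nearly every scored token sees close to a full context, which should give a strictly lower (better) perplexity than the paper's chunked eval. A rough sketch of that procedure, assuming the HuggingFace `gpt2` checkpoint and a raw WikiText-103 test string (this is not the exact script Alec used, nor the one from issue #246):

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumptions: HuggingFace "gpt2" (124M) as the model, and `text` holding the
# raw WikiText-103 test split; neither is necessarily what Alec's table used.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def strided_perplexity(text: str, context_len: int = 1024, stride: int = 32):
    ids = tokenizer(text, return_tensors="pt").input_ids   # shape (1, T)
    seq_len = ids.size(1)
    nll_sum, n_scored = 0.0, 0
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + context_len, seq_len)
        trg_len = end - prev_end                   # tokens newly scored in this window
        window = ids[:, begin:end]
        targets = window.clone()
        targets[:, :-trg_len] = -100               # mask pure-context tokens out of the loss
        loss = model(window, labels=targets).loss  # mean NLL over the scored (shifted) labels
        n = (targets[:, 1:] != -100).sum().item()  # labels actually counted after HF's internal shift
        nll_sum += loss.item() * n
        n_scored += n
        prev_end = end
        if end == seq_len:
            break
    mean_nll = nll_sum / n_scored
    return mean_nll, math.exp(mean_nll)

# Smaller stride -> more context per scored token -> lower ppl, so a stride-32 run
# should beat the GPT-2 paper's numbers rather than match them.
# nll, ppl = strided_perplexity(open("wiki.test.raw").read(), stride=32)
```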
We are abandoning WikiText-103 because it's a total mess. We'll instead look at one or a few of ARC Easy/Challenge, SQuAD, HellaSwag, TriviaQA, LAMBADA. Closing.