Adding GPT2 model evaluation on WikiText-103 with optional preprocessing in dev/model_eval/
This is ultimately in response to #246.
In this PR specifically, I propose adding two files in a new dev/model_eval/ folder. One file prepares WikiText-103 for model evaluation, with an optional argument that preprocesses the text. The other file evaluates each GPT-2 model size on the prepared evaluation data, using Hugging Face.
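For reviewers, the core of the evaluation can be sketched model-agnostically. This is a minimal illustration of the standard sliding-window perplexity loop, not the PR code itself; `nll_fn` is a hypothetical stand-in for a GPT-2 forward pass (in the actual script that call goes through Hugging Face):

```python
import math

def sliding_window_ppl(tokens, nll_fn, max_length=1024, stride=1024):
    """Perplexity of a long token stream via a sliding window.

    nll_fn(window, n_target) must return the total negative log-likelihood
    (in nats) of the last n_target tokens of `window`; the earlier tokens
    serve only as context.  With stride == max_length the windows don't
    overlap; smaller strides give each scored token more context.
    """
    total_nll, total_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + max_length, len(tokens))
        n_target = end - prev_end  # tokens newly scored in this window
        total_nll += nll_fn(tokens[begin:end], n_target)
        total_tokens += n_target
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(total_nll / total_tokens)
```

With a dummy `nll_fn` that charges 1 nat per scored token, the result is exactly `e`, which is a handy sanity check that every token is scored exactly once regardless of stride.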
Please also refer to my latest message at the bottom of #276, where I describe some contradictions I found regarding reproducing the reported numbers. Additionally, see the repo on my profile called gpt2eval: it contains two folders, each with a Python notebook where I run tests and compute perplexity scores for each GPT-2 model size on different preparations of the WikiText dataset. I suggest reading my report at the bottom of that PR first, then the notebooks in the repo, and finally the code offered in this PR.
For convenience, here is a table summarizing the computations from the tests I ran, compared against the reported GPT-2 numbers:
Perplexity scores from the GPT-2 paper:
| Model Size | WikiText-2 | WikiText-103 |
|---|---|---|
| 117M | 29.41 | 37.50 |
| 345M | 22.76 | 26.37 |
| 762M | 19.93 | 22.05 |
| 1542M | 18.34 | 17.48 |
All of the following numbers are evaluated on the WikiText-103 validation split.
Hugging Face dataset:
| Model Size | Raw | Bare Minimum Preprocessing |
|---|---|---|
| 124M | 30.59 | 31.04 |
| 355M | 22.35 | 22.51 |
| 774M | 19.33 | 20.09 |
| 1558M | 17.46 | 17.91 |
Smerity Dataset:
| Model Size | Raw | My Extensive Preprocessing |
|---|---|---|
| 124M | 30.13 | 33.19 |
| 355M | 21.77 | 24.31 |
| 774M | 18.74 | 21.39 |
| 1558M | 16.91 | 19.32 |
My analysis of the results:
- These tests seem to confirm that something unusual is going on with the numbers reported in the GPT-2 paper. As I mention in my report, the WikiText-2 and WikiText-103 evaluations should be identical, because the val/test splits of the two datasets are respectively identical.
- Assuming the numbers we want to replicate are those in the WikiText-2 column of the GPT-2 paper, the closest results appear to be from the bare-minimum preprocessing of the Hugging Face dataset (I say this from a quick glance, with no supporting calculations). To see the exact differences in how each test's dataset was prepared, you would have to look at the code in the notebooks in my gpt2eval repo.
Thank you @joeshmoe0112358 for looking into this, but it looks like we're basically not able to match the paper's table. In that case I'd at least try to match Alec's post from the Reddit thread, which sounds a lot easier to match because it's on raw data. But here in this PR it looks like you're including a bunch of the post-processing?
To summarize my findings:
- The numbers in the WikiText-2 column should be identical to those in the WikiText-103 column, because the val/test splits are identical between the two datasets. However, they are not.
- Here is Alec's table:

| Model Size | WikiText-2 | WikiText-103 |
|---|---|---|
| 117M | 34.63 | 37.50 |
| 345M | 25.63 | 26.37 |
| 762M | 21.85 | 22.05 |
| 1542M | 20.40 | 20.04 |

The first three rows of his WikiText-103 column are identical to the first three rows of the WikiText-103 column in the paper, which should not be the case, because he allegedly tested on the raw text while the paper did more gymnastics. Additionally, Alec claims a stride length of 32. This cannot be the case: testing with stride length = context length gets you numbers similar to the paper's, and already stride length = (context length / 2) gives substantial improvements in scoring, so stride length = 32 must give at least that much improvement. That improvement is certainly not visible in Alec's table, where the numbers are identical to the paper's or in the same ballpark. Thus I am very skeptical of Alec's table and his claims.
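To make the stride argument concrete, here is a small sketch (a hypothetical helper, not part of this PR) that computes the average number of in-window context tokens each scored token receives under the standard sliding-window scheme. A small stride gives nearly every scored token a near-full context window, which is why it should substantially lower perplexity relative to non-overlapping windows:

```python
def avg_context_per_scored_token(seq_len, max_length, stride):
    """Average count of in-window tokens preceding each scored token.

    Windows start every `stride` tokens; only the tokens not covered by
    a previous window are scored, and each scored token conditions on
    the tokens before it inside its own window.
    """
    total_ctx, scored, prev_end = 0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        for pos in range(prev_end, end):  # newly scored positions
            total_ctx += pos - begin      # in-window tokens before pos
            scored += 1
        prev_end = end
        if end == seq_len:
            break
    return total_ctx / scored
```

With non-overlapping windows (stride = 1024) the average context is about 512 tokens, while stride = 32 pushes it close to the full 1024, so the two settings should not produce near-identical perplexities.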
I hypothesize that there is something wrong with the table in the paper. If we instead assume the WikiText-2 column is the one we wish to replicate (because WikiText-103 and WikiText-2 have identical val/test splits), then we can reasonably reproduce those numbers, as shown in the tables in my previous post. We get percent errors like this:
| Model Size | GPT-2 Paper's WikiText-2 (PPL) | Huggingface WikiText-103 Val Split w/ Bare Minimum Processing (PPL) | Percent Error (%) |
|---|---|---|---|
| 124M | 29.41 | 31.04 | 5.54 |
| 355M | 22.76 | 22.51 | 1.10 |
| 774M | 19.93 | 20.09 | 0.80 |
| 1558M | 18.34 | 17.91 | 2.34 |
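The percent-error column above is just |measured − paper| / paper × 100; a quick sketch to reproduce it:

```python
def percent_error(reference, measured):
    """Relative error of `measured` against `reference`, in percent."""
    return abs(measured - reference) / reference * 100

# Paper's WikiText-2 PPL vs. my Hugging Face bare-minimum-preprocessing PPL.
rows = {
    "124M":  (29.41, 31.04),
    "355M":  (22.76, 22.51),
    "774M":  (19.93, 20.09),
    "1558M": (18.34, 17.91),
}
for size, (paper, mine) in rows.items():
    print(size, round(percent_error(paper, mine), 2))
```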
EDIT: I have been using "pre-processing", "post-processing", "processing", "cleaning", etc. rather loosely in this discussion. Let us just say that, in general, I am preparing the dataset for evaluating a model on it, and optionally cleaning the text to remove things I deem unnecessary or counterproductive to the evaluation.
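As one concrete example of what I mean by "cleaning": the raw WikiText files escape punctuation inside numbers and hyphenated words with `@` markers and put spaces around punctuation. A minimal sketch of undoing that (hypothetical; the exact rules I used live in the notebooks in my gpt2eval repo):

```python
import re

def detokenize_wikitext(text):
    # Undo WikiText's @-escaped punctuation, e.g. "8 @.@ 3" -> "8.3",
    # "well @-@ known" -> "well-known".
    text = re.sub(r" @(.)@ ", r"\1", text)
    # Collapse the space tokenization leaves before common punctuation.
    text = re.sub(r" ([,.;:!?])", r"\1", text)
    return text

print(detokenize_wikitext("a well @-@ known 8 @.@ 3 km road , built in 1990 ."))
```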
We are abandoning WikiText-103 because it's a total mess. We'll instead look at one or a few of ARC Easy/Challenge, SQuAD, HellaSwag, TriviaQA, LAMBADA. Closing.