
Adding WikiText-103 dataset preprocessing and tokenization

Open joeshmoe0112358 opened this issue 10 months ago • 9 comments

This PR enables the preprocessing and tokenization of the WikiText-103 dataset in response to #246.

A few notes I would like to make on my preprocessing of the text:

  • I considered two ways of downloading the dataset

    1. The dataset can be found on Hugging Face at https://huggingface.co/datasets/wikitext, where the wikitext-103-raw-v1 folder contains 4 parquet files (2 for training, 1 for test, 1 for validation). I initially tried to download the dataset this way, but I quickly ran into trouble: unless I used pandas, an extra dependency, handling the parquet files got very messy (please correct me if I am wrong about this, I am still fairly new to this stuff, so maybe there is a way to handle parquet files smoothly without pandas). Because of this I opted for the other option.
    2. Since https://github.com/tysam-code/hlb-gpt/tree/main was mentioned, you can use the same download link used in that code, which gives you a zipped folder containing three raw text files (wiki.train.raw, wiki.valid.raw, wiki.test.raw). I am not entirely sure, but I got the impression that the text in these files is some kind of export of the same data as the Hugging Face parquet version, because it contains odd headers/titles enclosed in "=" on both sides. This download option seemed like the smoother choice to me, so I went with it.
  • To deal with the awkward headers/titles, I removed them, keeping just the paragraphs of text and separating what would have been the paragraphs for a given topic into their own sections with <|endoftext|>. I believe my code handles most, if not all, of these headers, so the model can focus on the text itself and not be confused by unusual headers that would have had to follow <|endoftext|> tokens anyway, leaving the model uncertain whether the text after an <|endoftext|> is a header or an actual paragraph. Now that I am writing this out, though, I realize it is not that big of a problem: if the model decides to produce a header, it will follow it with a paragraph anyway, so this extra effort may be pointless. I will leave it as is in case you don't mind, although I suspect you would prefer simpler preprocessing so that benchmarking stays more standardized. If that is the case, please let me know and I can fix it. (A rough sketch of the header handling is included after this list.)

  • There are possibly better ways of doing this than the way that I did it. If there are, please let me know, and again I can fix it.
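
To make the header handling concrete, here is a rough, hypothetical sketch of the idea (the function name and details are illustrative only, not the exact code in this PR). In the raw files, article titles look like " = Title = " and subsections like " = = Section = = ":

    def strip_headers(raw_text):
        """Drop '=' header lines; start each new article with <|endoftext|>."""
        out = []
        for line in raw_text.splitlines():
            stripped = line.strip()
            if stripped.startswith("=") and stripped.endswith("=") and len(stripped) > 1:
                # a single pair of "=" marks a top-level article title,
                # "= =" (and deeper) marks a subsection header
                if not stripped.startswith("= ="):
                    out.append("<|endoftext|>")
                continue  # drop the header line itself in both cases
            if stripped:
                out.append(stripped)
        return "\n".join(out)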

That pretty much sums it up. On a rough skim, train_gpt2.cu looked like it could already handle this dataset right away by using the -i flag and passing data/wikitext-103, but I may have missed something. Before diving too deep into other parts, I thought I would at least get this code for the dataset in.

joeshmoe0112358 avatar Apr 28 '24 09:04 joeshmoe0112358

This is cool.

Though it fails for me. Is there a cross-platform library that could do the unzip instead of calling an external process?

    # unzip the file
    data_dir = os.path.join(DATA_CACHE_DIR, "wikitext-103")
    if not os.path.exists(data_dir):
        os.makedirs(data_dir, exist_ok=True)
        print(f"Unzipping {data_filename}...")
        os.system(f"unzip {data_filename} -d {data_dir}")
    else:
        print(f"{data_dir} already exists, skipping unzipping...")

This as well. Maybe open the file with encoding="utf8"?

  File "D:\SRC\word2vec\prepro_wikitext-103.py", line 74, in tokenize
    train_text = open(train_data_filename, 'r').read()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 40: character maps to <undefined>

Also a progress indicator would be very helpful.
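
For reference, something along these lines might cover both points. This is only a sketch and assumes the script tokenizes with a tiktoken GPT-2 encoder, as the repo's other data scripts do; train_data_filename is the path from the traceback above:

    import tiktoken
    from tqdm import tqdm

    enc = tiktoken.get_encoding("gpt2")

    # read with an explicit encoding so Windows doesn't fall back to cp1252
    with open(train_data_filename, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # tokenize line by line so tqdm can show progress
    tokens = []
    for line in tqdm(lines, desc="tokenizing"):
        tokens.extend(enc.encode_ordinary(line))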

azret avatar Apr 28 '24 17:04 azret

I am not actually sure why it's not working for you. I just pulled all changes, so I am on the latest version of the repo, ran python prepro_wikitext-103.py, and the script produced this output:

Downloading https://wikitext.smerity.com/wikitext-103-raw-v1.zip to data/WikiText-103.zip...
data/WikiText-103.zip: 100%|████████████████████████████████████████████████████████| 183M/183M [00:09<00:00, 20.5MiB/s]
Unzipping data/WikiText-103.zip...
Archive:  data/WikiText-103.zip
   creating: data/wikitext-103/wikitext-103-raw/
  inflating: data/wikitext-103/wikitext-103-raw/wiki.test.raw
  inflating: data/wikitext-103/wikitext-103-raw/wiki.valid.raw
  inflating: data/wikitext-103/wikitext-103-raw/wiki.train.raw
Saved 241185 tokens to data/wikitext-103_val.bin
Saved 114933466 tokens to data/wikitext-103_train.bin

So it definitely worked for me. Can you tell me more about where it is failing for you? I see you mentioned an error about failing to decode; are there any other error messages or failing parts, or is it only this one? By the way, I like your idea of adding a progress bar and am working on implementing it right now. Thanks for the feedback.

Also, does anyone have recommendations on the surgery I performed on the text, i.e. cutting out the headers that I suspect were carried over from the original parquet files? Or do you think I should just leave the headers in and only add <|endoftext|> tokens between the different sections?

joeshmoe0112358 avatar Apr 28 '24 19:04 joeshmoe0112358

I have just updated it and added the changes that you have suggested. Please let me know if it now works for you or if you are still finding problems with it. Thanks again for your feedback.

joeshmoe0112358 avatar Apr 28 '24 22:04 joeshmoe0112358

Thank you for the update. It now works on Windows as is.

azret avatar Apr 28 '24 23:04 azret

A few useful references that I found with a quick search:

  • https://www.reddit.com/r/MachineLearning/comments/oye64h/r_struggling_to_reproduce_perplexity_benchmarks/h7ucco2/
  • https://huggingface.co/docs/transformers/perplexity
  • https://github.com/huggingface/transformers/issues/483
  • https://github.com/openai/gpt-2/issues/78

So ideally we would reproduce the numbers in Table 3 of the GPT-2 paper PDF, but those numbers use some weird detokenizers and perplexity scaling. So we have two options:

  1. Try to find which detokenizer was used and exactly reproduce Table 3
  2. Ignore the detokenizer and reproduce the perplexity numbers posted by Alec in the Reddit thread on raw WikiText103, with its weirdness and all.

I'm ok with either of these, maybe a slight preference for (1), but ok with (2). As part of this PR and before merging, we should try to reproduce either (1) or (2) as closely as possible, to give confidence that we did things right.

karpathy avatar Apr 29 '24 00:04 karpathy

I am looking into this now and I will update you or ask questions as needed. Thanks for the guidance.

joeshmoe0112358 avatar Apr 29 '24 00:04 joeshmoe0112358

Okay after reading on this carefully here are my takeaways:

  • I am seeing a lot of different perplexity numbers reported across the board, and many people have had difficulty replicating/matching the numbers reported in Table 3 of the GPT-2 paper. Because of this, I think it would be best to first establish a baseline on the raw WikiText-103 text, using Alec's table as a stable anchor point. Once that is achieved, we can consider further work on figuring out how to preprocess and detokenize the text to reproduce the numbers reported in Table 3.
  • Since a large focus of this repo is simplicity, attempting to preprocess the text and reproduce the GPT-2 paper's Table 3 might add complexity and make things messier. So maybe we should just ignore this and only do (2)? That is, unless it is not an issue for the data preprocessing scripts to get tricky, since the main priority is keeping the actual C and CUDA code as clean as possible.
  • When calculating perplexity we should use a stride length of 32 with a window size of 1024, because, according to Alec, this is what was used to evaluate GPT-2 and to produce his table.
  • Although I have not looked at the main code for training and evaluation, I do not recall anything there about computing perplexity, so this may require more code to set up.

In summary, what I plan to do:

  • no preprocessing (including no <|endoftext|> injection) and raw WikiText-103 as a baseline
  • reproduce Alec's WikiText-103 GPT-2 perplexity number using a 1024-token sliding window and a stride of 32 (a rough sketch of this evaluation loop is included after this list)
  • experiment with the Hugging Face evaluation code, since they were able to replicate the gpt2-large number on wikitext-2 (based on their docs on perplexity, https://huggingface.co/docs/transformers/perplexity)
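
As a starting point, here is a sketch adapted from the Hugging Face perplexity tutorial linked above, with the window set to 1024 tokens and the stride to 32. The model and dataset identifiers are the standard Hugging Face ones, and the loss bookkeeping follows the tutorial (which has a small off-by-one at window boundaries), so treat the exact numbers with some care:

    import torch
    from datasets import load_dataset
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    # raw WikiText-103 validation split, joined into one long string
    data = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
    encodings = tokenizer("\n\n".join(data["text"]), return_tensors="pt")

    max_length = 1024  # GPT-2 context window
    stride = 32        # stride reportedly used to produce Alec's table
    seq_len = encodings.input_ids.size(1)

    nlls = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # only score the tokens new to this window
        input_ids = encodings.input_ids[:, begin:end].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # mask out the already-scored context

        with torch.no_grad():
            loss = model(input_ids, labels=target_ids).loss
        nlls.append(loss * trg_len)

        prev_end = end
        if end == seq_len:
            break

    ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
    print(f"perplexity: {ppl.item():.2f}")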

If anything I have said here is wrong or you disagree with, please correct me.

joeshmoe0112358 avatar Apr 29 '24 02:04 joeshmoe0112358

  1. I'd say for now don't worry about the "mainline" code. Work entirely in the dev or doc folders, with self-contained scripts. E.g. I would take the huggingface script above and see if you can adjust it to reproduce Alec's table. You can then push this script to e.g. dev/gpt2_repro.py or something. This script outputs Alec's numbers and isn't part of the rest of the code.
  2. Once we can repro with the huggingface code, we can take a look at how we'd potentially get the same numbers with our own train_gpt2.py or something like it (i.e. using our model instead of the huggingface model). Possibly still a separate script.
  3. Then we'll move these to a function inside our C code that evaluates the numbers
  4. Finally we'll train from scratch and beat the numbers

This PR specifically could just be (1), optionally (2) is a nice stretch goal. Separate PRs can later do (3) and (4).

karpathy avatar Apr 29 '24 02:04 karpathy

Okay, I think there are some contradictions between Alec's table/information and the other numbers.

Claims/Information:

  1. Alec claims to have produced a table of values for perplexity scores evaluated on the raw wikitext-103 and wikitext-2 without any detokenizers, using a stride length of 32 and a window size equal to the context length. The numbers in question are a perplexity score of 37.50 for wikitext-103 and 34.63 for wikitext-2.
  2. The GPT-2 paper's table has a perplexity score of 37.50 on the preprocessed wikitext-103 and 29.41 on wikitext-2.
  3. My tests using the hugging face script produce a perplexity score of 31.04 on the raw wikitext-103 and wikitext-2 val split, and 29.94 on wikitext-103 and wikitext-2 test split.
  4. The huggingface perplexity tutorial reproduces the gpt2-large number on raw wikitext-2 using a stride length of 1024; dropping to a stride of 512 lowers the perplexity from 19.44 (the number reported in the paper is 19.93) to 16.45. This shows how big a difference the stride length can make and reinforces that GPT-2 was evaluated with the stride equal to the context length (though I am also a bit skeptical here, because huggingface used the raw text and got the same number?).

Issues:

  • Under the assumption that the GPT-2 paper ran its evaluation on preprocessed text (and likely on the training split; I will explain why in a bit), we would expect the GPT-2 paper's scores to be lower than Alec's raw-text scores. That seems to hold for wikitext-2 but not for wikitext-103, which makes me skeptical of Alec's wikitext-103 perplexity number.
  • The test and validation splits of the wikitext-103 dataset are identical to the test and validation splits of the wikitext-2 dataset (they are literally the same text), so the perplexity scores reported for wikitext-2 and wikitext-103 should be identical. However, they are not. This could mean a few things:
    1. If they did actually evaluate on the test or validation splits, they must have randomly sampled from the splits to produce this difference in scores (but even then, a gap this large is highly unlikely unless they used a questionably small sample size).
    2. If they did not randomly sample chunks of text from val/test, they may have run their evaluations on the training splits of each dataset.
  • The huggingface script got the same perplexity score as the paper for gpt2-large (using a stride length equal to the context length), but it evaluated on the raw wikitext-2 while the paper evaluated on preprocessed text.

My take on this:

  • I am quite skeptical and actually a bit unsure of what to make of this because of the many contradictions here (e.g. varying stride lengths, same numbers but different methods, different numbers on same text)
  • Reproducing these perplexity scores has proved to be a considerably challenging task that will require extensive testing.
  • It may instead be best to use the numbers we produce with the huggingface library as the anchor point for the mainline implementation, because we at least know exactly where those numbers come from and we can make sure we test llm.c with the exact same method.

What I am going to try next:

  • I am going to run some tests on my wikitext-103 preprocessing code to see what perplexity scores result from evaluating on its val and test splits. If the numbers are similar to the paper's, we can have some confidence that my preprocessing is close to what was done for the paper's tests. However, I kind of doubt this, because we still do not know what text the paper tested on beyond "WikiText103"; still, I think there is a reasonable chance it may work.

I would like your judgement on this. Please let me know if I have made any mistakes or overlooked any details in my report on this issue.

EDIT: It is not valid in general to expect preprocessed text to produce a lower perplexity score, because preprocessing can remove low-hanging-fruit predictions, which ultimately hurts the model's score. I make this invalid assumption in the report above and wish to correct it here.

joeshmoe0112358 avatar May 01 '24 03:05 joeshmoe0112358

We are abandoning WikiText103 because it's a total mess. We'll instead look at one/few of ARC Easy / Challenge, Squad, Hellaswag, TriviaQA, LAMBADA. Closing.

karpathy avatar May 16 '24 22:05 karpathy