llm.c
GPT-2 from scratch
There are two dev scripts in this PR:
1. gpt2-124M-from-scratch.py
Simply creates a new GPT-2 124M model from scratch and saves the corresponding weights to gpt2_124M.bin. This will be useful once full C/CUDA backprop is ready, allowing training from scratch directly in C (see the sketch after this list).
2. prepro_tinyshakespeare_char.py
This creates a very tiny char-level model similar to the one from nanoGPT, which is useful for testing during development (a sketch of the preprocessing idea also follows the list).
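For illustration only, here is a minimal sketch of what creating and serializing a fresh GPT-2 124M model could look like, using nanoGPT-style names (GPTConfig, GPT from model.py). The magic number, header layout, and parameter ordering below are assumptions; the real format is whatever train_gpt2.c expects to read.

```python
# Hypothetical sketch: initialize a randomly-weighted GPT-2 124M and dump fp32 weights.
# GPTConfig/GPT follow nanoGPT naming; the .bin header layout here is an assumption,
# not the exact format the C code reads.
import struct
import numpy as np

from model import GPTConfig, GPT  # nanoGPT-style model definition (assumed available)

config = GPTConfig(block_size=1024, vocab_size=50257, n_layer=12, n_head=12, n_embd=768)
model = GPT(config)  # fresh random initialization, no pretrained weights

with open("gpt2_124M.bin", "wb") as f:
    # a small header carrying the hyperparameters the C side needs (layout assumed)
    f.write(struct.pack("6i", 20240326, config.block_size, config.vocab_size,
                        config.n_layer, config.n_head, config.n_embd))
    # followed by all parameters as float32, in a fixed, agreed-upon order
    for name, p in model.named_parameters():
        f.write(p.detach().numpy().astype(np.float32).tobytes())
```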
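And a rough sketch of the char-level preprocessing idea behind prepro_tinyshakespeare_char.py: build a character vocabulary from tinyshakespeare and write train/val token streams as uint16, as nanoGPT's char-level example does. The paths, split ratio, and on-disk format are assumptions for illustration.

```python
# Hypothetical sketch of char-level preprocessing for tinyshakespeare.
# Paths, split ratio, and output format are assumptions for illustration.
import numpy as np

with open("data/tinyshakespeare/input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# tinyshakespeare has 65 unique characters; the vocab_size of 66 in the log
# suggests one extra token (e.g. end-of-text) was added on top (an assumption)
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
tokens = np.array([stoi[ch] for ch in text], dtype=np.uint16)

n = int(0.9 * len(tokens))  # 90/10 train/val split, as in nanoGPT
tokens[:n].tofile("tiny_shakespeare_train.bin")
tokens[n:].tofile("tiny_shakespeare_val.bin")
print(f"vocab_size: {len(chars)}, train tokens: {n}, val tokens: {len(tokens) - n}")
```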
Additionally, the hardcoded GPT2_EOT token has been removed and moved into the Tokenizer, so we are no longer limited to the GPT-2 tokenizer.
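As an illustration of the idea (not the actual file format), the tokenizer file can carry the end-of-text token id in its header so the C side reads it at load time instead of compiling in a fixed GPT2_EOT. The magic number, version, and field order below are assumptions.

```python
# Hypothetical sketch: write a tokenizer file whose header includes the EOT id,
# so nothing in the C code needs a hardcoded GPT2_EOT constant.
# Magic/version/field order are assumptions for illustration.
import struct

def write_tokenizer(path, token_bytes, eot_token):
    """token_bytes: list mapping token id -> raw bytes of that token."""
    with open(path, "wb") as f:
        # header: magic, version, vocab_size, eot_token (layout assumed)
        f.write(struct.pack("4i", 20240328, 1, len(token_bytes), eot_token))
        for b in token_bytes:
            f.write(struct.pack("B", len(b)))  # 1-byte length prefix
            f.write(b)

# e.g. for a char-level model: ids map 1:1 to single characters, with the last
# id reserved as end-of-text (placeholder vocabulary, purely illustrative)
chars = [bytes([c]) for c in range(32, 97)]
write_tokenizer("tokenizer_char.bin", chars + [b"<|endoftext|>"], eot_token=len(chars))
```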
#154
After running prepro_tinyshakespeare_char.py, you can build and run train_gpt2.c, and you should see the following results:
[GPT-2]
max_seq_len: 256
vocab_size: 66
num_layers: 6
num_heads: 6
channels: 384
num_parameters: 10771200
train dataset num_batches: 438
val dataset num_batches: 3946
num_activations: 12033792
val loss 6.971008
step 0: train loss 4.845659 (took 2141 ms)
step 1: train loss 4.769723 (took 2453 ms)
step 2: train loss 4.399278 (took 2422 ms)
step 3: train loss 3.885524 (took 2453 ms)
step 4: train loss 4.519618 (took 2438 ms)
step 5: train loss 4.488451 (took 2484 ms)
step 6: train loss 5.019192 (took 2437 ms)
step 7: train loss 4.146207 (took 2500 ms)
step 8: train loss 4.76461 (took 2500 ms)
step 9: train loss 4.560529 (took 2579 ms)
val loss 6.444646
step 10: train loss 3.978859 (took 3672 ms)
step 11: train loss 4.340317 (took 3531 ms)
step 12: train loss 4.434982 (took 3360 ms)
step 13: train loss 3.499396 (took 3406 ms)
step 14: train loss 3.611105 (took 3547 ms)
step 15: train loss 4.427991 (took 3500 ms)
step 16: train loss 3.609078 (took 3453 ms)
step 17: train loss 3.486021 (took 3359 ms)
step 18: train loss 3.66163 (took 3438 ms)
step 19: train loss 3.688834 (took 3422 ms)
val loss 5.842363
generating:
---
GREMIO:
Ay, marry, sir, now it begins to work.
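As a sanity check, the num_parameters value reported above can be reproduced from the printed config using the standard GPT-2 parameter count: token and position embeddings, 12C² + 13C per transformer block (attention, MLP, and the two per-block LayerNorms), plus the final LayerNorm; the lm_head shares the token embedding, so it adds nothing.

```python
# Recompute num_parameters from the config printed by the C training run above
max_seq_len, vocab_size, num_layers, num_heads, channels = 256, 66, 6, 6, 384

C = channels
embeddings = vocab_size * C + max_seq_len * C  # wte + wpe
per_block = 12 * C * C + 13 * C                # attn + MLP weights/biases + 2 LayerNorms
final_ln = 2 * C                               # lnf weight and bias
total = embeddings + num_layers * per_block + final_ln
print(total)  # 10771200, matching num_parameters in the log
```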