nanoGPT
The input Shakespeare file does not contain the entire Shakespeare
The Shakespeare input file at https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt has exactly 40000 lines and does not contain the complete works of Shakespeare. For example, it does not contain the string "Hamlet".
What exactly is in that file and why is it incomplete? Is it on purpose?
Thanks for the great work, btw, your NanoGPT YouTube video is amazing.
Well, it's the TinyShakespeare dataset: https://www.tensorflow.org/datasets/catalog/tiny_shakespeare
It is labeled as 40000 lines of Shakespeare, so yes, it's on purpose.
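For anyone who wants to check the line count and the missing "Hamlet" string themselves, here is a minimal sketch. The helper names `fetch`, `count_lines`, and `summarize` are made up for illustration and are not part of nanoGPT; the sketch assumes the file is UTF-8 text with `\n` line endings.

```python
import urllib.request

# URL quoted in the original question.
TINY_URL = (
    "https://raw.githubusercontent.com/karpathy/char-rnn/"
    "master/data/tinyshakespeare/input.txt"
)


def fetch(url: str) -> str:
    """Download a text file and decode it as UTF-8 (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8-sig")


def count_lines(text: str) -> int:
    """Count newline-terminated lines, plus a final unterminated line if any."""
    n = text.count("\n")
    if text and not text.endswith("\n"):
        n += 1
    return n


def summarize(text: str, needle: str = "Hamlet") -> tuple[int, bool]:
    """Return (number of lines, whether `needle` occurs anywhere in the text)."""
    return count_lines(text), needle in text
```

Calling `summarize(fetch(TINY_URL))` returns the line count and whether "Hamlet" appears, which can be compared against the numbers quoted above.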
Yeah apparently it isn't all of Shakespeare. Silly but I wasn't aware of it, or more likely I forgot that by now :D. Would love the full works of Shakespeare though...
@karpathy Project Gutenberg seems to have the entire Shakespeare (plays + sonnets + poems) in one TXT file available here:
https://www.gutenberg.org/cache/epub/100/pg100.txt
It has 182k lines.
Removing the publishing notes at the beginning and end leaves lines 83--181654.
The plays alone are lines 2860--177314.
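A rough sketch of cutting those ranges out of the Gutenberg file, in case it saves someone time. It assumes the line numbers above are 1-indexed and inclusive, and that the file has not changed since they were measured (Project Gutenberg occasionally revises its texts, so the offsets may drift). The names `slice_lines`, `fetch`, and the output filename are hypothetical.

```python
import urllib.request

# URL and line ranges quoted in this thread; the ranges are assumed to be
# 1-indexed and inclusive, and may drift if the file is updated upstream.
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/100/pg100.txt"
WITHOUT_NOTES = (83, 181654)
PLAYS_ONLY = (2860, 177314)


def fetch(url: str) -> str:
    """Download a text file and decode it as UTF-8 (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8-sig")


def slice_lines(text: str, first: int, last: int) -> str:
    """Return lines first..last (1-indexed, inclusive), newline-terminated."""
    lines = text.split("\n")
    selected = lines[first - 1:last]
    return "\n".join(selected) + "\n"


def write_subset(path: str, text: str, line_range: tuple[int, int]) -> None:
    """Write the selected line range of `text` to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(slice_lines(text, *line_range))
```

For example, `write_subset("input.txt", fetch(GUTENBERG_URL), PLAYS_ONLY)` would write just the plays to a local file. It is worth eyeballing the first and last few lines of the result to confirm the offsets still match.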