
The input Shakespeare file does not contain the entire Shakespeare

Open dkobak opened this issue 1 year ago • 3 comments

The input file with the Shakespeare text, https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt, has exactly 40000 lines and does not contain all of Shakespeare; for example, it does not contain the string "Hamlet".
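Both claims above (the line count and the missing string) can be checked directly. The following is a minimal sketch, not part of the original post; the helper names `fetch_text` and `summarize` are hypothetical and chosen only for illustration.

```python
import urllib.request

# URL quoted in the post above.
URL = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
       "master/data/tinyshakespeare/input.txt")


def fetch_text(url: str) -> str:
    """Download a text file and decode it as UTF-8 (hypothetical helper)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def summarize(text: str, needle: str = "Hamlet") -> tuple[int, bool]:
    """Return (number of lines, whether `needle` occurs anywhere in `text`)."""
    return len(text.splitlines()), needle in text


# Usage (requires network access):
#   n_lines, has_hamlet = summarize(fetch_text(URL))
#   print(n_lines, has_hamlet)
```

Note that `needle in text` is a case-sensitive substring test, so it would miss an all-caps "HAMLET".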

What exactly is in that file and why is it incomplete? Is it on purpose?

Thanks for the great work, btw, your NanoGPT YouTube video is amazing.

dkobak, Mar 22 '23 18:03

Well, it's the TinyShakespeare dataset: https://www.tensorflow.org/datasets/catalog/tiny_shakespeare

It is described there as 40000 lines of Shakespeare, so yes, on purpose.

Coriana, Mar 26 '23 01:03

Yeah apparently it isn't all of Shakespeare. Silly but I wasn't aware of it, or more likely I forgot that by now :D. Would love the full works of Shakespeare though...

karpathy, Mar 26 '23 23:03

@karpathy Project Gutenberg seems to have the entire Shakespeare (plays + sonnets + poems) in one TXT file available here:

https://www.gutenberg.org/cache/epub/100/pg100.txt

It has 182k lines.

Stripping the publishing notes at the beginning and end leaves lines 83--181654.

The plays alone are lines 2860--177314.

dkobak, Mar 27 '08:03
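The line ranges quoted in the last comment could be applied roughly as follows. This is only a sketch under stated assumptions: the ranges are taken to be 1-indexed and inclusive, they refer to the Project Gutenberg file as it was when the comment was written (the file may have changed since), and `fetch_text` and `slice_lines` are hypothetical helper names, not part of nanoGPT.

```python
import urllib.request

# URL quoted in the comment above.
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/100/pg100.txt"


def fetch_text(url: str) -> str:
    """Download a text file; 'utf-8-sig' drops a leading BOM if one is present."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8-sig")


def slice_lines(text: str, first: int, last: int) -> str:
    """Keep lines first..last of `text` (assumed 1-indexed, inclusive)."""
    lines = text.split("\n")  # split on "\n" only, as a line counter would
    return "\n".join(lines[first - 1:last])


# Usage (requires network access; ranges as quoted in the thread):
#   full = fetch_text(GUTENBERG_URL)
#   no_notes = slice_lines(full, 83, 181654)     # without publishing notes
#   plays_only = slice_lines(full, 2860, 177314)  # plays only
#   with open("input.txt", "w", encoding="utf-8") as f:
#       f.write(plays_only)
```

Before training on the result, it is worth printing the first and last few lines of each slice to confirm the boundaries still fall where the comment says they do.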