
Add two popular datasets for character-level LM

Open entron opened this issue 1 year ago • 4 comments

Added data preparation and example training config scripts for two popular datasets: text8 and enwik8.
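For context, text8 and enwik8 are both 100M-character corpora that are conventionally split 90M/5M/5M into train/val/test. A minimal sketch of that split is below; whether the PR's prepare scripts use exactly this convention is an assumption, and the function name is invented for illustration.

```python
def split_corpus(data, n_train=90_000_000, n_val=5_000_000):
    """Return (train, val, test) slices of a corpus.

    Defaults match the customary 90M/5M/5M split used for
    text8/enwik8 character-level LM benchmarks.
    """
    return (
        data[:n_train],
        data[n_train:n_train + n_val],
        data[n_train + n_val:],
    )
```

The same function works on a small toy string by overriding the split sizes, e.g. `split_corpus("abcdefghij", n_train=6, n_val=2)`.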

entron avatar Apr 25 '23 07:04 entron

Also tried out feeding the outputs of each layer back into itself multiple times in the 2nd commit. For the shakespeare_char dataset, this actually gives a better val loss of 1.4543 with only 1 layer and 1.8M parameters. For bigger datasets such as text8, this also gives better results when the number of parameters is the same. Haven't tested on GPT-2 yet. The 2nd commit may not be so relevant though.
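The idea can be sketched as a toy: run the input through the same block's weights several times, so effective depth grows while the parameter count stays that of a single layer. This is a minimal numpy illustration with an invented residual tanh block, not nanoGPT's actual transformer block or the 2nd commit's code.

```python
import numpy as np

class ReusedBlock:
    """One block whose weights are applied n_updates times in sequence."""

    def __init__(self, n_embd=8, n_updates=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_embd, n_embd))
        self.b = np.zeros(n_embd)
        self.n_updates = n_updates  # how many times the single block is reused

    def forward(self, x):
        for _ in range(self.n_updates):
            # residual update; note the SAME W and b on every pass,
            # so parameters stay at one layer's worth regardless of n_updates
            x = x + np.tanh(x @ self.W + self.b)
        return x
```

Raising `n_updates` deepens the computation without adding any parameters, which is why the comparison at equal parameter counts is the interesting one.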

entron avatar Apr 25 '23 11:04 entron

That's nice, but I'd prefer we keep n_layer_update separate

karpathy avatar Apr 26 '23 03:04 karpathy

I have removed the 2nd commit.

entron avatar Apr 26 '23 05:04 entron

Maybe it's about time to have a separate .py file with the shared logic? All the prepare.py files for shakespeare and these two new datasets basically do the same thing. I understand that it's sometimes better to have some code duplication for the sake of simplicity and ease of understanding, but this is not the case here (in my opinion).

I am open to hearing why I am wrong (again 😄).

Andrei-Aksionov avatar Apr 27 '23 11:04 Andrei-Aksionov