Results 3 issues of Shehper

Hi! The batch size of nanoGPT is batch_size*gradient_accumulation_steps = 12*40 = 480. The batch size mentioned in the GPT-2 paper is 512. May I ask why nanoGPT was trained with...
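
As a quick check of the arithmetic in this issue, the sketch below recomputes the effective batch size from the two values quoted above and compares it with the 512 cited from the GPT-2 paper. The `block_size = 1024` context length is an assumption not stated in the issue; it is used only to express the same quantities in tokens.

```python
# Values quoted in the issue text.
batch_size = 12                    # sequences per forward/backward pass
gradient_accumulation_steps = 40   # passes accumulated per optimizer step

# Effective batch size: sequences contributing to one optimizer step.
effective_batch_size = batch_size * gradient_accumulation_steps
gpt2_paper_batch_size = 512        # figure cited in the issue

# ASSUMPTION (not in the issue): a context length of 1024 tokens.
block_size = 1024
tokens_per_step = effective_batch_size * block_size

print(effective_batch_size)                          # 480
print(gpt2_paper_batch_size - effective_batch_size)  # 32
print(tokens_per_step)                               # 491520
```

Under that assumption, the 32-sequence gap corresponds to 32 * 1024 = 32768 fewer tokens per optimizer step than a batch of 512 sequences.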

The code, as written, does not create equally distributed classes.

While running inference on my Mac with macOS 13.1, I received the following error:

```
RuntimeError: MPS does not support cumsum_out_mps op with int64 input. Support has been added...
```