massive-activations
Training only on 2B tokens (openwebtext)
Hi! Interesting work on the role of explicit bias!
I was wondering what training settings gave you an eval PPL of ~3.04. The paper mentions that 50K iterations are required to train the GPT-2 model on 2B tokens. What were the batch_size_per_device and block_size for that run? And did you train from scratch or fine-tune the pre-trained model (trained on 300B tokens)?
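For context, here is the rough token-budget arithmetic I'm working from (a minimal sketch with nanoGPT-style names; the per-device batch size, gradient-accumulation steps, and device count below are my guesses, not values from the paper):

```python
# Hypothetical settings -- only iterations (50K) and the ~2B token budget
# come from the paper; the rest are assumptions for illustration.
batch_size_per_device = 8   # sequences per GPU per micro-step (guess)
block_size = 1024           # GPT-2 context length
grad_accum_steps = 5        # gradient accumulation steps (guess)
num_devices = 1             # number of GPUs (guess)
iterations = 50_000         # from the paper

# Tokens processed per optimizer step and over the whole run.
tokens_per_iter = batch_size_per_device * block_size * grad_accum_steps * num_devices
total_tokens = tokens_per_iter * iterations
print(f"tokens/iter = {tokens_per_iter:,}, total = {total_tokens:,}")
# ~2B total only if tokens/iter is around 40K -- hence my question about
# the exact batch_size_per_device and block_size you used.
```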
Thanks!