massive-activations
Training only on 2B tokens (openwebtext)
Hi! Interesting work on the role of explicit bias!
I was wondering what training settings gave you an eval PPL of ~3.04. The paper mentions that 50K iterations are required to train the GPT-2 model on 2B tokens. What were the batch_size_per_device and block_size for that run? And did you train from scratch or fine-tune the pre-trained model (trained on 300B tokens)?
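For context, here is the rough token-budget arithmetic I'm working from (a minimal sketch with nanoGPT-style names; the per-device batch size, gradient-accumulation steps, and device count below are my guesses, not values from the paper):

```python
# Hypothetical settings -- only iterations (50K) and the ~2B token budget
# come from the paper; the rest are assumptions for illustration.
batch_size_per_device = 8   # sequences per GPU per micro-step (guess)
block_size = 1024           # GPT-2 context length
grad_accum_steps = 5        # gradient accumulation steps (guess)
num_devices = 1             # number of GPUs (guess)
iterations = 50_000         # from the paper

# Tokens processed per optimizer step and over the whole run.
tokens_per_iter = batch_size_per_device * block_size * grad_accum_steps * num_devices
total_tokens = tokens_per_iter * iterations
print(f"tokens/iter = {tokens_per_iter:,}, total = {total_tokens:,}")
# ~2B total only if tokens/iter is around 40K -- hence my question about
# the exact batch_size_per_device and block_size you used.
```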
Thanks!