datablations
Validation loss vs model size per step
Hi @Muennighoff, great paper, with very impressive and detailed work. Thanks for releasing the data! I am wondering about a small discrepancy between your results and the scaling rules. I replotted the data from Figure 15 for 1 epoch, with all 3 models on one plot:
You can see in the Scaling Rules image that models with more parameters converge faster and reach a better loss. In your experiments, however, the 9B-parameter model seems to behave differently. What are your thoughts on this? Thanks!