A question about the conclusion of this paper
"Scaling Data-Constrained Language Models" is a very nice paper, and I learn a lot from this paper.
However, I have a question about this paper:
In the abstract and Figure 1, it recommends we should train 4 epochs.
But Figure 3 shows that we should choose 59 epochs.
So my question is why the optimal epoch is not 4 epochs in Figure 3.
Thanks in advance.
"Scaling Data-Constrained Language Models" is a very nice paper, and I learn a lot from this paper.
However, I have a question about this paper:
In the abstract and Figure 1, it recommends we should train 4 epochs.
But Figure 3 shows that we should choose 59 epochs.
So my question is why the optimal epoch is not 4 epochs in Figure 3.
Thanks in advance.
This is because of immense diminishing returns. While you can get a better loss by training for more than 4 epochs, the returns diminish sharply (Figure 5 / attached). At 59 epochs, you're spending a lot of compute to get a tiny extra reduction in loss.
Meanwhile, at 4 epochs the returns from repeated data are still very close to what you would get from new data, so your compute is well spent.
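To put rough numbers on that intuition, here's a minimal sketch assuming an exponential-decay form for the value of repeated tokens. The `effective_data` helper, the decay constant `r_star`, and the token counts are illustrative assumptions for this example, not the fitted values from the paper:

```python
import math

def effective_data(unique_tokens, epochs, r_star=15.0):
    """Value-adjusted amount of data after repeating `unique_tokens` for `epochs` epochs.

    Assumes an exponential-decay form for the value of repeated tokens;
    r_star (the decay constant) is an illustrative number, not the paper's fit.
    """
    repetitions = epochs - 1  # the first pass over the data counts as fresh
    return unique_tokens * (1 + r_star * (1 - math.exp(-repetitions / r_star)))

unique = 100e9  # 100B unique tokens (made-up budget for illustration)
for ep in (1, 4, 59):
    seen = ep * unique
    eff = effective_data(unique, ep)
    print(f"{ep:>2} epochs: {seen / 1e9:6.0f}B tokens seen, "
          f"~{eff / 1e9:5.0f}B effective ({eff / seen:.0%} of fresh-data value)")
```

With these made-up constants, tokens seen at 4 epochs are still worth roughly 90%+ of fresh data, while at 59 epochs the average token is worth only a small fraction of it, which is the diminishing-returns picture above.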
Lmk if it's unclear!