
A question about the conclusion of this paper

Open OleNet opened this issue 2 years ago • 1 comment

"Scaling Data-Constrained Language Models" is a very nice paper, and I learn a lot from this paper.

However, I have a question about this paper:

In the abstract and Figure 1, the paper recommends training for 4 epochs.

But Figure 3 shows that 59 epochs is optimal.

So my question is: why is the optimal number of epochs in Figure 3 not 4?

Thanks in advance.

OleNet avatar May 31 '23 13:05 OleNet

"Scaling Data-Constrained Language Models" is a very nice paper, and I learn a lot from this paper.

However, I have a question about this paper:

In the abstract and Figure 1, it recommends we should train 4 epochs.

But Figure 3 shows that we should choose 59 epochs.

So my question is why the optimal epoch is not 4 epochs in Figure 3.

Thanks in advance.

This is because of diminishing returns. While you can still get a better loss by training for more than 4 epochs, the returns diminish sharply (Figure 5, attached). At 59 epochs you're spending a lot of extra compute for only a tiny further reduction in loss.

Meanwhile, at 4 epochs the returns from repeated data are still very close to what you would get from new data, so your compute is well spent.
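Here's a minimal numeric sketch of that intuition, assuming the paper's idea that the value of repeated tokens decays roughly exponentially with the number of repetitions (the decay constant `R_STAR` below is an illustrative guess, not the paper's fitted value):

```python
import math

# Hypothetical decay constant for how quickly repeated tokens lose value;
# the paper fits a real constant, this one is just for illustration.
R_STAR = 15.0

def effective_data(unique_tokens: float, epochs: int) -> float:
    """Fresh-data-equivalent tokens after `epochs` passes over
    `unique_tokens`, assuming the value of repeats decays exponentially:
    D_eff = U * (1 + R_STAR * (1 - exp(-R / R_STAR))), with R = epochs - 1."""
    repeats = epochs - 1
    return unique_tokens * (1 + R_STAR * (1 - math.exp(-repeats / R_STAR)))

U = 1.0  # one "unit" of unique data
for e in (1, 2, 4, 8, 16, 32, 59):
    eff = effective_data(U, e)
    print(f"{e:>3} epochs -> {eff:5.2f} effective units "
          f"({eff / e:.0%} of the compute spent)")
```

With these assumed numbers, 4 epochs still converts ~93% of the compute into effective data, while 59 epochs converts only ~27%, which is the "lots of compute for a tiny loss reduction" regime above.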

Lmk if it's unclear!

[Attached screenshot: Figure 5 from the paper, showing the diminishing returns of repeated epochs]

Muennighoff avatar May 31 '23 18:05 Muennighoff