
Pretraining Throughput — Unexpected Drop in Speed

Open ebrarkiziloglu opened this issue 8 months ago • 2 comments

Hello,

We are currently pretraining the ModernBERT model using the configuration provided in the pretraining_documentation branch, and we're monitoring throughput via Weights & Biases (see the graphs below).

We computed the training speed from the number of steps logged at various timestamps (e.g., at 30, 60, 90, and 120 minutes; a minimal sketch of the calculation follows the list below), and we observed a notable drop in throughput over time. However, the behavior is inconsistent across the wandb metrics:

  • The first four metrics (tokens/sec, samples/sec, etc.) show fluctuations but no clear downward trend.
  • The last two metrics (batches_per_sec and device/batches_per_sec) show a consistent decline throughout training.
  • Also, the speed we observe does not match the statistics reported in your paper.
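
For reference, here is roughly how we computed the speed from the step counts (a minimal sketch; the step values below are placeholders, not our actual numbers):

```python
# Minimal sketch of our speed calculation; step counts are placeholders.
# We read the logged global step at fixed wall-clock offsets and difference
# consecutive readings to get an average steps/sec for each interval.
step_at_minute = {30: 1200, 60: 2300, 90: 3250, 120: 4150}

minutes = sorted(step_at_minute)
for t0, t1 in zip(minutes, minutes[1:]):
    steps = step_at_minute[t1] - step_at_minute[t0]
    print(f"{t0}-{t1} min: {steps / ((t1 - t0) * 60):.2f} steps/sec")
```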

Please refer to the wandb screenshot below for the graphs.

[Screenshot: wandb graphs of the six throughput metrics]

Questions:

  1. What metric(s) should we use to compute the total number of training tokens, as reported in the paper? (We will also use 8×H100 SXM GPUs in a DDP setting; a sketch of our current estimate follows this list.)
  2. Could you clarify the specific meanings of these six throughput metrics?
  3. Is it expected to see a continuous drop in batches_per_sec over time, while token and sample rates remain relatively stable?
  4. Did you observe similar behavior in your runs?
  5. What is the reason behind the decline in the last two subfigures? Are there known causes (e.g., memory fragmentation, optimizer overhead) that might explain it?
  6. Would it be possible for you to share your wandb logs with us?
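
For context, here is how we currently estimate the total token count from the wandb logs (a sketch; which throughput metric the rate samples should come from is exactly what question 1 asks):

```python
# Sketch of our current estimate: integrate a tokens/sec metric over
# wall-clock time. `history` holds (elapsed_seconds, tokens_per_sec)
# samples exported from wandb; which metric to use is what we are asking.
def total_tokens(history):
    total = 0.0
    for (t0, rate), (t1, _) in zip(history, history[1:]):
        total += rate * (t1 - t0)  # left Riemann sum over the logged rate
    return total

# Placeholder samples (seconds, tokens/sec), not real measurements:
print(f"{total_tokens([(0, 1.9e6), (600, 2.0e6), (1200, 2.0e6)]):.3e}")
```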

We would really appreciate your insights on whether this behavior is expected or if it's indicative of a bottleneck in our setup.

ebrarkiziloglu commented on Apr 30 '25

The image didn't upload, so I'm operating off of your description.

If you followed our paper settings with batch size warmup, then it's expected that batches per second will decrease as the batch size increases over the first 3-50 billion tokens. At the same time, tokens per second should increase slightly, depending on your hardware; the sketch below illustrates the relationship.
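
To make the arithmetic concrete: tokens/sec = batches/sec × global batch size × sequence length, so a growing batch size forces batches/sec down even while tokens/sec holds steady or climbs. A toy illustration (all numbers hypothetical, not from our runs; 1024 is an assumed pretraining sequence length):

```python
# Toy illustration of batch size warmup; every number is hypothetical.
# Since tokens/sec = batches/sec * global_batch_size * seq_len, batches/sec
# must fall as the batch grows, even if tokens/sec creeps up slightly.
seq_len = 1024  # assumed pretraining sequence length

# (global_batch_size, tokens_per_sec): throughput rises a little with
# batch size as GPU utilization improves.
schedule = [(96, 1.80e6), (192, 1.90e6), (384, 1.95e6), (768, 2.00e6)]

for global_batch, tok_per_sec in schedule:
    batches_per_sec = tok_per_sec / (global_batch * seq_len)
    print(f"batch {global_batch:4d}: {tok_per_sec:.2e} tokens/sec, "
          f"{batches_per_sec:5.2f} batches/sec")
```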

I don't have enough information to answer your other questions.

warner-benjamin commented on May 2 '25

Thank you for your response! I re-uploaded the image. Could you please take a look at our other questions?

ebrarkiziloglu commented on May 5 '25