
Questions about logged throughput metrics

Proyag opened this issue 7 months ago • 4 comments

I'm doing some AutoModel pre-training to see how throughput scales from 4 GPUs to 8 GPUs, and the logged throughput metrics have confused me, so I'm hoping to get some help interpreting them.

  1. In the log snippets below, the it/s is lower and the train_step_timing is higher for 8 GPUs than for 4 GPUs, which makes it look like 8 GPUs is actually slower. Does this mean one of the following?
     a. These metrics are reported per device and slow down because of extra communication overhead, or
     b. the total throughput is actually lower for 8 GPUs?

  2. In either case, how is it possible that the it/s is lower for 8 GPUs but the tps is higher? Given that I haven't changed anything else, shouldn't these correlate directly?

  3. Why is each micro-batch update logged twice, and why are the two lines inconsistent? I have a micro-batch size (mbs) of 1 and a global batch size (gbs) of 32. The global_step is shown as 0 on the first line but 1 on the second, yet both lines are supposed to be for the same iteration (33).

4 GPUs:

olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 33/2383206 [01:00<1217:48:45,  0.54it/s, v_num=mu23, global_step=0.000, train_step_timing in s=2.200, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 33/2383206 [01:01<1230:48:39,  0.54it/s, v_num=mu23, global_step=1.000, train_step_timing in s=1.200, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 34/2383206 [01:02<1217:50:52,  0.54it/s, v_num=mu23, global_step=1.000, train_step_timing in s=1.200, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 34/2383206 [01:03<1230:46:41,  0.54it/s, v_num=mu23, global_step=1.000, train_step_timing in s=1.190, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:

8 GPUs:

olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 33/2383206 [01:31<1827:07:46,  0.36it/s, v_num=936u, global_step=0.000, train_step_timing in s=3.270, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 33/2383206 [01:32<1847:12:03,  0.36it/s, v_num=936u, global_step=1.000, train_step_timing in s=1.760, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 34/2383206 [01:33<1827:09:43,  0.36it/s, v_num=936u, global_step=1.000, train_step_timing in s=1.760, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 34/2383206 [01:34<1846:43:03,  0.36it/s, v_num=936u, global_step=1.000, train_step_timing in s=1.760, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:

I think I'm misunderstanding which metrics are scaled by the number of devices and which are not, and would appreciate some help. To reiterate: nothing changed between these two runs except the number of GPUs.
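
For what it's worth, here's the back-of-the-envelope check I tried with the numbers above. It assumes that it/s counts micro-batch iterations on rank 0 and that tps is aggregated across ranks, which is exactly the part I'm not sure about:

# Rough sanity check, based only on my (possibly wrong) reading of the metrics:
# if it/s counts micro-batch iterations and each iteration processes
# mbs * seq_length tokens on each data-parallel rank, the aggregate token
# throughput would be roughly it/s * mbs * seq_length * num_gpus.
def approx_aggregate_tps(it_per_s: float, mbs: int, seq_length: int, num_gpus: int) -> float:
    return it_per_s * mbs * seq_length * num_gpus

print(approx_aggregate_tps(0.54, 1, 4096, 4))  # ~8847, close to the logged tps=8967 on 4 GPUs
print(approx_aggregate_tps(0.36, 1, 4096, 8))  # ~11796, close to the logged tps=11936 on 8 GPUs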

Happy to share the config/script/full logs if needed, but hoping we don't need that just to understand the logs.

Thanks in advance!

Proyag avatar Jun 12 '25 14:06 Proyag

Hi, do I understand correctly that the configs for 4 GPUs and 8 GPUs are the same?

Also, which callbacks do you use to log train_step_timing in s and tps? Or did you just use the default recipe without adding any callbacks?

dimapihtar avatar Jun 12 '25 18:06 dimapihtar

Except setting the number of GPUs where relevant, the configs are exactly the same, yes.

It's adapted a bit from some default recipes, but I didn't add any callbacks. Let me get back to you with a minimal version.

Proyag avatar Jun 13 '25 09:06 Proyag

I adapted it from this recipe, but I'm running it directly as a script in the nvcr.io/nvidia/nemo:dev container. Here's a version of the script that has the same logging issues described above.

There is a TimingCallback, but I'm not sure whether I need to modify it somehow.

# Adapted from nemo.collections.llm.recipes.hf_auto_model_for_causal_lm.pretrain_recipe

import os
from typing import Optional

import lightning.pytorch as pl
import nemo_run as run
from lightning.pytorch.callbacks.callback import Callback

from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.llm.api import pretrain
from nemo.collections.llm.gpt.model.hf_auto_model_for_causal_lm import HFAutoModelForCausalLM
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
from nemo.collections.llm.recipes.log.default import default_log, default_resume, tensorboard_logger
from nemo.utils.exp_manager import TimingCallback

PRETRAINED_MODEL = "/pathto/hf/model/checkpoint"
DATA_DIR = "/path/to/nemo/tokenized/dataset"
OUTPUT_DIR = "/workspace/models/cpt"

hyperparameters = {
    "num_nodes": 1,
    "num_gpus_per_node": 8,
    "gbs": 32,
    "mbs": 1,
    "seq_length": 4096,
    "max_steps": 10,
    "lr": 1e-6,
    "run_name": "cpt_test",
}

def model(
    model_name,
    load_pretrained_weights,
    trust_remote_code=False,
    attn_implementation="sdpa"
) -> run.Config[pl.LightningModule]:

    return run.Config(
        HFAutoModelForCausalLM,
        model_name=model_name,
        load_pretrained_weights=load_pretrained_weights,
        trust_remote_code=trust_remote_code,
        attn_implementation=attn_implementation
    )


def trainer(
    num_nodes: int = 1,
    num_gpus_per_node: int = 8,
    callbacks: Optional[list[run.Config[Callback]]] = None,
    gradient_clip_val: float = 1.0,
) -> run.Config[nl.Trainer]:

    strategy = run.Config(
        nl.FSDP2Strategy,
        data_parallel_size=num_gpus_per_node * num_nodes,
        tensor_parallel_size=1,
        context_parallel_size=1
    )

    trainer = run.Config(
        nl.Trainer,
        num_nodes=num_nodes,
        devices=num_gpus_per_node,
        max_epochs=1,
        max_steps=-1, # No limit
        accelerator='gpu',
        strategy=strategy,
        log_every_n_steps=1,
        num_sanity_val_steps=0,
        accumulate_grad_batches=hyperparameters["gbs"] // hyperparameters["mbs"],
        callbacks=callbacks,
        gradient_clip_val=gradient_clip_val,
        use_distributed_sampler=False,
        precision="bf16-mixed"
    )

    return trainer


def pretrain_recipe(
    dir: Optional[str] = None,
    name: str = "default",
    num_nodes: int = 1,
    num_gpus_per_node: int = 8,
    fn=pretrain,
    model_name: str = ''
) -> run.Partial:

    return run.Partial(
        fn,
        model=model(model_name, load_pretrained_weights=True),
        trainer=trainer(
            num_nodes=num_nodes,
            num_gpus_per_node=num_gpus_per_node,
            callbacks=[run.Config(TimingCallback)]
        ),
        data=dataset(
            mbs=hyperparameters["mbs"],
            gbs=hyperparameters["gbs"],  # gbs is unused with AutoModel afaict
            seq_length=hyperparameters["seq_length"]
        ),
        log=default_log(
            dir=dir, name=name,
            tensorboard_logger=tensorboard_logger(name=name)
        ),
        optim=llm.adam.pytorch_adam_with_flat_lr(lr=hyperparameters['lr']),
        resume=default_resume(),
    )


def dataset(
    gbs: int = 512,
    mbs: int = 1,
    seq_length: int = 4096,
) -> run.Config[pl.LightningDataModule]:

    return run.Config(
        llm.PreTrainingDataModule,
        paths={
            "train": [os.path.join(DATA_DIR, "train_text_document")],
            "validation": [os.path.join(DATA_DIR, "validation_text_document")],
            "test": [os.path.join(DATA_DIR, "validation_text_document")]  # Couldn't leave it blank
        },
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
        tokenizer=run.Config(AutoTokenizer, pretrained_model_name=PRETRAINED_MODEL),
        num_workers=16,
        num_dataset_builder_threads=16,
    )


def configure_recipe(nodes: int = 1, gpus_per_node: int = 4, max_steps: int = -1):
    recipe = pretrain_recipe(
        dir=OUTPUT_DIR, # Path to store checkpoints
        model_name=PRETRAINED_MODEL,
        name=hyperparameters["run_name"],
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    recipe.trainer.val_check_interval = 5120
    recipe.trainer.max_steps = max_steps

    return recipe

def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

def run_pretraining():
    recipe = configure_recipe(
        nodes=hyperparameters["num_nodes"],
        gpus_per_node=hyperparameters["num_gpus_per_node"],
        max_steps=hyperparameters["max_steps"]
    )
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    run.run(recipe, executor=executor, name="cpt")

# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_pretraining()

Proyag avatar Jun 13 '25 11:06 Proyag

@Proyag thanks for the script. Which HF model are you using?

dimapihtar avatar Jun 13 '25 14:06 dimapihtar

I was using OLMo2.

Proyag avatar Jun 16 '25 09:06 Proyag

From the logs you shared, it looks like the progress bar (it/s) and train_step_timing in s are reported from GPU 0 only.

I also ran your script and can reproduce this; the logged values come from a single GPU.
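
If you want a timing that is averaged across ranks instead of just rank 0's value, something like the callback below should work. This is only a minimal sketch from my side (not the built-in TimingCallback); it relies on Lightning's sync_dist reduction when logging:

import time

from lightning.pytorch.callbacks import Callback


class RankAveragedTiming(Callback):
    """Rough sketch: time each training batch on every rank and log the mean across ranks."""

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        self._t0 = time.perf_counter()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        step_time = time.perf_counter() - self._t0
        # sync_dist=True reduces the value (mean by default) across all ranks before logging.
        pl_module.log("step_time_all_ranks", step_time, prog_bar=True, sync_dist=True)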

dimapihtar avatar Jun 16 '25 11:06 dimapihtar

Ah ok, thanks for checking.

  • But when going from 4 to 8 GPUs, why does tps increase while it/s decreases, if both are reported per GPU?
  • Also, I set accumulate_grad_batches=hyperparameters["gbs"] // hyperparameters["mbs"], so am I getting the global batch size I specified, or is it effectively multiplied by the number of GPUs (see the sketch after this list)?
  • Could you explain the two apparently inconsistent lines logged per update?
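
To make the second bullet concrete, this is the arithmetic I'm unsure about. I'm only illustrating the two readings I can imagine, not claiming either is what actually happens:

# With mbs=1 and accumulate_grad_batches = gbs // mbs = 32 on 8 GPUs:
mbs = 1
accumulate_grad_batches = 32
num_gpus = 8

gbs_if_accumulation_counts_globally = mbs * accumulate_grad_batches             # 32, the gbs I intended
gbs_if_accumulation_is_per_rank     = mbs * accumulate_grad_batches * num_gpus  # 256, 8x what I intended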

Proyag avatar Jun 16 '25 13:06 Proyag

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 18 '25 02:07 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Jul 25 '25 02:07 github-actions[bot]