
Questions about logged throughput metrics

Proyag opened this issue 7 months ago • 4 comments

I'm doing some AutoModel pre-training to see how throughput scales from 4 GPUs to 8 GPUs, and the logged throughput metrics have confused me, so I'm hoping to get some help interpreting them.

  1. In the log snippets below, the it/s is lower and the train_step_timing is higher for 8 GPUs than for 4 GPUs, which makes it look like 8 GPUs is actually slower. Does this mean one of the following?
     a. These metrics are reported per device and slow down because of extra communication overhead, or
     b. the total throughput is actually lower for 8 GPUs?

  2. In either case, how is it possible that the it/s is lower for 8 GPUs but the tps is higher? Given that I haven't changed anything else, shouldn't these correlate directly?

  3. Why is each micro-batch update logged twice, and why are the two lines inconsistent? I have a micro-batch size (mbs) of 1 and a global batch size (gbs) of 32. The global_step is shown as 0 on the first line but 1 on the second, yet both lines are supposed to be for the same iteration (33).

4 GPUs:

olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 33/2383206 [01:00<1217:48:45,  0.54it/s, v_num=mu23, global_step=0.000, train_step_timing in s=2.200, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 33/2383206 [01:01<1230:48:39,  0.54it/s, v_num=mu23, global_step=1.000, train_step_timing in s=1.200, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 34/2383206 [01:02<1217:50:52,  0.54it/s, v_num=mu23, global_step=1.000, train_step_timing in s=1.200, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 34/2383206 [01:03<1230:46:41,  0.54it/s, v_num=mu23, global_step=1.000, train_step_timing in s=1.190, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:

8 GPUs:

olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 33/2383206 [01:31<1827:07:46,  0.36it/s, v_num=936u, global_step=0.000, train_step_timing in s=3.270, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 33/2383206 [01:32<1847:12:03,  0.36it/s, v_num=936u, global_step=1.000, train_step_timing in s=1.760, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 34/2383206 [01:33<1827:09:43,  0.36it/s, v_num=936u, global_step=1.000, train_step_timing in s=1.760, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0:   0%|          | 34/2383206 [01:34<1846:43:03,  0.36it/s, v_num=936u, global_step=1.000, train_step_timing in s=1.760, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:

I think I'm misunderstanding which metrics are scaled by the number of devices and which are not, and would appreciate some help. To reiterate: nothing changed between these two runs except the number of GPUs.
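
For what it's worth, here's the back-of-the-envelope check I tried with the numbers above. It assumes that it/s counts micro-batch iterations on rank 0 and that tps is aggregated across ranks, which is exactly the part I'm not sure about:

# Rough sanity check, based only on my (possibly wrong) reading of the metrics:
# if it/s counts micro-batch iterations and each iteration processes
# mbs * seq_length tokens on each data-parallel rank, the aggregate token
# throughput would be roughly it/s * mbs * seq_length * num_gpus.
def approx_aggregate_tps(it_per_s: float, mbs: int, seq_length: int, num_gpus: int) -> float:
    return it_per_s * mbs * seq_length * num_gpus

print(approx_aggregate_tps(0.54, 1, 4096, 4))  # ~8847, close to the logged tps=8967 on 4 GPUs
print(approx_aggregate_tps(0.36, 1, 4096, 8))  # ~11796, close to the logged tps=11936 on 8 GPUs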

Happy to share the config/script/full logs if needed, but hoping we don't need that just to understand the logs.

Thanks in advance!

Proyag avatar Jun 12 '25 14:06 Proyag

Hi, do I understand correctly that the configs for 4 GPUs and 8 GPUs are the same?

Also, which callbacks do you use to log train_step_timing in s and tps? Or did you just use the default recipe without adding any callbacks?

dimapihtar avatar Jun 12 '25 18:06 dimapihtar

Except setting the number of GPUs where relevant, the configs are exactly the same, yes.

It's adapted a bit from some default recipes, but I didn't add any callbacks. Let me get back to you with a minimal version.

Proyag avatar Jun 13 '25 09:06 Proyag

I adapted it from this recipe, but I'm running it directly as a script in the nvcr.io/nvidia/nemo:dev container. Here's a version of the script that has the same logging issues described above.

There is a TimingCallback, but I'm not sure whether I need to modify it somehow.

# Adapted from nemo.collections.llm.recipes.hf_auto_model_for_causal_lm.pretrain_recipe

import os
from typing import Optional

import lightning.pytorch as pl
import nemo_run as run
from lightning.pytorch.callbacks.callback import Callback

from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.llm.api import pretrain
from nemo.collections.llm.gpt.model.hf_auto_model_for_causal_lm import HFAutoModelForCausalLM
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
from nemo.collections.llm.recipes.log.default import default_log, default_resume, tensorboard_logger
from nemo.utils.exp_manager import TimingCallback

PRETRAINED_MODEL = "/pathto/hf/model/checkpoint"
DATA_DIR = "/path/to/nemo/tokenized/dataset"
OUTPUT_DIR = "/workspace/models/cpt"

hyperparameters = {
    "num_nodes": 1,
    "num_gpus_per_node": 8,
    "gbs": 32,
    "mbs": 1,
    "seq_length": 4096,
    "max_steps": 10,
    "lr": 1e-6,
    "run_name": "cpt_test",
}

def model(
    model_name,
    load_pretrained_weights,
    trust_remote_code=False,
    attn_implementation="sdpa"
) -> run.Config[pl.LightningModule]:

    return run.Config(
        HFAutoModelForCausalLM,
        model_name=model_name,
        load_pretrained_weights=load_pretrained_weights,
        trust_remote_code=trust_remote_code,
        attn_implementation=attn_implementation
    )


def trainer(
    num_nodes: int = 1,
    num_gpus_per_node: int = 8,
    callbacks: Optional[list[run.Config[Callback]]] = None,
    gradient_clip_val: float = 1.0,
) -> run.Config[nl.Trainer]:

    strategy = run.Config(
        nl.FSDP2Strategy,
        data_parallel_size=num_gpus_per_node * num_nodes,
        tensor_parallel_size=1,
        context_parallel_size=1
    )

    trainer = run.Config(
        nl.Trainer,
        num_nodes=num_nodes,
        devices=num_gpus_per_node,
        max_epochs=1,
        max_steps=-1, # No limit
        accelerator='gpu',
        strategy=strategy,
        log_every_n_steps=1,
        num_sanity_val_steps=0,
        accumulate_grad_batches=hyperparameters["gbs"] // hyperparameters["mbs"],
        callbacks=callbacks,
        gradient_clip_val=gradient_clip_val,
        use_distributed_sampler=False,
        precision="bf16-mixed"
    )

    return trainer


def pretrain_recipe(
    dir: Optional[str] = None,
    name: str = "default",
    num_nodes: int = 1,
    num_gpus_per_node: int = 8,
    fn=pretrain,
    model_name: str = ''
) -> run.Partial:

    return run.Partial(
        fn,
        model=model(model_name, load_pretrained_weights=True),
        trainer=trainer(
            num_nodes=num_nodes,
            num_gpus_per_node=num_gpus_per_node,
            callbacks=[run.Config(TimingCallback)]
        ),
        data=dataset(
            mbs=hyperparameters["mbs"],
            gbs=hyperparameters["gbs"],  # gbs is unused with AutoModel afaict
            seq_length=hyperparameters["seq_length"]
        ),
        log=default_log(
            dir=dir, name=name,
            tensorboard_logger=tensorboard_logger(name=name)
        ),
        optim=llm.adam.pytorch_adam_with_flat_lr(lr=hyperparameters['lr']),
        resume=default_resume(),
    )


def dataset(
    gbs: int = 512,
    mbs: int = 1,
    seq_length: int = 4096,
) -> run.Config[pl.LightningDataModule]:

    return run.Config(
        llm.PreTrainingDataModule,
        paths={
            "train": [os.path.join(DATA_DIR, "train_text_document")],
            "validation": [os.path.join(DATA_DIR, "validation_text_document")],
            "test": [os.path.join(DATA_DIR, "validation_text_document")]  # Couldn't leave it blank
        },
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
        tokenizer=run.Config(AutoTokenizer, pretrained_model_name=PRETRAINED_MODEL),
        num_workers=16,
        num_dataset_builder_threads=16,
    )


def configure_recipe(nodes: int = 1, gpus_per_node: int = 4, max_steps: int = -1):
    recipe = pretrain_recipe(
        dir=OUTPUT_DIR, # Path to store checkpoints
        model_name=PRETRAINED_MODEL,
        name=hyperparameters["run_name"],
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    recipe.trainer.val_check_interval = 5120
    recipe.trainer.max_steps = max_steps

    return recipe

def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

def run_pretraining():
    recipe = configure_recipe(
        nodes=hyperparameters["num_nodes"],
        gpus_per_node=hyperparameters["num_gpus_per_node"],
        max_steps=hyperparameters["max_steps"]
    )
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    run.run(recipe, executor=executor, name="cpt")

# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_pretraining()

Proyag avatar Jun 13 '25 11:06 Proyag

@Proyag thanks for the script. Which HF model are you using?

dimapihtar avatar Jun 13 '25 14:06 dimapihtar

I was using OLMo2.

Proyag avatar Jun 16 '25 09:06 Proyag

From the logs you shared, it looks like the progress bar (it/s) and train_step_timing in s are reported from GPU 0 only.

I also ran your script and can reproduce this; the logged values come from a single GPU.
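
If you want a timing that is averaged across ranks instead of just rank 0's value, something like the callback below should work. This is only a minimal sketch from my side (not the built-in TimingCallback); it relies on Lightning's sync_dist reduction when logging:

import time

from lightning.pytorch.callbacks import Callback


class RankAveragedTiming(Callback):
    """Rough sketch: time each training batch on every rank and log the mean across ranks."""

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        self._t0 = time.perf_counter()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        step_time = time.perf_counter() - self._t0
        # sync_dist=True reduces the value (mean by default) across all ranks before logging.
        pl_module.log("step_time_all_ranks", step_time, prog_bar=True, sync_dist=True)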

dimapihtar avatar Jun 16 '25 11:06 dimapihtar

Ah ok, thanks for checking.

  • But when going from 4 to 8 GPUs, why does tps increase while it/s decreases, if both are reported per GPU?
  • Also, I set accumulate_grad_batches=hyperparameters["gbs"] // hyperparameters["mbs"], so am I getting the global batch size I specified, or is it effectively multiplied by the number of GPUs (see the sketch after this list)?
  • Could you explain the two apparently inconsistent lines logged per update?
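
To make the second bullet concrete, this is the arithmetic I'm unsure about. I'm only illustrating the two readings I can imagine, not claiming either is what actually happens:

# With mbs=1 and accumulate_grad_batches = gbs // mbs = 32 on 8 GPUs:
mbs = 1
accumulate_grad_batches = 32
num_gpus = 8

gbs_if_accumulation_counts_globally = mbs * accumulate_grad_batches             # 32, the gbs I intended
gbs_if_accumulation_is_per_rank     = mbs * accumulate_grad_batches * num_gpus  # 256, 8x what I intended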

Proyag avatar Jun 16 '25 13:06 Proyag

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 18 '25 02:07 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Jul 25 '25 02:07 github-actions[bot]