Questions about logged throughput metrics
I'm trying to do some AutoModel pre-training and see how throughput scales from 4 GPUs to 8 GPUs, but the throughput metrics I'm seeing have confused me, so I'm hoping to get some help interpreting them.

- In the log snippets below, `it/s` is lower and `train_step_timing` is higher for 8 GPUs than for 4 GPUs, which makes it look like 8 GPUs is actually slower. Does this mean one of the following?
  a. These are reported per device and slow down because of extra communication overhead, or
  b. The total throughput is actually lower for 8 GPUs?
- In either case, how is it possible that `it/s` is lower for 8 GPUs but `tps` is higher? Given that I haven't changed anything else, shouldn't these correlate directly? (I've spelled out the arithmetic below the log snippets.)
- Why is each micro-batch update logged twice, and why are the two lines inconsistent? I have a micro-batch size (`mbs`) of 1 and a global batch size (`gbs`) of 32. `global_step` is shown as 0 on the first line but 1 on the second, yet both lines are supposed to be for the same update 33.
4 GPUs:

```
olmo2_cpt/0 [default0]:Epoch 0: 0%| | 33/2383206 [01:00<1217:48:45, 0.54it/s, v_num=mu23, global_step=0.000, train_step_timing in s=2.200, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0: 0%| | 33/2383206 [01:01<1230:48:39, 0.54it/s, v_num=mu23, global_step=1.000, train_step_timing in s=1.200, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0: 0%| | 34/2383206 [01:02<1217:50:52, 0.54it/s, v_num=mu23, global_step=1.000, train_step_timing in s=1.200, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0: 0%| | 34/2383206 [01:03<1230:46:41, 0.54it/s, v_num=mu23, global_step=1.000, train_step_timing in s=1.190, reduced_train_loss=2.390, tps=8967.0, lr=1e-6]olmo2_cpt/0 [default0]:
```
8 GPUs:

```
olmo2_cpt/0 [default0]:Epoch 0: 0%| | 33/2383206 [01:31<1827:07:46, 0.36it/s, v_num=936u, global_step=0.000, train_step_timing in s=3.270, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0: 0%| | 33/2383206 [01:32<1847:12:03, 0.36it/s, v_num=936u, global_step=1.000, train_step_timing in s=1.760, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0: 0%| | 34/2383206 [01:33<1827:09:43, 0.36it/s, v_num=936u, global_step=1.000, train_step_timing in s=1.760, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
olmo2_cpt/0 [default0]:Epoch 0: 0%| | 34/2383206 [01:34<1846:43:03, 0.36it/s, v_num=936u, global_step=1.000, train_step_timing in s=1.760, reduced_train_loss=2.390, tps=11936.0, lr=1e-6]olmo2_cpt/0 [default0]:
```
I think maybe I'm misunderstanding which metrics are scaled by the number of devices and which are not, but I would appreciate some help. To reiterate: nothing else changed between these two runs except the number of GPUs.
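To make the two readings from the first question concrete, here is the arithmetic I'm looking at. It only rearranges the numbers from the progress bars above and doesn't assume anything about what the logger actually reports:

```python
# Numbers copied from the progress bars above.
it_s = {4: 0.54, 8: 0.36}       # iterations/sec as shown
tps = {4: 8967.0, 8: 11936.0}   # tokens/sec as shown

# Reading (b): it/s is already an aggregate rate -> 8 GPUs is ~33% slower overall.
print(it_s[8] / it_s[4])              # ~0.67

# Reading (a): it/s is per device -> multiply by device count for an aggregate rate.
agg = {n: rate * n for n, rate in it_s.items()}
print(agg[4], agg[8])                 # 2.16 vs 2.88 iterations/sec in total
print(agg[8] / agg[4])                # ~1.33
print(tps[8] / tps[4])                # ~1.33 -- same ratio as the per-device reading of it/s
```

Under reading (a) both metrics scale by roughly the same ~1.33x factor going from 4 to 8 GPUs, which is part of why I suspect I'm just misreading which numbers are per device.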
Happy to share the config/script/full logs if needed, but hoping we don't need that just to understand the logs.
Thanks in advance!
Hi, do I understand correctly that the configs for 4 GPUs and 8 GPUs are the same?
Also, which callbacks do you use to log `train_step_timing in s` and `tps`? Or did you just use the default recipe without adding any callbacks?
Except setting the number of GPUs where relevant, the configs are exactly the same, yes.
It's adapted a bit from some default recipes, but I didn't add any callbacks. Let me get back to you with a minimal version.
I adapted it from this recipe, but I'm running it directly as a script in the `nvcr.io/nvidia/nemo:dev` container. Here's a version of the script that has the same logging issues described above.
There is a `TimingCallback`, but I'm not sure whether I need to modify it somehow?
```python
# Adapted from nemo.collections.llm.recipes.hf_auto_model_for_causal_lm.pretrain_recipe
import os
from typing import Optional

import lightning.pytorch as pl
import nemo_run as run
from lightning.pytorch.callbacks.callback import Callback
from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.llm.api import pretrain
from nemo.collections.llm.gpt.model.hf_auto_model_for_causal_lm import HFAutoModelForCausalLM
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
from nemo.collections.llm.recipes.log.default import default_log, default_resume, tensorboard_logger
from nemo.utils.exp_manager import TimingCallback

PRETRAINED_MODEL = "/pathto/hf/model/checkpoint"
DATA_DIR = "/path/to/nemo/tokenized/dataset"
OUTPUT_DIR = "/workspace/models/cpt"

hyperparameters = {
    "num_nodes": 1,
    "num_gpus_per_node": 8,
    "gbs": 32,
    "mbs": 1,
    "seq_length": 4096,
    "max_steps": 10,
    "lr": 1e-6,
    "run_name": "cpt_test",
}


def model(
    model_name,
    load_pretrained_weights,
    trust_remote_code=False,
    attn_implementation="sdpa",
) -> run.Config[pl.LightningModule]:
    return run.Config(
        HFAutoModelForCausalLM,
        model_name=model_name,
        load_pretrained_weights=load_pretrained_weights,
        trust_remote_code=trust_remote_code,
        attn_implementation=attn_implementation,
    )


def trainer(
    num_nodes: int = 1,
    num_gpus_per_node: int = 8,
    callbacks: Optional[list[run.Config[Callback]]] = None,
    gradient_clip_val: float = 1.0,
) -> run.Config[nl.Trainer]:
    strategy = run.Config(
        nl.FSDP2Strategy,
        data_parallel_size=num_gpus_per_node * num_nodes,
        tensor_parallel_size=1,
        context_parallel_size=1,
    )
    trainer = run.Config(
        nl.Trainer,
        num_nodes=num_nodes,
        devices=num_gpus_per_node,
        max_epochs=1,
        max_steps=-1,  # No limit
        accelerator='gpu',
        strategy=strategy,
        log_every_n_steps=1,
        num_sanity_val_steps=0,
        accumulate_grad_batches=hyperparameters["gbs"] // hyperparameters["mbs"],
        callbacks=callbacks,
        gradient_clip_val=gradient_clip_val,
        use_distributed_sampler=False,
        precision="bf16-mixed",
    )
    return trainer


def pretrain_recipe(
    dir: Optional[str] = None,
    name: str = "default",
    num_nodes: int = 1,
    num_gpus_per_node: int = 8,
    fn=pretrain,
    model_name: str = '',
) -> run.Partial:
    return run.Partial(
        fn,
        model=model(model_name, load_pretrained_weights=True),
        trainer=trainer(
            num_nodes=num_nodes,
            num_gpus_per_node=num_gpus_per_node,
            callbacks=[run.Config(TimingCallback)],
        ),
        data=dataset(
            mbs=hyperparameters["mbs"],
            gbs=hyperparameters["gbs"],  # gbs is unused with AutoModel afaict
            seq_length=hyperparameters["seq_length"],
        ),
        log=default_log(
            dir=dir, name=name,
            tensorboard_logger=tensorboard_logger(name=name),
        ),
        optim=llm.adam.pytorch_adam_with_flat_lr(lr=hyperparameters['lr']),
        resume=default_resume(),
    )


def dataset(
    gbs: int = 512,
    mbs: int = 1,
    seq_length: int = 4096,
) -> run.Config[pl.LightningDataModule]:
    return run.Config(
        llm.PreTrainingDataModule,
        paths={
            "train": [os.path.join(DATA_DIR, "train_text_document")],
            "validation": [os.path.join(DATA_DIR, "validation_text_document")],
            "test": [os.path.join(DATA_DIR, "validation_text_document")],  # Couldn't leave it blank
        },
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
        tokenizer=run.Config(AutoTokenizer, pretrained_model_name=PRETRAINED_MODEL),
        num_workers=16,
        num_dataset_builder_threads=16,
    )


def configure_recipe(nodes: int = 1, gpus_per_node: int = 4, max_steps: int = -1):
    recipe = pretrain_recipe(
        dir=OUTPUT_DIR,  # Path to store checkpoints
        model_name=PRETRAINED_MODEL,
        name=hyperparameters["run_name"],
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )
    recipe.trainer.val_check_interval = 5120
    recipe.trainer.max_steps = max_steps
    return recipe


def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)
    return executor


def run_pretraining():
    recipe = configure_recipe(
        nodes=hyperparameters["num_nodes"],
        gpus_per_node=hyperparameters["num_gpus_per_node"],
        max_steps=hyperparameters["max_steps"],
    )
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)
    run.run(recipe, executor=executor, name="cpt")


# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_pretraining()
```
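In case it's useful, this is the kind of cross-check callback I was thinking of adding on top of `TimingCallback`. It's only a rough sketch: it assumes each data-parallel rank processes its own micro-batch of `mbs * seq_length` tokens per progress-bar iteration (ignoring padding) and just multiplies the per-rank rate by `trainer.world_size`. The class name and metric names are mine, not anything from NeMo:

```python
# Rough sketch, not part of the run above. Assumes every data-parallel rank
# processes mbs * seq_length tokens per iteration (ignores padding).
import time

import lightning.pytorch as pl


class TokenThroughputCallback(pl.Callback):
    def __init__(self, seq_length: int, micro_batch_size: int):
        self.seq_length = seq_length
        self.micro_batch_size = micro_batch_size
        self._t0 = None

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        # Time each micro-batch on this rank.
        self._t0 = time.perf_counter()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        elapsed = time.perf_counter() - self._t0
        per_rank_tps = self.micro_batch_size * self.seq_length / elapsed
        # Aggregate estimate: per-rank rate times the number of data-parallel ranks.
        pl_module.log("xcheck_tps_per_rank", per_rank_tps, prog_bar=True)
        pl_module.log("xcheck_tps_total", per_rank_tps * trainer.world_size, prog_bar=True)
```

I'd wire it in via `callbacks=[run.Config(TimingCallback), run.Config(TokenThroughputCallback, seq_length=hyperparameters["seq_length"], micro_batch_size=hyperparameters["mbs"])]`, if that's a sensible thing to do here.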
@Proyag thanks for the scripts. Which HF model do you use?
I was using OLMo2
From the logs you shared, it looks like the progress bar and `train_step_timing in s` are reported from GPU 0 only.
I also managed to reproduce this with your script, and I'm seeing the logs from a single GPU as well.
Ah ok, thanks for checking.
- But when changing from 4 to 8 GPUs, why is `tps` increasing and `it/s` decreasing if they are per GPU?
- Also, here I have `accumulate_grad_batches=hyperparameters["gbs"] // hyperparameters["mbs"]`, so am I getting the specified global batch size, or is it effectively getting multiplied by the number of GPUs? (I've sketched the multiplication I'm unsure about after this list.)
- Could you explain the two apparently inconsistent lines logged per update?
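For the second point, this is the multiplication I'm unsure about. It assumes the standard Lightning semantics where each data-parallel rank draws its own micro-batch and `accumulate_grad_batches` micro-batches are accumulated per optimizer step; I don't know whether the NeMo AutoModel path re-normalizes anything by world size, which is exactly the question:

```python
mbs = 1
accumulate_grad_batches = 32  # gbs // mbs from the script above

for world_size in (4, 8):
    # If nothing re-normalizes, the effective global batch per optimizer step would be:
    effective_gbs = mbs * accumulate_grad_batches * world_size
    print(world_size, effective_gbs)  # 4 -> 128, 8 -> 256, i.e. not the gbs=32 I specified
```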
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.