Only rank0 log metrics to console
I use python -m torch.distributed.run xxx to launch the training processes. When reduce_global_loss is True, only rank0 reduces the global loss and the other ranks don't, so the loss values that the different ranks print to the console disagree, which is confusing.
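As I understand it, the reduce_global_loss path does something roughly like the minimal sketch below (not OLMo's exact code; reduce_local_loss is a hypothetical helper for illustration): after a reduce onto rank0, only rank0 is guaranteed to hold the globally averaged loss, while the other ranks are left with their local (or intermediate) values, which is what they then log.

```python
import torch
import torch.distributed as dist

def reduce_local_loss(local_loss: torch.Tensor) -> torch.Tensor:
    # Sum onto rank 0 only; after this call only rank 0 is guaranteed
    # to hold the reduced result, while the other ranks keep their
    # local (or intermediate) values.
    dist.reduce(local_loss, dst=0)
    if dist.get_rank() == 0:
        # Rank 0 logs the global mean loss.
        local_loss.div_(dist.get_world_size())
    # Non-zero ranks log whatever value they still hold.
    return local_loss
```

The excerpt below shows the resulting console output from several ranks at the same step: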
train/CrossEntropyLoss=0.0370
train/Perplexity=1.038
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=0.0380
train/Perplexity=1.039
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=0.0378
train/Perplexity=1.039
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=0.0383
train/Perplexity=1.039
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=2.421
train/Perplexity=11.25
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
Only rank0 should log metrics to console.
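One possible way to do this is to attach the console handler only on rank0. Below is a minimal sketch, assuming the standard logging module and the RANK environment variable that torch.distributed.run sets for each worker; setup_console_logging is a hypothetical helper, not OLMo's existing API, and the formatter just mirrors the format seen in the output above.

```python
import logging
import os

def setup_console_logging(logger_name: str = "olmo.train") -> logging.Logger:
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    # torch.distributed.run sets RANK for every worker process.
    rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        # Only rank 0 gets a console handler.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    else:
        # Non-zero ranks get a no-op handler so nothing reaches the console.
        logger.addHandler(logging.NullHandler())
    return logger
```

With a setup like this, the other ranks could still write their metrics to per-rank log files or to W&B if needed, but the console would only show rank0's (reduced) values.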