
Only rank0 log metrics to console

hxdtest opened this issue 1 year ago · 0 comments

I use `python -m torch.distributed.run xxx` to launch the training processes. When `reduce_global_loss` is `True`, only rank 0 reduces the global loss; the other ranks keep their local, unreduced values. As a result, each rank logs different metrics to the console, which is confusing:

train/CrossEntropyLoss=0.0370
train/Perplexity=1.038
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=0.0380
train/Perplexity=1.039
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=0.0378
train/Perplexity=1.039
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=0.0383
train/Perplexity=1.039
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=2.421
train/Perplexity=11.25
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288

Only rank 0 should log metrics to the console.
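A minimal sketch of the requested behavior, assuming the `RANK` environment variable that `torch.distributed.run` sets for every worker process; the helper names below are hypothetical and not OLMo's actual API:

```python
import logging
import os


def get_global_rank() -> int:
    # torch.distributed.run (torchrun) sets RANK for every spawned process;
    # default to 0 so single-process runs still log normally.
    return int(os.environ.get("RANK", "0"))


def log_metrics(metrics: dict) -> None:
    # Hypothetical helper: gate console output so only rank 0 prints.
    # Non-zero ranks return immediately and stay silent.
    if get_global_rank() != 0:
        return
    logger = logging.getLogger("olmo.train")
    for name, value in metrics.items():
        logger.info("%s=%s", name, value)
```

An alternative is to keep logging on every rank but attach a `logging.Filter` that drops records on non-zero ranks, so library code never needs to check the rank explicitly.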

hxdtest avatar Feb 15 '24 03:02 hxdtest