DeepSpeed
DeepSpeed copied to clipboard
Uses bf16 training, there is an abnormal loss[BUG]
Describe the bug I use the neox framework to load the weight of llama to continue training. When I use bf16, the loss becomes very large after the weight is updated for the first time in the model. Because neox also uses deepspeed, so the reason for deepspeed is suspected. Has anyone been able to successfully load llama weights with deepspeed, to continue training with bf16
steps: 1 loss: 1.9300 iter time (s): 13.338 samples/sec: 1.200
samples/sec: 1.195 | iteration 1/ 2140 | elapsed time per iteration (ms): 13386.0 | learning rate: 1.417E-07 | approx flops per GPU: 18.8TFLOPS | lm_loss: 1.930013E+00 | number of skipped iterations: 0 | number of nan iterations: 0 |
after 1 iterations memory (MB) | allocated: 26732.29150390625 | max allocated: 30516.369140625 | reserved: 33468.0 | max reserved: 33468.0
time (ms)
[2023-06-12 18:08:09,410] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[2.834890965732087e-07, 2.834890965732087e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 2 loss: 11.4175 iter time (s): 2.441 samples/sec: 6.554
samples/sec: 6.551 | iteration 2/ 2140 | elapsed time per iteration (ms): 2442.5 | learning rate: 2.835E-07 | approx flops per GPU: 103.2TFLOPS | lm_loss: 1.141747E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
[2023-06-12 18:08:11,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[4.2523364485981307e-07, 4.2523364485981307e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 3 loss: 11.4233 iter time (s): 2.386 samples/sec: 6.706
samples/sec: 6.703 | iteration 3/ 2140 | elapsed time per iteration (ms): 2386.9 | learning rate: 4.252E-07 | approx flops per GPU: 105.6TFLOPS | lm_loss: 1.142331E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
[
Hi @suolyer,
May I ask which stage of ZeRO you are using her? Thanks, Reza
I am also having this issue with zero stage 1. Here is the log of my first few steps of training:
[2023-07-25 16:20:45,998] [INFO] [checkpointing.py:529:forward] Activation Checkpointing Information
[2023-07-25 16:20:45,999] [INFO] [checkpointing.py:530:forward] ----Partition Activations True, CPU CHECKPOINTING False
[2023-07-25 16:20:45,999] [INFO] [checkpointing.py:531:forward] ----contiguous Memory Checkpointing False with 32 total layers
[2023-07-25 16:20:45,999] [INFO] [checkpointing.py:533:forward] ----Synchronization True
[2023-07-25 16:20:45,999] [INFO] [checkpointing.py:534:forward] ----Profiling time in checkpointing False
[2023-07-25 16:20:52,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[6.209123271794022e-08, 6.209123271794022e-08], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 16:20:52,730] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1961.48 | backward_microstep: 3392.85 | backward_inner_microstep: 1977.73 | backward_allreduce_microstep: 1415.07 | step_microstep: 1371.42
[2023-07-25 16:20:52,730] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 1961.44 | backward: 3392.84 | backward_inner: 1977.71 | backward_allreduce: 1415.07 | step: 1371.42
samples/sec: 4.603 | iteration 1/ 10000 | elapsed time per iteration (ms): 6952.6 | learning rate: 6.209E-08 | approx flops per GPU: 66.6TFLOPS | lm_loss: 1.400030E+00 | number of skipped iterations: 0 | number of nan iterations: 0 |
after 1 iterations memory (MB) | allocated: 61084.4287109375 | max allocated: 63654.8291015625 | reserved: 63692.0 | max reserved: 78556.0
time (ms) | forward: 2072.79 | backward: 3393.05 | backward-backward: 3392.96 | backward-allreduce: 0.00 | optimizer: 1371.81 | batch generator: 98.76
[2023-07-25 16:20:56,739] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.2418246543588044e-07, 1.2418246543588044e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 16:20:57,066] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 656.84 | backward_microstep: 2141.13 | backward_inner_microstep: 1978.54 | backward_allreduce_microstep: 162.54 | step_microstep: 1412.30
[2023-07-25 16:20:57,066] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 656.83 | backward: 2141.11 | backward_inner: 1978.53 | backward_allreduce: 162.54 | step: 1412.29
samples/sec: 7.385 | iteration 2/ 10000 | elapsed time per iteration (ms): 4332.8 | learning rate: 1.242E-07 | approx flops per GPU: 106.9TFLOPS | lm_loss: 1.118721E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 665.55 | backward: 2141.34 | backward-backward: 2141.24 | backward-allreduce: 0.00 | optimizer: 1413.20 | batch generator: 2.06
[2023-07-25 16:21:00,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[1.862736981538207e-07, 1.862736981538207e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 16:21:01,390] [INFO] [timer.py:215:stop] epoch=0/micro_step=3/global_step=3, RunningAvgSamplesPerSec=7.602492245859087, CurrSamplesPerSec=7.602492245859087, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 16:21:01,391] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 646.41 | backward_microstep: 2144.74 | backward_inner_microstep: 1981.98 | backward_allreduce_microstep: 162.72 | step_microstep: 1412.80
[2023-07-25 16:21:01,392] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 646.39 | backward: 2144.73 | backward_inner: 1981.98 | backward_allreduce: 162.71 | step: 1412.81
samples/sec: 7.403 | iteration 3/ 10000 | elapsed time per iteration (ms): 4322.4 | learning rate: 1.863E-07 | approx flops per GPU: 107.2TFLOPS | lm_loss: 1.119876E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 654.89 | backward: 2144.92 | backward-backward: 2144.84 | backward-allreduce: 0.00 | optimizer: 1413.41 | batch generator: 1.76
[2023-07-25 16:21:04,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[2.483649308717609e-07, 2.483649308717609e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 16:21:05,431] [INFO] [timer.py:215:stop] epoch=0/micro_step=4/global_step=4, RunningAvgSamplesPerSec=7.866832930714489, CurrSamplesPerSec=8.150218192964873, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 16:21:05,431] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 643.06 | backward_microstep: 2146.94 | backward_inner_microstep: 1983.82 | backward_allreduce_microstep: 163.07 | step_microstep: 1130.08
[2023-07-25 16:21:05,431] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 643.05 | backward: 2146.93 | backward_inner: 1983.82 | backward_allreduce: 163.07 | step: 1130.09
samples/sec: 7.917 | iteration 4/ 10000 | elapsed time per iteration (ms): 4042.0 | learning rate: 2.484E-07 | approx flops per GPU: 114.6TFLOPS | lm_loss: 1.119089E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 650.80 | backward: 2147.19 | backward-backward: 2147.13 | backward-allreduce: 0.00 | optimizer: 1130.63 | batch generator: 1.12
[2023-07-25 16:21:09,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[3.1045616358970115e-07, 3.1045616358970115e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 16:21:09,806] [INFO] [timer.py:215:stop] epoch=0/micro_step=5/global_step=5, RunningAvgSamplesPerSec=7.745333076985001, CurrSamplesPerSec=7.513255085261663, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
The following is my deepspeed config. I am using cuda 11.7 with torch 1.13.0 on 8xA100.
{
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00003,
"betas": [0.9, 0.95],
"eps": 1.0e-8,
"weight_decay": 0.1,
}
},
"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 1260000000,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1260000000,
"contiguous_gradients": true,
"cpu_offload": false
},
"gradient_clipping": 0.1,
"bf16": {
"enabled": true,
},
"data_types": {
"grad_accum_dtype": "fp32",
},
"fp32_allreduce": true,
}
Note that If I set gradient clipping to some very low value (in the following log, 1e-7) the loss jump at step 1 still occurs, but the loss doesn't change much in subsequent steps. This suggests to me gradient clipping is working correctly most of the time, but something strange is going on at step 1.
samples/sec: 5.597 | iteration 1/ 10000 | elapsed time per iteration (ms): 5717.3 | learning rate: 6.209E-08 | approx flops per GPU: 81.0TFLOPS | lm_loss: 1.400030E+00 | number of skipped iterations: 0 | number of nan iterations: 0 |
after 1 iterations memory (MB) | allocated: 61084.4287109375 | max allocated: 63654.8291015625 | reserved: 63692.0 | max reserved: 78556.0
time (ms) | forward: 1735.28 | backward: 2990.72 | backward-backward: 2990.63 | backward-allreduce: 0.00 | optimizer: 880.43 | batch generator: 4.81
[2023-07-25 21:01:03,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.2418246543588044e-07, 1.2418246543588044e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:03,502] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 646.77 | backward_microstep: 2142.36 | backward_inner_microstep: 1979.65 | backward_allreduce_microstep: 162.67 | step_microstep: 758.67
[2023-07-25 21:01:03,502] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 646.77 | backward: 2142.35 | backward_inner: 1979.64 | backward_allreduce: 162.67 | step: 758.66
samples/sec: 8.726 | iteration 2/ 10000 | elapsed time per iteration (ms): 3667.3 | learning rate: 1.242E-07 | approx flops per GPU: 126.3TFLOPS | lm_loss: 1.118721E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 656.09 | backward: 2142.44 | backward-backward: 2142.41 | backward-allreduce: 0.00 | optimizer: 758.94 | batch generator: 2.67
[2023-07-25 21:01:06,876] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[1.862736981538207e-07, 1.862736981538207e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:07,425] [INFO] [timer.py:215:stop] epoch=0/micro_step=3/global_step=3, RunningAvgSamplesPerSec=8.39291566396453, CurrSamplesPerSec=8.39291566396453, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:07,427] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 647.83 | backward_microstep: 2143.19 | backward_inner_microstep: 1980.58 | backward_allreduce_microstep: 162.57 | step_microstep: 1017.59
[2023-07-25 21:01:07,427] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 647.82 | backward: 2143.18 | backward_inner: 1980.58 | backward_allreduce: 162.57 | step: 1017.60
samples/sec: 8.160 | iteration 3/ 10000 | elapsed time per iteration (ms): 3921.7 | learning rate: 1.863E-07 | approx flops per GPU: 118.1TFLOPS | lm_loss: 1.119864E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 655.56 | backward: 2143.27 | backward-backward: 2143.24 | backward-allreduce: 0.00 | optimizer: 1018.01 | batch generator: 1.05
[2023-07-25 21:01:11,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[2.483649308717609e-07, 2.483649308717609e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:11,354] [INFO] [timer.py:215:stop] epoch=0/micro_step=4/global_step=4, RunningAvgSamplesPerSec=8.383792559148839, CurrSamplesPerSec=8.374689266439178, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:11,355] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 647.64 | backward_microstep: 2146.25 | backward_inner_microstep: 1983.89 | backward_allreduce_microstep: 162.31 | step_microstep: 1020.84
[2023-07-25 21:01:11,355] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 647.63 | backward: 2146.24 | backward_inner: 1983.89 | backward_allreduce: 162.31 | step: 1020.85
samples/sec: 8.144 | iteration 4/ 10000 | elapsed time per iteration (ms): 3929.4 | learning rate: 2.484E-07 | approx flops per GPU: 117.9TFLOPS | lm_loss: 1.119102E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 654.99 | backward: 2146.32 | backward-backward: 2146.29 | backward-allreduce: 0.00 | optimizer: 1021.19 | batch generator: 0.75
[2023-07-25 21:01:14,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[3.1045616358970115e-07, 3.1045616358970115e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:15,326] [INFO] [timer.py:215:stop] epoch=0/micro_step=5/global_step=5, RunningAvgSamplesPerSec=8.349940296273406, CurrSamplesPerSec=8.283049336940612, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:15,345] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 650.36 | backward_microstep: 2146.16 | backward_inner_microstep: 1983.64 | backward_allreduce_microstep: 162.48 | step_microstep: 1079.16
[2023-07-25 21:01:15,349] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 650.35 | backward: 2146.15 | backward_inner: 1983.64 | backward_allreduce: 162.48 | step: 1079.17
samples/sec: 8.009 | iteration 5/ 10000 | elapsed time per iteration (ms): 3995.6 | learning rate: 3.105E-07 | approx flops per GPU: 115.9TFLOPS | lm_loss: 1.120784E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 657.99 | backward: 2146.26 | backward-backward: 2146.23 | backward-allreduce: 0.00 | optimizer: 1086.02 | batch generator: 1.01
[2023-07-25 21:01:20,074] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[3.725473963076414e-07, 3.725473963076414e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:20,125] [INFO] [timer.py:215:stop] epoch=0/micro_step=6/global_step=6, RunningAvgSamplesPerSec=7.920460284374241, CurrSamplesPerSec=6.861668477031963, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:20,125] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 644.97 | backward_microstep: 2149.87 | backward_inner_microstep: 1987.36 | backward_allreduce_microstep: 162.47 | step_microstep: 1862.41
[2023-07-25 21:01:20,125] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 644.96 | backward: 2149.86 | backward_inner: 1987.36 | backward_allreduce: 162.47 | step: 1862.41
samples/sec: 6.699 | iteration 6/ 10000 | elapsed time per iteration (ms): 4777.1 | learning rate: 3.725E-07 | approx flops per GPU: 97.0TFLOPS | lm_loss: 1.118582E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 652.80 | backward: 2149.98 | backward-backward: 2149.92 | backward-allreduce: 0.00 | optimizer: 1862.77 | batch generator: 1.19
[2023-07-25 21:01:23,484] [INFO] [logging.py:96:log_dist] [Rank 0] step=7, skipped=0, lr=[4.346386290255816e-07, 4.346386290255816e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:24,402] [INFO] [timer.py:215:stop] epoch=0/micro_step=7/global_step=7, RunningAvgSamplesPerSec=7.871425345022869, CurrSamplesPerSec=7.681210298123794, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:24,404] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 644.25 | backward_microstep: 2152.46 | backward_inner_microstep: 1990.01 | backward_allreduce_microstep: 162.41 | step_microstep: 1365.12
[2023-07-25 21:01:24,404] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 644.24 | backward: 2152.45 | backward_inner: 1990.00 | backward_allreduce: 162.42 | step: 1365.14
samples/sec: 7.476 | iteration 7/ 10000 | elapsed time per iteration (ms): 4280.6 | learning rate: 4.346E-07 | approx flops per GPU: 108.2TFLOPS | lm_loss: 1.118143E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 651.87 | backward: 2152.56 | backward-backward: 2152.52 | backward-allreduce: 0.00 | optimizer: 1365.57 | batch generator: 1.01
[2023-07-25 21:01:28,883] [INFO] [logging.py:96:log_dist] [Rank 0] step=8, skipped=0, lr=[4.967298617435218e-07, 4.967298617435218e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:28,912] [INFO] [timer.py:215:stop] epoch=0/micro_step=8/global_step=8, RunningAvgSamplesPerSec=7.766307530423422, CurrSamplesPerSec=7.280195899274274, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:28,913] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 651.09 | backward_microstep: 2152.37 | backward_inner_microstep: 1989.74 | backward_allreduce_microstep: 162.59 | step_microstep: 1585.73
[2023-07-25 21:01:28,913] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 651.08 | backward: 2152.36 | backward_inner: 1989.73 | backward_allreduce: 162.59 | step: 1585.73
samples/sec: 7.101 | iteration 8/ 10000 | elapsed time per iteration (ms): 4506.4 | learning rate: 4.967E-07 | approx flops per GPU: 102.8TFLOPS | lm_loss: 1.118377E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 658.96 | backward: 2152.45 | backward-backward: 2152.42 | backward-allreduce: 0.00 | optimizer: 1586.08 | batch generator: 1.23
[2023-07-25 21:01:32,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=9, skipped=0, lr=[5.58821094461462e-07, 5.58821094461462e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:33,357] [INFO] [timer.py:215:stop] epoch=0/micro_step=9/global_step=9, RunningAvgSamplesPerSec=7.7091896796692705, CurrSamplesPerSec=7.383380171080359, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:33,360] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 648.29 | backward_microstep: 2151.29 | backward_inner_microstep: 1988.81 | backward_allreduce_microstep: 162.44 | step_microstep: 1531.21
[2023-07-25 21:01:33,360] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 648.28 | backward: 2151.28 | backward_inner: 1988.80 | backward_allreduce: 162.44 | step: 1531.22
samples/sec: 7.185 | iteration 9/ 10000 | elapsed time per iteration (ms): 4453.5 | learning rate: 5.588E-07 | approx flops per GPU: 104.0TFLOPS | lm_loss: 1.118108E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 655.92 | backward: 2151.37 | backward-backward: 2151.34 | backward-allreduce: 0.00 | optimizer: 1531.69 | batch generator: 1.01
[2023-07-25 21:01:37,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[6.209123271794023e-07, 6.209123271794023e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:37,297] [INFO] [timer.py:215:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=7.786959071097875, CurrSamplesPerSec=8.37861649982396, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:37,297] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 650.19 | backward_microstep: 2151.65 | backward_inner_microstep: 1989.15 | backward_allreduce_microstep: 162.47 | step_microstep: 1011.07
[2023-07-25 21:01:37,298] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 650.18 | backward: 2151.64 | backward_inner: 1989.14 | backward_allreduce: 162.47 | step: 1011.08
samples/sec: 8.141 | iteration 10/ 10000 | elapsed time per iteration (ms): 3930.8 | learning rate: 6.209E-07 | approx flops per GPU: 117.8TFLOPS | lm_loss: 1.119786E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 657.62 | backward: 2151.75 | backward-backward: 2151.71 | backward-allreduce: 0.00 | optimizer: 1011.44 | batch generator: 0.79
[2023-07-25 21:01:40,644] [INFO] [logging.py:96:log_dist] [Rank 0] step=11, skipped=0, lr=[6.830035598973424e-07, 6.830035598973424e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:41,921] [INFO] [timer.py:215:stop] epoch=0/micro_step=11/global_step=11, RunningAvgSamplesPerSec=7.702737381175319, CurrSamplesPerSec=7.089327086182841, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:41,924] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 648.92 | backward_microstep: 2154.29 | backward_inner_microstep: 1991.88 | backward_allreduce_microstep: 162.37 | step_microstep: 1706.23
[2023-07-25 21:01:41,924] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 648.91 | backward: 2154.28 | backward_inner: 1991.87 | backward_allreduce: 162.37 | step: 1706.23
samples/sec: 6.922 | iteration 11/ 10000 | elapsed time per iteration (ms): 4622.9 | learning rate: 6.830E-07 | approx flops per GPU: 100.2TFLOPS | lm_loss: 1.118007E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 656.91 | backward: 2154.37 | backward-backward: 2154.34 | backward-allreduce: 0.00 | optimizer: 1706.62 | batch generator: 1.35
[2023-07-25 21:01:45,690] [INFO] [logging.py:96:log_dist] [Rank 0] step=12, skipped=0, lr=[7.450947926152828e-07, 7.450947926152828e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:45,741] [INFO] [timer.py:215:stop] epoch=0/micro_step=12/global_step=12, RunningAvgSamplesPerSec=7.785899432471879, CurrSamplesPerSec=8.623861553259447, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:45,741] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 645.11 | backward_microstep: 2162.66 | backward_inner_microstep: 2000.16 | backward_allreduce_microstep: 162.46 | step_microstep: 896.46
[2023-07-25 21:01:45,741] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 645.10 | backward: 2162.65 | backward_inner: 2000.15 | backward_allreduce: 162.46 | step: 896.46
samples/sec: 8.377 | iteration 12/ 10000 | elapsed time per iteration (ms): 3820.1 | learning rate: 7.451E-07 | approx flops per GPU: 121.2TFLOPS | lm_loss: 1.119540E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 652.76 | backward: 2162.74 | backward-backward: 2162.71 | backward-allreduce: 0.00 | optimizer: 896.84 | batch generator: 0.91
[2023-07-25 21:01:49,184] [INFO] [logging.py:96:log_dist] [Rank 0] step=13, skipped=0, lr=[8.071860253332229e-07, 8.071860253332229e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:49,368] [INFO] [timer.py:215:stop] epoch=0/micro_step=13/global_step=13, RunningAvgSamplesPerSec=7.889260990909493, CurrSamplesPerSec=9.096920990527913, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:49,368] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 643.93 | backward_microstep: 2158.39 | backward_inner_microstep: 1996.01 | backward_allreduce_microstep: 162.34 | step_microstep: 709.00
[2023-07-25 21:01:49,368] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 643.92 | backward: 2158.39 | backward_inner: 1996.01 | backward_allreduce: 162.34 | step: 709.01
samples/sec: 8.828 | iteration 13/ 10000 | elapsed time per iteration (ms): 3624.9 | learning rate: 8.072E-07 | approx flops per GPU: 127.8TFLOPS | lm_loss: 1.119554E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 651.50 | backward: 2158.47 | backward-backward: 2158.45 | backward-allreduce: 0.00 | optimizer: 709.37 | batch generator: 0.94
[2023-07-25 21:01:53,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=14, skipped=0, lr=[8.692772580511632e-07, 8.692772580511632e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:53,646] [INFO] [timer.py:215:stop] epoch=0/micro_step=14/global_step=14, RunningAvgSamplesPerSec=7.874326893011766, CurrSamplesPerSec=7.713707303703314, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:53,646] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 643.80 | backward_microstep: 2162.67 | backward_inner_microstep: 2000.27 | backward_allreduce_microstep: 162.37 | step_microstep: 1335.71
[2023-07-25 21:01:53,647] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 643.78 | backward: 2162.66 | backward_inner: 2000.26 | backward_allreduce: 162.37 | step: 1335.72
samples/sec: 7.471 | iteration 14/ 10000 | elapsed time per iteration (ms): 4283.0 | learning rate: 8.693E-07 | approx flops per GPU: 108.1TFLOPS | lm_loss: 1.119902E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 651.19 | backward: 2162.75 | backward-backward: 2162.72 | backward-allreduce: 0.00 | optimizer: 1336.07 | batch generator: 0.80
[2023-07-25 21:01:57,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=15, skipped=0, lr=[9.313684907691033e-07, 9.313684907691033e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:01:57,474] [INFO] [timer.py:215:stop] epoch=0/micro_step=15/global_step=15, RunningAvgSamplesPerSec=7.9264456854471845, CurrSamplesPerSec=8.610328810171882, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:01:57,475] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 652.32 | backward_microstep: 2322.16 | backward_inner_microstep: 1995.75 | backward_allreduce_microstep: 326.38 | step_microstep: 735.71
[2023-07-25 21:01:57,475] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 652.31 | backward: 2322.16 | backward_inner: 1995.74 | backward_allreduce: 326.38 | step: 735.71
samples/sec: 8.364 | iteration 15/ 10000 | elapsed time per iteration (ms): 3825.7 | learning rate: 9.314E-07 | approx flops per GPU: 121.1TFLOPS | lm_loss: 1.119346E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 659.87 | backward: 2322.26 | backward-backward: 2322.22 | backward-allreduce: 0.00 | optimizer: 736.11 | batch generator: 0.93
[2023-07-25 21:02:01,160] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=0, lr=[9.934597234870435e-07, 9.934597234870435e-07], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:01,735] [INFO] [timer.py:215:stop] epoch=0/micro_step=16/global_step=16, RunningAvgSamplesPerSec=7.910812211981883, CurrSamplesPerSec=7.7130485877991894, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:01,737] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 653.92 | backward_microstep: 2324.93 | backward_inner_microstep: 1995.46 | backward_allreduce_microstep: 329.43 | step_microstep: 1165.53
[2023-07-25 21:02:01,737] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 653.91 | backward: 2324.92 | backward_inner: 1995.46 | backward_allreduce: 329.43 | step: 1165.53
samples/sec: 7.512 | iteration 16/ 10000 | elapsed time per iteration (ms): 4260.0 | learning rate: 9.935E-07 | approx flops per GPU: 108.7TFLOPS | lm_loss: 1.119148E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 661.62 | backward: 2325.02 | backward-backward: 2324.99 | backward-allreduce: 0.00 | optimizer: 1165.89 | batch generator: 1.05
[2023-07-25 21:02:05,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=0, lr=[1.055550956204984e-06, 1.055550956204984e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:06,175] [INFO] [timer.py:215:stop] epoch=0/micro_step=17/global_step=17, RunningAvgSamplesPerSec=7.873752632626502, CurrSamplesPerSec=7.3891330779409286, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:06,175] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 646.47 | backward_microstep: 2418.45 | backward_inner_microstep: 2001.91 | backward_allreduce_microstep: 416.50 | step_microstep: 1259.07
[2023-07-25 21:02:06,175] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 646.46 | backward: 2418.44 | backward_inner: 2001.90 | backward_allreduce: 416.50 | step: 1259.07
samples/sec: 7.211 | iteration 17/ 10000 | elapsed time per iteration (ms): 4437.8 | learning rate: 1.056E-06 | approx flops per GPU: 104.4TFLOPS | lm_loss: 1.119612E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 654.09 | backward: 2418.95 | backward-backward: 2418.90 | backward-allreduce: 0.00 | optimizer: 1259.44 | batch generator: 0.99
[2023-07-25 21:02:10,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=18, skipped=0, lr=[1.117642188922924e-06, 1.117642188922924e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:10,359] [INFO] [timer.py:215:stop] epoch=0/micro_step=18/global_step=18, RunningAvgSamplesPerSec=7.8721691349682, CurrSamplesPerSec=7.848492870305349, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:10,359] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 657.72 | backward_microstep: 2159.66 | backward_inner_microstep: 1997.09 | backward_allreduce_microstep: 162.54 | step_microstep: 1253.82
[2023-07-25 21:02:10,360] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 657.71 | backward: 2159.66 | backward_inner: 1997.08 | backward_allreduce: 162.54 | step: 1253.82
samples/sec: 7.644 | iteration 18/ 10000 | elapsed time per iteration (ms): 4186.4 | learning rate: 1.118E-06 | approx flops per GPU: 110.6TFLOPS | lm_loss: 1.120358E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 665.29 | backward: 2159.75 | backward-backward: 2159.72 | backward-allreduce: 0.00 | optimizer: 1254.22 | batch generator: 0.97
[2023-07-25 21:02:14,503] [INFO] [logging.py:96:log_dist] [Rank 0] step=19, skipped=0, lr=[1.1797334216408642e-06, 1.1797334216408642e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:14,995] [INFO] [timer.py:215:stop] epoch=0/micro_step=19/global_step=19, RunningAvgSamplesPerSec=7.819875552223756, CurrSamplesPerSec=7.068587312961287, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:14,997] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 652.11 | backward_microstep: 2156.45 | backward_inner_microstep: 1993.87 | backward_allreduce_microstep: 162.54 | step_microstep: 1714.23
[2023-07-25 21:02:14,997] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 652.10 | backward: 2156.44 | backward_inner: 1993.87 | backward_allreduce: 162.54 | step: 1714.23
samples/sec: 6.902 | iteration 19/ 10000 | elapsed time per iteration (ms): 4636.3 | learning rate: 1.180E-06 | approx flops per GPU: 99.9TFLOPS | lm_loss: 1.119175E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 659.60 | backward: 2156.53 | backward-backward: 2156.50 | backward-allreduce: 0.00 | optimizer: 1714.60 | batch generator: 0.86
[2023-07-25 21:02:18,692] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[1.2418246543588046e-06, 1.2418246543588046e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:19,296] [INFO] [timer.py:215:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=7.809314109735437, CurrSamplesPerSec=7.634036457321675, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:19,297] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 648.33 | backward_microstep: 2160.51 | backward_inner_microstep: 1998.18 | backward_allreduce_microstep: 162.29 | step_microstep: 1376.67
[2023-07-25 21:02:19,297] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 648.32 | backward: 2160.50 | backward_inner: 1998.18 | backward_allreduce: 162.29 | step: 1376.68
samples/sec: 7.439 | iteration 20/ 10000 | elapsed time per iteration (ms): 4301.7 | learning rate: 1.242E-06 | approx flops per GPU: 107.7TFLOPS | lm_loss: 1.118885E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 656.14 | backward: 2160.59 | backward-backward: 2160.56 | backward-allreduce: 0.00 | optimizer: 1377.05 | batch generator: 1.22
[2023-07-25 21:02:22,714] [INFO] [logging.py:96:log_dist] [Rank 0] step=21, skipped=0, lr=[1.3039158870767448e-06, 1.3039158870767448e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:23,258] [INFO] [timer.py:215:stop] epoch=0/micro_step=21/global_step=21, RunningAvgSamplesPerSec=7.833987096702447, CurrSamplesPerSec=8.306368205838035, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:23,259] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 653.21 | backward_microstep: 2158.49 | backward_inner_microstep: 1996.18 | backward_allreduce_microstep: 162.27 | step_microstep: 1034.48
[2023-07-25 21:02:23,259] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 653.20 | backward: 2158.48 | backward_inner: 1996.17 | backward_allreduce: 162.27 | step: 1034.48
samples/sec: 8.085 | iteration 21/ 10000 | elapsed time per iteration (ms): 3957.9 | learning rate: 1.304E-06 | approx flops per GPU: 117.0TFLOPS | lm_loss: 1.120181E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 660.60 | backward: 2158.57 | backward-backward: 2158.54 | backward-allreduce: 0.00 | optimizer: 1034.85 | batch generator: 0.77
[2023-07-25 21:02:26,925] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.3660071197946848e-06, 1.3660071197946848e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:27,737] [INFO] [timer.py:215:stop] epoch=0/micro_step=22/global_step=22, RunningAvgSamplesPerSec=7.806389606063555, CurrSamplesPerSec=7.316663320453091, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:27,739] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 645.65 | backward_microstep: 2164.15 | backward_inner_microstep: 2001.83 | backward_allreduce_microstep: 162.28 | step_microstep: 1559.32
[2023-07-25 21:02:27,740] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 645.65 | backward: 2164.14 | backward_inner: 2001.82 | backward_allreduce: 162.28 | step: 1559.32
samples/sec: 7.131 | iteration 22/ 10000 | elapsed time per iteration (ms): 4487.2 | learning rate: 1.366E-06 | approx flops per GPU: 103.2TFLOPS | lm_loss: 1.119545E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 653.00 | backward: 2164.22 | backward-backward: 2164.20 | backward-allreduce: 0.00 | optimizer: 1559.69 | batch generator: 0.76
[2023-07-25 21:02:31,076] [INFO] [logging.py:96:log_dist] [Rank 0] step=23, skipped=0, lr=[1.428098352512625e-06, 1.428098352512625e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:31,902] [INFO] [timer.py:215:stop] epoch=0/micro_step=23/global_step=23, RunningAvgSamplesPerSec=7.810841755584336, CurrSamplesPerSec=7.900963495894691, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:31,904] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 644.35 | backward_microstep: 2164.42 | backward_inner_microstep: 2001.93 | backward_allreduce_microstep: 162.45 | step_microstep: 1236.79
[2023-07-25 21:02:31,904] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 644.34 | backward: 2164.41 | backward_inner: 2001.92 | backward_allreduce: 162.45 | step: 1236.80
samples/sec: 7.681 | iteration 23/ 10000 | elapsed time per iteration (ms): 4166.3 | learning rate: 1.428E-06 | approx flops per GPU: 111.2TFLOPS | lm_loss: 1.121456E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 652.12 | backward: 2164.51 | backward-backward: 2164.48 | backward-allreduce: 0.00 | optimizer: 1237.22 | batch generator: 1.12
[2023-07-25 21:02:35,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=24, skipped=0, lr=[1.4901895852305655e-06, 1.4901895852305655e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:35,760] [INFO] [timer.py:215:stop] epoch=0/micro_step=24/global_step=24, RunningAvgSamplesPerSec=7.841662302058306, CurrSamplesPerSec=8.550155710768728, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:35,762] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 646.01 | backward_microstep: 2166.18 | backward_inner_microstep: 2003.90 | backward_allreduce_microstep: 162.23 | step_microstep: 926.15
[2023-07-25 21:02:35,763] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 646.00 | backward: 2166.17 | backward_inner: 2003.90 | backward_allreduce: 162.23 | step: 926.16
samples/sec: 8.302 | iteration 24/ 10000 | elapsed time per iteration (ms): 3854.5 | learning rate: 1.490E-06 | approx flops per GPU: 120.2TFLOPS | lm_loss: 1.117656E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 653.51 | backward: 2166.28 | backward-backward: 2166.25 | backward-allreduce: 0.00 | optimizer: 926.59 | batch generator: 0.87
[2023-07-25 21:02:39,656] [INFO] [logging.py:96:log_dist] [Rank 0] step=25, skipped=0, lr=[1.5522808179485057e-06, 1.5522808179485057e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:39,708] [INFO] [timer.py:215:stop] epoch=0/micro_step=25/global_step=25, RunningAvgSamplesPerSec=7.862151859402401, CurrSamplesPerSec=8.3416642417714, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:39,708] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 653.01 | backward_microstep: 2160.01 | backward_inner_microstep: 1997.64 | backward_allreduce_microstep: 162.33 | step_microstep: 1016.80
[2023-07-25 21:02:39,709] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 653.00 | backward: 2160.00 | backward_inner: 1997.63 | backward_allreduce: 162.33 | step: 1016.80
samples/sec: 8.115 | iteration 25/ 10000 | elapsed time per iteration (ms): 3943.4 | learning rate: 1.552E-06 | approx flops per GPU: 117.4TFLOPS | lm_loss: 1.118580E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 660.63 | backward: 2160.10 | backward-backward: 2160.07 | backward-allreduce: 0.00 | optimizer: 1017.16 | batch generator: 1.00
[2023-07-25 21:02:43,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=26, skipped=0, lr=[1.6143720506664457e-06, 1.6143720506664457e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:43,796] [INFO] [timer.py:215:stop] epoch=0/micro_step=26/global_step=26, RunningAvgSamplesPerSec=7.8693415082764115, CurrSamplesPerSec=8.038410622045285, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:43,796] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 652.92 | backward_microstep: 2162.25 | backward_inner_microstep: 1999.88 | backward_allreduce_microstep: 162.33 | step_microstep: 1159.39
[2023-07-25 21:02:43,797] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 652.91 | backward: 2162.24 | backward_inner: 1999.87 | backward_allreduce: 162.33 | step: 1159.40
samples/sec: 7.817 | iteration 26/ 10000 | elapsed time per iteration (ms): 4093.6 | learning rate: 1.614E-06 | approx flops per GPU: 113.1TFLOPS | lm_loss: 1.119049E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 660.31 | backward: 2162.33 | backward-backward: 2162.31 | backward-allreduce: 0.00 | optimizer: 1159.76 | batch generator: 0.76
[2023-07-25 21:02:47,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=27, skipped=0, lr=[1.6764632833843862e-06, 1.6764632833843862e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:47,866] [INFO] [timer.py:215:stop] epoch=0/micro_step=27/global_step=27, RunningAvgSamplesPerSec=7.877800023656104, CurrSamplesPerSec=8.08640392337608, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:47,866] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 652.05 | backward_microstep: 2161.14 | backward_inner_microstep: 1998.67 | backward_allreduce_microstep: 162.43 | step_microstep: 1137.73
[2023-07-25 21:02:47,867] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 652.04 | backward: 2161.13 | backward_inner: 1998.66 | backward_allreduce: 162.43 | step: 1137.73
samples/sec: 7.875 | iteration 27/ 10000 | elapsed time per iteration (ms): 4063.5 | learning rate: 1.676E-06 | approx flops per GPU: 114.0TFLOPS | lm_loss: 1.117566E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 659.52 | backward: 2161.22 | backward-backward: 2161.19 | backward-allreduce: 0.00 | optimizer: 1138.08 | batch generator: 0.84
[2023-07-25 21:02:51,258] [INFO] [logging.py:96:log_dist] [Rank 0] step=28, skipped=0, lr=[1.7385545161023264e-06, 1.7385545161023264e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:52,101] [INFO] [timer.py:215:stop] epoch=0/micro_step=28/global_step=28, RunningAvgSamplesPerSec=7.872831606147124, CurrSamplesPerSec=7.750626334778097, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:52,101] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 653.35 | backward_microstep: 2162.64 | backward_inner_microstep: 2000.28 | backward_allreduce_microstep: 162.32 | step_microstep: 1306.44
[2023-07-25 21:02:52,102] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 653.34 | backward: 2162.63 | backward_inner: 2000.27 | backward_allreduce: 162.32 | step: 1306.45
samples/sec: 7.554 | iteration 28/ 10000 | elapsed time per iteration (ms): 4236.3 | learning rate: 1.739E-06 | approx flops per GPU: 109.3TFLOPS | lm_loss: 1.119163E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 660.80 | backward: 2162.72 | backward-backward: 2162.70 | backward-allreduce: 0.00 | optimizer: 1306.82 | batch generator: 0.82
[2023-07-25 21:02:55,949] [INFO] [logging.py:96:log_dist] [Rank 0] step=29, skipped=0, lr=[1.8006457488202664e-06, 1.8006457488202664e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:02:56,055] [INFO] [timer.py:215:stop] epoch=0/micro_step=29/global_step=29, RunningAvgSamplesPerSec=7.888491962265259, CurrSamplesPerSec=8.318721897419659, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:02:56,056] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 649.15 | backward_microstep: 2164.85 | backward_inner_microstep: 2002.42 | backward_allreduce_microstep: 162.39 | step_microstep: 1026.44
[2023-07-25 21:02:56,056] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 649.14 | backward: 2164.84 | backward_inner: 2002.41 | backward_allreduce: 162.39 | step: 1026.45
samples/sec: 8.094 | iteration 29/ 10000 | elapsed time per iteration (ms): 3953.4 | learning rate: 1.801E-06 | approx flops per GPU: 117.2TFLOPS | lm_loss: 1.119831E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 656.54 | backward: 2164.92 | backward-backward: 2164.90 | backward-allreduce: 0.00 | optimizer: 1026.81 | batch generator: 0.78
[2023-07-25 21:02:59,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.8627369815382066e-06, 1.8627369815382066e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:03:00,515] [INFO] [timer.py:215:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=7.86798048242359, CurrSamplesPerSec=7.351845669318769, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:03:00,515] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 652.65 | backward_microstep: 2164.72 | backward_inner_microstep: 2002.41 | backward_allreduce_microstep: 162.28 | step_microstep: 1529.50
[2023-07-25 21:03:00,516] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 652.64 | backward: 2164.71 | backward_inner: 2002.40 | backward_allreduce: 162.28 | step: 1529.50
samples/sec: 7.172 | iteration 30/ 10000 | elapsed time per iteration (ms): 4461.8 | learning rate: 1.863E-06 | approx flops per GPU: 103.8TFLOPS | lm_loss: 1.118670E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 659.96 | backward: 2164.81 | backward-backward: 2164.78 | backward-allreduce: 0.00 | optimizer: 1529.95 | batch generator: 0.70
[2023-07-25 21:03:04,366] [INFO] [logging.py:96:log_dist] [Rank 0] step=31, skipped=0, lr=[1.9248282142561473e-06, 1.9248282142561473e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:03:04,830] [INFO] [timer.py:215:stop] epoch=0/micro_step=31/global_step=31, RunningAvgSamplesPerSec=7.858730012513352, CurrSamplesPerSec=7.60826661060043, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:03:04,830] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 652.71 | backward_microstep: 2162.19 | backward_inner_microstep: 1999.82 | backward_allreduce_microstep: 162.33 | step_microstep: 1384.75
[2023-07-25 21:03:04,831] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 652.70 | backward: 2162.19 | backward_inner: 1999.82 | backward_allreduce: 162.33 | step: 1384.75
samples/sec: 7.422 | iteration 31/ 10000 | elapsed time per iteration (ms): 4311.5 | learning rate: 1.925E-06 | approx flops per GPU: 107.4TFLOPS | lm_loss: 1.119522E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 660.41 | backward: 2162.28 | backward-backward: 2162.25 | backward-allreduce: 0.00 | optimizer: 1385.11 | batch generator: 1.06
[2023-07-25 21:03:08,239] [INFO] [logging.py:96:log_dist] [Rank 0] step=32, skipped=0, lr=[1.986919446974087e-06, 1.986919446974087e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:03:09,023] [INFO] [timer.py:215:stop] epoch=0/micro_step=32/global_step=32, RunningAvgSamplesPerSec=7.857739327479001, CurrSamplesPerSec=7.829117717811468, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:03:09,023] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 653.51 | backward_microstep: 2164.57 | backward_inner_microstep: 2002.29 | backward_allreduce_microstep: 162.24 | step_microstep: 1262.99
[2023-07-25 21:03:09,023] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 653.50 | backward: 2164.56 | backward_inner: 2002.28 | backward_allreduce: 162.24 | step: 1262.99
samples/sec: 7.627 | iteration 32/ 10000 | elapsed time per iteration (ms): 4195.9 | learning rate: 1.987E-06 | approx flops per GPU: 110.4TFLOPS | lm_loss: 1.118780E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 660.97 | backward: 2164.65 | backward-backward: 2164.62 | backward-allreduce: 0.00 | optimizer: 1263.37 | batch generator: 0.85
[2023-07-25 21:03:13,137] [INFO] [logging.py:96:log_dist] [Rank 0] step=33, skipped=0, lr=[2.0490106796920273e-06, 2.0490106796920273e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:03:13,166] [INFO] [timer.py:215:stop] epoch=0/micro_step=33/global_step=33, RunningAvgSamplesPerSec=7.860096594594543, CurrSamplesPerSec=7.931478242904953, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:03:13,166] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 649.05 | backward_microstep: 2161.46 | backward_inner_microstep: 1999.11 | backward_allreduce_microstep: 162.31 | step_microstep: 1217.75
[2023-07-25 21:03:13,166] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 649.04 | backward: 2161.45 | backward_inner: 1999.10 | backward_allreduce: 162.31 | step: 1217.76
samples/sec: 7.726 | iteration 33/ 10000 | elapsed time per iteration (ms): 4141.9 | learning rate: 2.049E-06 | approx flops per GPU: 111.8TFLOPS | lm_loss: 1.118077E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 656.55 | backward: 2161.55 | backward-backward: 2161.52 | backward-allreduce: 0.00 | optimizer: 1218.10 | batch generator: 0.88
[2023-07-25 21:03:16,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=34, skipped=0, lr=[2.111101912409968e-06, 2.111101912409968e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:03:17,269] [INFO] [timer.py:215:stop] epoch=0/micro_step=34/global_step=34, RunningAvgSamplesPerSec=7.864669709950182, CurrSamplesPerSec=8.00912418308922, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:03:17,269] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 652.42 | backward_microstep: 2164.53 | backward_inner_microstep: 2001.89 | backward_allreduce_microstep: 162.60 | step_microstep: 1172.13
[2023-07-25 21:03:17,269] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 652.41 | backward: 2164.52 | backward_inner: 2001.89 | backward_allreduce: 162.60 | step: 1172.13
samples/sec: 7.799 | iteration 34/ 10000 | elapsed time per iteration (ms): 4103.1 | learning rate: 2.111E-06 | approx flops per GPU: 112.9TFLOPS | lm_loss: 1.118665E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 659.86 | backward: 2164.69 | backward-backward: 2164.66 | backward-allreduce: 0.00 | optimizer: 1172.51 | batch generator: 0.83
[2023-07-25 21:03:21,037] [INFO] [logging.py:96:log_dist] [Rank 0] step=35, skipped=0, lr=[2.1731931451279077e-06, 2.1731931451279077e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-07-25 21:03:21,492] [INFO] [timer.py:215:stop] epoch=0/micro_step=35/global_step=35, RunningAvgSamplesPerSec=7.861909484389783, CurrSamplesPerSec=7.77459389773832, MemAllocated=59.65GB, MaxMemAllocated=63.31GB
[2023-07-25 21:03:21,493] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 654.02 | backward_microstep: 2163.26 | backward_inner_microstep: 2000.73 | backward_allreduce_microstep: 162.50 | step_microstep: 1292.39
[2023-07-25 21:03:21,493] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 654.01 | backward: 2163.26 | backward_inner: 2000.72 | backward_allreduce: 162.50 | step: 1292.39
samples/sec: 7.578 | iteration 35/ 10000 | elapsed time per iteration (ms): 4222.9 | learning rate: 2.173E-06 | approx flops per GPU: 109.7TFLOPS | lm_loss: 1.117811E+01 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 661.53 | backward: 2163.35 | backward-backward: 2163.32 | backward-allreduce: 0.00 | optimizer: 1292.77 | batch generator: 0.89
[2023-07-25 21:03:24,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=36, skipped=0, lr=[2.235284377845848e-06, 2.235284377845848e-06], mom=[[0.9, 0.95], [0.9, 0.95]]