DeepSpeed icon indicating copy to clipboard operation
DeepSpeed copied to clipboard

How do I log the result of `wall_clock_breakdown` from all ranks?

Open jaywonchung opened this issue 3 years ago • 0 comments

Hi!

I'm trying out the pipeline parallelism example in microsoft/DeepSpeedExamples here on DeepSpeed commit ebed51df787579b8b3f836bb155ce0bf97f4ab66.

I turned on wall_clock_breakdown in ds_config.json and have four pipeline stages across four A40 GPUs. Training runs well.

However, I was wondering if there's an option to have all four pipeline stages report their time breakdowns. This is an excerpt of the output I'm currently getting, which looks like only rank 0 is reporting the computation & communication time for its own stage (i.e. first stage of the pipeline):

steps: 20 loss: 4.6178 iter time (s): 0.994 samples/sec: 257.601
[2022-08-25 02:58:04,341] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 41.71 | pipe_recv_grad: 1725.64
[2022-08-25 02:58:13,587] [INFO] [logging.py:68:log_dist] [Rank 0] step=30, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
[2022-08-25 02:58:14,298] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 209.59 | forward_microstep: 2785.85 | backward_microstep: 5153.22 | backward_inner_microstep: 5153.13 | backward_allreduce_microstep: 0.00 | step_microstep: 12.50
[2022-08-25 02:58:14,299] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 2785.85 | backward: 5153.18 | backward_inner: 5153.08 | backward_allreduce: 0.00 | step: 12.50
steps: 30 loss: 5.8922 iter time (s): 0.995 samples/sec: 257.271
[2022-08-25 02:58:14,300] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 41.68 | pipe_recv_grad: 1725.21
[2022-08-25 02:58:23,540] [INFO] [logging.py:68:log_dist] [Rank 0] step=40, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
[2022-08-25 02:58:24,251] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 204.02 | forward_microstep: 2786.27 | backward_microstep: 5153.85 | backward_inner_microstep: 5153.75 | backward_allreduce_microstep: 0.00 | step_microstep: 12.51
[2022-08-25 02:58:24,252] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 2786.28 | backward: 5153.80 | backward_inner: 5153.71 | backward_allreduce: 0.00 | step: 12.51

Thanks a lot.

jaywonchung avatar Aug 25 '22 23:08 jaywonchung