DeepSpeed flops_profiler: add option recompute_fwd for the case of activation c…

trafficstars

…heckpoint

When activation checkpointing is enabled, most of forward is re-computed, and so the FLOPS calculation should be updated.

In this PR, we just suppose all fwd is re-computed as an estimation.

I don't find a way to pass the option from model script to deepspeed engine, and so add option directly for flops_profiler.

Apr 24 '23 08:04 guoyejun

with this PR, if the recompute_fwd is true, the output changes from

bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: ... TFLOPS fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): ... TFLOPS FLOPS per GPU = 3 * fwd flops per GPU / iter latency: ... TFLOPS

to

bwd FLOPS per GPU = 3 * fwd flops per GPU / bwd latency: ... TFLOPS fwd+bwd FLOPS per GPU = 4 * fwd flops per GPU / (fwd+bwd latency): ... TFLOPS FLOPS per GPU = 4 * fwd flops per GPU / iter latency: ... TFLOPS

It is more accurate with this PR.

Apr 26 '23 02:04 guoyejun

just have another idea, shall we change "recompute_fwd" from bool to float among [0.0, 1.0] (and also rename it as recompute_fwd_factor), since not all gemms are recomputed in the cases such as seq parallel in megatron-lm.

for the current implementation in megatron-deepspeed, we can set recompute_fwd_factor as 1.0.

comment are welcome, thanks.

Apr 26 '23 08:04 guoyejun

any comment? thanks.

Apr 28 '23 01:04 guoyejun

@jeffra @tjruwase @cli99 any comment? thanks.

May 06 '23 01:05 guoyejun

just have another idea, shall we change "recompute_fwd" from bool to float among [0.0, 1.0] (and also rename it as recompute_fwd_factor), since not all gemms are recomputed in the cases such as seq parallel in megatron-lm.

for the current implementation in megatron-deepspeed, we can set recompute_fwd_factor as 1.0.

comment are welcome, thanks.

plan to change to this idea if no other comment, thanks.

May 12 '23 02:05 guoyejun

just have another idea, shall we change "recompute_fwd" from bool to float among [0.0, 1.0] (and also rename it as recompute_fwd_factor), since not all gemms are recomputed in the cases such as seq parallel in megatron-lm. for the current implementation in megatron-deepspeed, we can set recompute_fwd_factor as 1.0. comment are welcome, thanks.

plan to change to this idea if no other comment, thanks.

recompute_fwd_factor is used in the latest commit, could you please take a review, thanks. @jeffra @tjruwase @cli99

May 15 '23 12:05 yejunguo

@guoyejun thanks, recompute_fwd_factor is good to have.

May 22 '23 22:05 cli99

DeepSpeed DeepSpeed copied to clipboard

flops_profiler: add option recompute_fwd for the case of activation c…

DeepSpeed
DeepSpeed copied to clipboard