DeepSpeed Bad performance when there are lots of optim_groups (for example, using layer-wise learning rate)

Open BlinkDL opened this issue 3 years ago • 1 comments

DeepSpeed is much slower when there are lots of optim_groups in FusedAdam.

This happens, for example, when you use a layer-wise learning rate for a model with 100+ layers.

In that case, you will see 100+ "partitions" in each "Rank", and the training speed is much worse.
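
For readers unfamiliar with the setup, here is a minimal sketch of how a layer-wise learning rate ends up producing 100+ optim_groups. It is not the code from this report: the helper `layerwise_param_groups`, the `blocks.<i>.` naming scheme, and the `base_lr`/`decay` values are all assumptions made for illustration, and placeholder objects stand in for real tensors so the snippet runs without any framework installed.

```python
import re
from collections import defaultdict


def layerwise_param_groups(named_params, base_lr=1e-4, decay=0.9):
    """Build one optimizer param group per layer (hypothetical helper).

    Assumes parameter names contain the layer index as ".<i>.",
    e.g. "blocks.12.weight". Parameters without a layer index
    (embeddings, head, ...) share one extra group at base_lr.
    """
    layer_re = re.compile(r"\.(\d+)\.")
    by_layer = defaultdict(list)
    for name, param in named_params:
        match = layer_re.search(name)
        layer = int(match.group(1)) if match else -1  # -1: not in a numbered layer
        by_layer[layer].append(param)

    top = max(by_layer)  # index of the last layer
    groups = []
    for layer in sorted(by_layer):
        # Lower layers get a smaller lr; the top layer and the
        # non-layer group keep base_lr.
        depth = top - layer if layer >= 0 else 0
        groups.append({"params": by_layer[layer], "lr": base_lr * decay ** depth})
    return groups


# Stand-in for model.named_parameters() on a 120-layer model.
names = [f"blocks.{i}.weight" for i in range(120)] + ["head.weight"]
fake_params = [(name, object()) for name in names]

optim_groups = layerwise_param_groups(fake_params)
print(len(optim_groups))  # one group per layer, plus one for the rest
```

With a real model, a list shaped like `optim_groups` is what gets passed as the first argument to the optimizer, so the number of groups grows with the number of layers.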

Jul 22 '22 05:07 BlinkDL

@BlinkDL, can you provide details to repro?

Jul 27 '22 15:07 tjruwase