DeepSpeed
Bad performance when there are lots of optim_groups (for example, using layer-wise learning rate)
DeepSpeed is much slower when FusedAdam is given a large number of optim_groups, for example when you use a layer-wise learning rate for a model with 100+ layers.
In that case, you will see 100+ "partitions" in each "Rank", and training speed is much worse.
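For illustration, here is a minimal sketch of how a layer-wise learning rate leads to one optimizer group per layer. It only builds the list of group dicts (the `{"params": ..., "lr": ...}` form that PyTorch optimizers accept); the function name, `base_lr`, `decay`, and the placeholder parameter names are assumptions for this example and are not taken from the original setup.

```python
def layerwise_lr_groups(layers, base_lr=1e-3, decay=0.9):
    """Build one optimizer group per layer with a geometrically decayed LR.

    `layers` is a list of per-layer parameter lists, ordered from input to output.
    All names and default values here are hypothetical.
    """
    n = len(layers)
    groups = []
    for i, params in enumerate(layers):
        # The last layer keeps base_lr; earlier layers get smaller rates.
        lr = base_lr * decay ** (n - 1 - i)
        groups.append({"params": list(params), "lr": lr})
    return groups


# 120 hypothetical layers; strings stand in for the real parameter tensors.
layers = [[f"layer{i}.weight", f"layer{i}.bias"] for i in range(120)]
optim_groups = layerwise_lr_groups(layers)
print(len(optim_groups))  # one group per layer
```

In a real run the inner lists would hold the model's parameters, and `optim_groups` would be passed to the optimizer, so a 100+ layer model yields 100+ groups.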
@BlinkDL, can you provide details to repro?