
Parameter fusion in optimizer partitioning makes LAMB behave differently

Open szhengac opened this issue 3 years ago • 5 comments

With optimizer partitioning, the parameters are fused into one big vector and then partitioned across workers, so the number of chunks can be much smaller than the number of layers. This is fine for SGD or Adam, but for optimizers such as LARS and LAMB the behaviour changes, because they compute a layer-wise scaling factor. Since the partitioning differs, we may not be able to reuse hyper-parameters tuned in other frameworks.
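To make this concrete, here is a minimal sketch (not DeepSpeed code; shapes and names are illustrative, and weight decay, bias correction, and clipping are ignored) of the layer-wise trust ratio that LAMB and LARS rely on, contrasted with the single ratio a fused flat chunk would produce:

```python
import torch

def trust_ratio(param: torch.Tensor, update: torch.Tensor) -> float:
    # LAMB/LARS scale each layer's update by ||w|| / ||u||, computed per layer.
    w_norm, u_norm = param.norm(), update.norm()
    if w_norm == 0 or u_norm == 0:
        return 1.0
    return (w_norm / u_norm).item()

# What a layer-wise (non-partitioned) optimizer sees: one ratio per layer.
layers  = [torch.randn(256, 256) for _ in range(4)]
updates = [torch.randn_like(w) * 1e-3 for w in layers]
per_layer_ratios = [trust_ratio(w, u) for w, u in zip(layers, updates)]

# After the parameters are fused into one flat vector and partitioned,
# the optimizer only sees a chunk, so the "layer" norm becomes the norm
# of the whole chunk and the updates change.
flat_w = torch.cat([w.flatten() for w in layers])
flat_u = torch.cat([u.flatten() for u in updates])
fused_ratio = trust_ratio(flat_w, flat_u)

print(per_layer_ratios)  # four different layer-wise ratios
print(fused_ratio)       # a single ratio for the fused chunk
```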

szhengac commented on Oct 28 '20 22:10

@szhengac You are correct: LAMB and LARS implementations that are not aware of ZeRO will not work correctly with ZeRO. This is not a fundamental limitation of optimizer partitioning, though, but rather a limitation of the current LAMB and LARS implementations, which were not designed to work with optimizer partitioning.

By identifying layer boundaries in the big vector partitioned over workers, we can create a LAMB or LARS implementation that works with ZeRO. This has been on our to-do list, but we haven't had the bandwidth to do it. We would be very happy to receive contributions along this line.
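Roughly, such an implementation could record each layer's offsets when the parameters are flattened and then compute per-layer partial norms inside each worker's partition. A sketch under those assumptions (the helper names are made up, and the cross-worker all-reduce of the partial norms is omitted):

```python
import torch

def flatten_with_boundaries(params):
    """Flatten parameters into one buffer and record (start, end) offsets per layer,
    so a ZeRO-aware LAMB/LARS can still reason about layer boundaries."""
    flat = torch.cat([p.detach().flatten() for p in params])
    boundaries, offset = [], 0
    for p in params:
        boundaries.append((offset, offset + p.numel()))
        offset += p.numel()
    return flat, boundaries

def partial_layer_sq_norms(flat_partition, boundaries, part_start, part_end):
    """Squared norm of each layer's slice that falls inside this worker's
    partition [part_start, part_end). A real implementation would all-reduce
    these partial sums across workers to recover the true layer norms."""
    partials = []
    for start, end in boundaries:
        lo, hi = max(start, part_start), min(end, part_end)
        if lo < hi:
            seg = flat_partition[lo - part_start : hi - part_start]
            partials.append(seg.pow(2).sum())
        else:
            partials.append(torch.tensor(0.0))
    return partials
```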

samyam commented on Oct 30 '20 17:10

So why does FusedLAMB in ZeRO work?

yanring commented on Nov 25 '20 05:11

Hi, how did you check that FusedLAMB works with ZeRO? Have you trained an actual model with ZeRO + LAMB, and did it converge normally? Looking forward to your reply.

gongjingcs commented on Jan 04 '21 02:01

I've tried training with LAMB + ZeRO-2 at a large batch size (8k), and its convergence was not as good as with Adam. I suspect this issue may be a major cause.

Next, I want to try computing the parameter norms before the gradient reduction in ZeRO-2 and see if that helps solve the problem.

If DeepSpeed has already solved this issue, I'd be happy if you could tell me! @samyam
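Concretely, since ZeRO-2 still keeps a full copy of the parameters on every rank, my idea is to record the per-layer parameter norms before the gradients are reduce-scattered and hand them to the partitioned optimizer. A rough sketch of what I mean (the names and the `layer_norms=` argument are placeholders, not DeepSpeed APIs):

```python
import torch

def collect_layer_norms(model: torch.nn.Module) -> dict:
    """Record each parameter's norm before gradient reduction/partitioning.

    With ZeRO-2 every rank still holds the full parameters, so the true
    layer-wise norms can be computed locally and later consumed by a
    partitioned LAMB-style optimizer instead of the fused-chunk norm."""
    return {name: p.detach().float().norm().item()
            for name, p in model.named_parameters() if p.requires_grad}

# Hypothetical usage inside a training step, before the gradient reduce-scatter:
# layer_norms = collect_layer_norms(model)
# optimizer.step(layer_norms=layer_norms)  # placeholder API, not DeepSpeed's
```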

Kite0011 commented on Aug 08 '22 07:08

I just realized that I can't simply scale the gradients. Maybe I will pass a [[value, offset], ...] array into the optimizer to scale the final learning rate instead.
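What I have in mind for the [[value, offset], ...] array is roughly the following (just a sketch, with made-up names): each offset marks where a segment of the fused update buffer starts, and the value scales that segment's effective learning rate.

```python
import torch

def apply_segment_scales(flat_update: torch.Tensor, scales):
    """Scale a fused update buffer segment by segment.

    `scales` is a [[value, offset], ...] list: each `value` applies from its
    `offset` up to the next entry's offset (or the end of the buffer)."""
    for i, (value, offset) in enumerate(scales):
        end = scales[i + 1][1] if i + 1 < len(scales) else flat_update.numel()
        flat_update[offset:end].mul_(value)
    return flat_update

# Example: three segments of 100, 200, and 50 elements with different scales.
flat_update = torch.randn(350)
apply_segment_scales(flat_update, [[0.8, 0], [1.2, 100], [0.5, 300]])
```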

Kite0011 commented on Aug 08 '22 08:08

The software issue has been resolved.

xiexbing commented on Sep 14 '23 19:09