How can DeepSpeed be configured to prevent the merging of parameter groups?
I have re-implemented the optimizer to group parameters and set a different learning rate for each group. However, after switching to DeepSpeed, all of the param_groups are merged into one. How can this be prevented? My DeepSpeed config is:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
@CLL112, DeepSpeed already supports this request. For example, we don't merge weights and biases, which are typically implemented as different param groups. It would be helpful to have a full repro for us to understand what is going on in your scenario.
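For illustration, here is a minimal sketch (not from this issue) of passing an optimizer with two param groups, using the same learning rates as above, to deepspeed.initialize and then printing the groups. The tiny model, the trimmed config, and the getattr-based unwrapping are assumptions; the exact wrapper attributes can differ between DeepSpeed versions.

# Hedged sketch: pass an optimizer with two param groups to deepspeed.initialize
# and inspect the groups afterwards. Run under the deepspeed launcher.
import torch
import deepspeed

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 2))

param_groups = [
    {"params": model[0].parameters(), "lr": 5e-5, "weight_decay": 0.1},
    {"params": model[1].parameters(), "lr": 0.5, "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(param_groups)

ds_config = {
    # trimmed config for the sketch, not the full ZeRO-3 config from this issue
    "train_batch_size": 8,
    "zero_optimization": {"stage": 3},
}

engine, ds_optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)

# The returned optimizer wraps the client AdamW; depending on the version the
# original groups are visible on the wrapper itself or on its inner .optimizer.
inner = getattr(ds_optimizer, "optimizer", ds_optimizer)
for i, group in enumerate(inner.param_groups):
    print(f"group {i}: lr={group['lr']}, weight_decay={group['weight_decay']}")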
I have rewritten the optimizer and set a separate learning rate for the act_fn parameters in the model. Training works well without DeepSpeed, but with DeepSpeed it does not:
decay_parameters = Trainer.get_decay_parameter_names(None, model)
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in model.named_parameters()
                   if (n in decay_parameters and p.requires_grad and 'act_fn' not in n)],
        'weight_decay': args.weight_decay,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters()
                   if (n not in decay_parameters and p.requires_grad)],
        'weight_decay': 0.0,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters()
                   if (n in decay_parameters and p.requires_grad and 'act_fn' in n)],
        'weight_decay': 0.0,
        'lr': 0.5,  # Custom learning rate for act_fn
    },
]
optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(args)
optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

# Debugging optimizer parameter groups
for i, param_group in enumerate(optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")
I printed the relevant parameters:
Param group 0: lr=5e-05, weight_decay=0.1
Param group 1: lr=5e-05, weight_decay=0.0
Param group 2: lr=0.5, weight_decay=0.0
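For reference, a minimal sketch of how such a grouped optimizer can be handed to the Hugging Face Trainer via its optimizers argument; the TrainingArguments values, the config file name, and the dataset are placeholders rather than the actual setup from this issue, and whether the DeepSpeed config's own optimizer block then takes precedence is exactly the kind of detail a full repro would settle.

# Hedged sketch: pass the pre-built grouped optimizer to the Trainer.
# Values below are placeholders, not the reporter's script.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    learning_rate=5e-5,
    weight_decay=0.1,
    deepspeed="ds_zero3.json",  # placeholder name for the ZeRO-3 config shown below
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # placeholder dataset
    optimizers=(optimizer, None),  # grouped optimizer built above, default scheduler
)
trainer.train()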
However, in transformers' Trainer, after self.optimizer.step(), I also checked it with:
self.optimizer.step()
for i, param_group in enumerate(self.optimizer.optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")
The output is:
Param group 0: lr=5e-05, weight_decay=0.1
This is strange: param groups 1 and 2 are missing. I am using DeepSpeed ZeRO-3. Does ZeRO-3 change the param groups? The ZeRO-3 configuration is as follows:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
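One hedged way to compare the groups at every layer of the optimizer wrapping once DeepSpeed is enabled; the recursive getattr walk below is an assumption about the nesting, which varies across DeepSpeed and Accelerate versions:

# Hedged sketch: walk the optimizer wrappers and print the param groups seen at
# each layer, rather than assuming a fixed .optimizer.optimizer chain.
def dump_param_groups(opt, depth=0):
    groups = getattr(opt, "param_groups", None)
    if groups is not None:
        for i, g in enumerate(groups):
            print(f"{'  ' * depth}{type(opt).__name__} group {i}: "
                  f"lr={g.get('lr')}, weight_decay={g.get('weight_decay')}")
    inner = getattr(opt, "optimizer", None)
    if inner is not None and inner is not opt:
        dump_param_groups(inner, depth + 1)

# e.g. inside the training loop, right after self.optimizer.step():
# dump_param_groups(self.optimizer)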
@CLL112, can you please share a simple but complete repro so we can debug this?
I have the same problem. Is there any way to solve it?
@onaka-ga-pkpk unfortunately, we are unable to investigate without a repro. Are you able to provide one? Thanks