
How can DeepSpeed be configured to prevent the merging of parameter groups?

Open Polarisamoon opened this issue 11 months ago • 4 comments

I have re-implemented the optimizer to group parameters and set a different learning rate for each group. However, after using DeepSpeed, all the param_groups are merged into one. How can this be prevented?
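For clarity, this is the kind of grouping I mean (a minimal plain-PyTorch sketch; the module names and learning rates are placeholders):

import torch

# Two parameter groups with different learning rates, before any DeepSpeed wrapping.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
param_groups = [
    {"params": model[0].parameters(), "lr": 5e-5},
    {"params": model[1].parameters(), "lr": 1e-3},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.1)
print(len(optimizer.param_groups))  # 2

The DeepSpeed config I am using: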

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Polarisamoon · Dec 16 '24

@CLL112, DeepSpeed already supports this request. For example, we don't merge weights and biases, which are typically implemented as different param groups. It would be helpful to have a full repro for us to understand what is going on with your scenario.
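For reference, a minimal standalone sketch of separate weight/bias param groups handed to deepspeed.initialize (the model, sizes, and config values are placeholders):

import torch
import deepspeed

# Toy model; weights (with decay) and biases (without) kept in separate groups.
model = torch.nn.Linear(16, 4)
groups = [
    {"params": [p for n, p in model.named_parameters() if n.endswith("weight")],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters() if n.endswith("bias")],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(groups, lr=5e-5)

# Placeholder config; run under the deepspeed launcher.
ds_config = {"train_batch_size": 8, "zero_optimization": {"stage": 3}}
engine, ds_optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
# How many groups the underlying optimizer reports after wrapping:
print(len(optimizer.param_groups))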

tjruwase · Dec 16 '24

> @CLL112, DeepSpeed already supports this request. For example, we don't merge weights and biases, which are typically implemented as different param groups. It would be helpful to have a full repro for us to understand what is going on with your scenario.

I have rewritten the optimizer and set a separate learning rate for the act_fn parameters in the model. Training works well without DeepSpeed, but after switching to DeepSpeed I found that it doesn't work:

# Trainer here is transformers.Trainer; decay_parameters are the names of params that get weight decay
decay_parameters = Trainer.get_decay_parameter_names(None, model)
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in model.named_parameters() if (n in decay_parameters and p.requires_grad and
                                                               'act_fn' not in n)],
        'weight_decay': args.weight_decay,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters() if (n not in decay_parameters and p.requires_grad)],
        'weight_decay': 0.0,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters() if (n in decay_parameters and p.requires_grad and
                                                               'act_fn' in n)],
        'weight_decay': 0.0,
        'lr': 0.5,  # Custom learning rate for act_fn
    },
]

optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(args)
optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

# Debugging optimizer parameter groups
for i, param_group in enumerate(optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")

I printed the relevant parameters:

Param group 0: lr=5e-05, weight_decay=0.1
Param group 1: lr=5e-05, weight_decay=0.0
Param group 2: lr=0.5, weight_decay=0.0

However, inside transformers.Trainer, after self.optimizer.step(), I also checked with:

self.optimizer.step()
# inspect the inner optimizer's parameter groups after the DeepSpeed-wrapped step
for i, param_group in enumerate(self.optimizer.optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")

The output is:

Param group 0: lr=5e-05, weight_decay=0.1

This is strange; param groups 1 and 2 are missing. I am using DeepSpeed ZeRO-3. Does ZeRO-3 change the param groups? The ZeRO-3 configuration is as follows:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Polarisamoon · Dec 17 '24

@CLL112, can you please share a simple but complete repro so we can debug?
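A minimal standalone sketch of such a repro (placeholder model, sizes, and config; a simplified two-group version of the grouping above) would be enough to check whether the groups survive deepspeed.initialize outside of the Trainer:

import torch
import deepspeed

# Toy model with an act_fn submodule that has a learnable parameter (PReLU),
# so the name filter below has something to match.
class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 16)
        self.act_fn = torch.nn.PReLU()

    def forward(self, x):
        return self.act_fn(self.proj(x))

model = Toy()
groups = [
    {"params": [p for n, p in model.named_parameters() if "act_fn" not in n],
     "lr": 5e-5, "weight_decay": 0.1},
    {"params": [p for n, p in model.named_parameters() if "act_fn" in n],
     "lr": 0.5, "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(groups)
print("before:", [(g["lr"], g["weight_decay"]) for g in optimizer.param_groups])

# Placeholder ZeRO-3 config; run under the deepspeed launcher.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 3},
    "zero_allow_untested_optimizer": True,
}
engine, ds_optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
print("after:", [(g["lr"], g["weight_decay"]) for g in optimizer.param_groups])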

tjruwase · Dec 18 '24

I have the same problem. Is there any way to solve it?

onaka-ga-pkpk · Jul 07 '25

@onaka-ga-pkpk, unfortunately we are unable to investigate due to the lack of a repro. Are you able to provide one? Thanks

tjruwase · Jul 07 '25