
How can DeepSpeed be configured to prevent the merging of parameter groups?

Open Polarisamoon opened this issue 11 months ago • 4 comments

I have re-implemented the optimizer to group parameters and set a different learning rate for each group. However, after using DeepSpeed, all the param_groups are merged into one. How can this be prevented?
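For clarity, this is the kind of grouping I mean (a minimal plain-PyTorch sketch; the module names and learning rates are placeholders):

import torch

# Two parameter groups with different learning rates, before any DeepSpeed wrapping.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
param_groups = [
    {"params": model[0].parameters(), "lr": 5e-5},
    {"params": model[1].parameters(), "lr": 1e-3},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.1)
print(len(optimizer.param_groups))  # 2

The DeepSpeed config I am using: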

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Polarisamoon · Dec 16 '24

@CLL112, DeepSpeed already supports this request. For example, we don't merge weights and biases, which are typically implemented as different param groups. It would be helpful to have a full repro for us to understand what is going on with your scenario.
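For reference, a minimal standalone sketch of separate weight/bias param groups handed to deepspeed.initialize (the model, sizes, and config values are placeholders):

import torch
import deepspeed

# Toy model; weights (with decay) and biases (without) kept in separate groups.
model = torch.nn.Linear(16, 4)
groups = [
    {"params": [p for n, p in model.named_parameters() if n.endswith("weight")],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters() if n.endswith("bias")],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(groups, lr=5e-5)

# Placeholder config; run under the deepspeed launcher.
ds_config = {"train_batch_size": 8, "zero_optimization": {"stage": 3}}
engine, ds_optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
# How many groups the underlying optimizer reports after wrapping:
print(len(optimizer.param_groups))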

tjruwase · Dec 16 '24

> @CLL112, DeepSpeed already supports this request. For example, we don't merge weights and biases, which are typically implemented as different param groups. It would be helpful to have a full repro for us to understand what is going on with your scenario.

I have rewritten the optimizer and set a separate learning rate for the act_fn parameters in the model. Training works well without DeepSpeed, but after switching to DeepSpeed I found that it doesn't work:

# Trainer here is transformers.Trainer; decay_parameters are the names of params that get weight decay
decay_parameters = Trainer.get_decay_parameter_names(None, model)
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in model.named_parameters() if (n in decay_parameters and p.requires_grad and
                                                               'act_fn' not in n)],
        'weight_decay': args.weight_decay,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters() if (n not in decay_parameters and p.requires_grad)],
        'weight_decay': 0.0,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters() if (n in decay_parameters and p.requires_grad and
                                                               'act_fn' in n)],
        'weight_decay': 0.0,
        'lr': 0.5,  # Custom learning rate for act_fn
    },
]

optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(args)
optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

# Debugging optimizer parameter groups
for i, param_group in enumerate(optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")

I printed the relevant parameters:

Param group 0: lr=5e-05, weight_decay=0.1
Param group 1: lr=5e-05, weight_decay=0.0
Param group 2: lr=0.5, weight_decay=0.0

However, inside transformers.Trainer, after self.optimizer.step(), I also checked with:

self.optimizer.step()
# inspect the inner optimizer's parameter groups after the DeepSpeed-wrapped step
for i, param_group in enumerate(self.optimizer.optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")

The output is:

Param group 0: lr=5e-05, weight_decay=0.1

This is strange; param groups 1 and 2 are missing. I am using DeepSpeed ZeRO-3. Does ZeRO-3 change the param groups? The ZeRO-3 configuration is as follows:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Polarisamoon · Dec 17 '24

@CLL112, can you please share a simple but complete repro so we can debug?
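A minimal standalone sketch of such a repro (placeholder model, sizes, and config; a simplified two-group version of the grouping above) would be enough to check whether the groups survive deepspeed.initialize outside of the Trainer:

import torch
import deepspeed

# Toy model with an act_fn submodule that has a learnable parameter (PReLU),
# so the name filter below has something to match.
class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 16)
        self.act_fn = torch.nn.PReLU()

    def forward(self, x):
        return self.act_fn(self.proj(x))

model = Toy()
groups = [
    {"params": [p for n, p in model.named_parameters() if "act_fn" not in n],
     "lr": 5e-5, "weight_decay": 0.1},
    {"params": [p for n, p in model.named_parameters() if "act_fn" in n],
     "lr": 0.5, "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(groups)
print("before:", [(g["lr"], g["weight_decay"]) for g in optimizer.param_groups])

# Placeholder ZeRO-3 config; run under the deepspeed launcher.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 3},
    "zero_allow_untested_optimizer": True,
}
engine, ds_optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
print("after:", [(g["lr"], g["weight_decay"]) for g in optimizer.param_groups])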

tjruwase · Dec 18 '24

I have the same problem. Is there any way to solve it?

onaka-ga-pkpk · Jul 07 '25

@onaka-ga-pkpk, unfortunately we are unable to investigate due to the lack of a repro. Are you able to provide one? Thanks

tjruwase · Jul 07 '25