megatron optimizer precision setting
In verl's megatron_workers, the optimizer precision only supports bf16 and fp16, and earlier versions simply hardcoded it to bf16. For stable training, the optimizer's master weights should be fp32, but that does not appear to be the case at the moment. Can someone explain why?

```python
# TODO: add more optimizer args into config
if self._is_actor:
    optim_config_megatron = init_megatron_optim_config(optim_config, fp16=self.dtype == torch.float16)
    actor_optimizer = get_megatron_optimizer(model=actor_module, config=optim_config_megatron)
    actor_optimizer_scheduler = get_megatron_optimizer_param_scheduler(
        optimizer=actor_optimizer, config=optim_config
    )
```
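For reference, here is a minimal sketch of how such a config could be assembled on top of Megatron-Core's `megatron.core.optimizer.OptimizerConfig` (the helper name `make_optim_config` is hypothetical, and verl's `init_megatron_optim_config` may populate different fields). In Megatron-Core, setting `fp16`/`bf16` selects the mixed-precision optimizer path, which keeps fp32 main copies of the parameters for the update; whether verl's hardcoded wrapper preserves that behavior is exactly what this question is asking about.

```python
import torch
from megatron.core.optimizer import OptimizerConfig

def make_optim_config(lr: float, use_fp16: bool) -> OptimizerConfig:
    # Hypothetical sketch: field names come from Megatron-Core's OptimizerConfig;
    # verl's init_megatron_optim_config may map its optim_config differently.
    # fp16/bf16 describe the model/gradient dtype handled by the mixed-precision
    # optimizer; the weight update itself is applied to fp32 main params.
    return OptimizerConfig(
        optimizer="adam",
        lr=lr,
        weight_decay=0.01,
        clip_grad=1.0,
        fp16=use_fp16,
        bf16=not use_fp16,
        params_dtype=torch.float16 if use_fp16 else torch.bfloat16,
        use_distributed_optimizer=True,
    )
```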
The mixed-precision setup on the FSDP side looks like this:

```python
if mixed_precision_config is not None:
    param_dtype = PrecisionType.to_dtype(mixed_precision_config.get("param_dtype", "bf16"))
    reduce_dtype = PrecisionType.to_dtype(mixed_precision_config.get("reduce_dtype", "fp32"))
    buffer_dtype = PrecisionType.to_dtype(mixed_precision_config.get("buffer_dtype", "fp32"))
else:
    param_dtype = torch.bfloat16
    reduce_dtype = torch.float32
    buffer_dtype = torch.float32

mixed_precision = MixedPrecision(param_dtype=param_dtype, reduce_dtype=reduce_dtype, buffer_dtype=buffer_dtype)
```
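To make the dtype flow explicit, here is a small self-contained sketch (the string-to-dtype mapping is a hypothetical stand-in for verl's `PrecisionType.to_dtype`). Note that `param_dtype` only controls the dtype FSDP casts parameters to for forward/backward compute; the persistent sharded parameters keep the dtype the model was loaded with, and that is what the optimizer later sees.

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# Hypothetical stand-in for verl's PrecisionType.to_dtype (assumed behavior).
_DTYPE_MAP = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}

def to_dtype(name: str) -> torch.dtype:
    return _DTYPE_MAP[name]

mixed_precision_config = {"param_dtype": "bf16", "reduce_dtype": "fp32", "buffer_dtype": "fp32"}

# Same construction as the snippet above: bf16 compute, fp32 gradient reduction
# and buffers. The sharded parameters themselves are not converted to bf16.
mixed_precision = MixedPrecision(
    param_dtype=to_dtype(mixed_precision_config.get("param_dtype", "bf16")),
    reduce_dtype=to_dtype(mixed_precision_config.get("reduce_dtype", "fp32")),
    buffer_dtype=to_dtype(mixed_precision_config.get("buffer_dtype", "fp32")),
)
```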
The optimizer precision should depend on the parameter precision the model is loaded with, and the default there is fp32. Does that mean the optimizer precision of Megatron and FSDP is not aligned?

```python
if role == "actor" and optim_config is not None:
    from verl.utils.torch_functional import get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup

    actor_optimizer = build_optimizer(actor_module_fsdp.parameters(), optim_config)
```
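Assuming `build_optimizer` ultimately constructs a standard `torch.optim` optimizer (an assumption; its internals aren't shown here), the optimizer precision simply follows the dtype of the parameters it is handed, because a torch optimizer allocates its state with `zeros_like(param)`. The toy check below illustrates this with plain `AdamW` on an unwrapped module, not verl code.

```python
import torch

for load_dtype in (torch.float32, torch.bfloat16):
    # Build a module in the "model loading" dtype and an optimizer over it.
    module = torch.nn.Linear(4, 4).to(load_dtype)
    opt = torch.optim.AdamW(module.parameters(), lr=1e-3, foreach=False)

    module(torch.randn(2, 4, dtype=load_dtype)).sum().backward()
    opt.step()

    # Optimizer state inherits the parameter dtype: fp32 params -> fp32 state.
    state = opt.state[next(module.parameters())]
    print(load_dtype, "->", state["exp_avg"].dtype)
```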