Yu Chin Fabian Lim

Results: 4 issues by Yu Chin Fabian Lim

# What does this PR do?

Fixes #29425. Also please refer to the accompanying PR https://github.com/huggingface/accelerate/pull/2531, which implements an extra control `sync_each_batch` for `GradientAccumulationPlugin`. Before these changes, `GradientAccumulationPlugin` is configured...
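For context, a minimal sketch of how such a plugin is passed to `Accelerator`; the `num_steps=4` value is illustrative, while `sync_each_batch` is the flag added by the accompanying PR:

```python
from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

# Accumulate gradients over 4 micro-batches; sync_each_batch=True forces a
# gradient synchronization on every batch instead of only on the final one.
plugin = GradientAccumulationPlugin(num_steps=4, sync_each_batch=True)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)

# Training then wraps each step in `with accelerator.accumulate(model): ...`
```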

# What does this PR do?

Currently in `FullyShardedDataParallelPlugin`, the `param_init_fn` [is set when `sync_module_states=True`](https://github.com/huggingface/accelerate/blob/6f79b63b865a33b92a3f1c9e2562b88ee7a4d89d/src/accelerate/utils/dataclasses.py#L1699). This is required by `FSDP` to initialize the shards' (i.e. rank > 0) params in...
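For context, a minimal sketch (using raw PyTorch FSDP rather than the plugin, with illustrative names, assuming a recent PyTorch) of the kind of `param_init_fn` in question: non-zero ranks build the model on the meta device, `param_init_fn` materializes those params, and `sync_module_states=True` broadcasts rank 0's weights into them.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def param_init_fn(module: torch.nn.Module) -> None:
    # Materialize meta-device parameters/buffers as uninitialized tensors on
    # the local GPU; their values are then overwritten by the rank-0 broadcast
    # that sync_module_states=True performs.
    module.to_empty(device=torch.device("cuda"), recurse=False)

# Inside a distributed process group one would then wrap the model as:
# model = FSDP(model, param_init_fn=param_init_fn, sync_module_states=True)
```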

### Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: ...
```


I noticed there is some scaling with `loss_scale_factor` in the `dp_actor`:

- [pg_loss](https://github.com/volcengine/verl/blob/afd759789bd8b80b692361ca971758a0f34d75da/verl/workers/actor/dp_actor.py#L477)
- [kl_loss](https://github.com/volcengine/verl/blob/afd759789bd8b80b692361ca971758a0f34d75da/verl/workers/actor/dp_actor.py#L465)

```python
"actor/pg_loss": pg_loss.detach().item() * loss_scale_factor,
```

However, I am wondering if these scalings are really...
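To make the question concrete, here is a minimal numeric sketch (names are illustrative, not verl's actual internals) of one common reason for such a factor: each micro-batch metric is weighted by its share of the full batch, so that the accumulated logged value reproduces the batch-level mean.

```python
# Per-micro-batch mean losses and the number of samples in each micro-batch.
micro_batch_losses = [2.0, 3.0, 1.0, 2.0]
micro_batch_sizes = [8, 8, 8, 8]
total = sum(micro_batch_sizes)

logged = 0.0
for loss, n in zip(micro_batch_losses, micro_batch_sizes):
    loss_scale_factor = n / total       # this micro-batch's share of the batch
    logged += loss * loss_scale_factor  # accumulate the weighted metric

# With equal micro-batches this reduces to the plain mean of the losses.
assert abs(logged - sum(micro_batch_losses) / len(micro_batch_losses)) < 1e-9
```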