verl Do these Metrics really need to be scaled?

I noticed there is some scaling with loss_scale_factor in the dp_actor:

     "actor/pg_loss": pg_loss.detach().item() * loss_scale_factor,

However I am wondering if these scalings are really needed, since the reduction over the metrics is by mean and not sum?

Aug 19 '25 12:08 fabianlim

The loss is scaled so that they are parallelism ignorant. Otherwise, you will find that these losses magnitude is different if you use different sp_size, which should not be the case

Aug 19 '25 15:08 vermouth1992

The loss is scaled so that they are parallelism ignorant. Otherwise, you will find that these losses magnitude is different if you use different sp_size, which should not be the case

@vermouth1992 yes I agree the loss should be scaled for the backward, but this was not my point. My point is only that for reporting, we shouldnt need to scale it becasue there is already a reduction outside by mean.

The key point you need to scale it for the backward is because multiple forward-backward calls will reduce by sum, but that is not the case for reporting.

Aug 19 '25 15:08 fabianlim

The loss is scaled so that they are parallelism ignorant. Otherwise, you will find that these losses magnitude is different if you use different sp_size, which should not be the case

@vermouth1992 yes I agree the loss should be scaled for the backward, but this was not my point. My point is only that for reporting, we shouldnt need to scale it becasue there is already a reduction outside by mean.

The key point you need to scale it for the backward is because multiple forward-backward calls will reduce by sum, but that is not the case for reporting.

I see what you mean. Maybe I was wrong. Could you confirm that those values will not change its magnitude when the ulysses sp size changes? Thanks!

Aug 20 '25 00:08 vermouth1992

@vermouth1992 its kind of roughly the same


> metrics['actor/pg_loss']
[-0.3054201006889343, 0.0, -0.2829209864139557, -0.16481846570968628, -0.15159031748771667]
> self.ulysses_sequence_parallel_size
1


> metrics['actor/pg_loss']
[1.2069861888885498, -0.24875977635383606, -0.2340698093175888, -0.1880173534154892, 0.0, -0.07240534573793411, 0.0, 0.0, -0.1661381721496582]
> self.ulysses_sequence_parallel_size
2

Aug 20 '25 02:08 fabianlim

I think the current implementation has some bugs, we'll fix it soon

Aug 20 '25 14:08 hiyouga

Agree. Current metrics looks confusing.

Sep 16 '25 14:09 Siyuexi

Agree

Nov 12 '25 10:11 Ber666