verl icon indicating copy to clipboard operation
verl copied to clipboard

Do these Metrics really need to be scaled?

Open fabianlim opened this issue 4 months ago • 8 comments

I noticed there is some scaling with loss_scale_factor in the dp_actor:

     "actor/pg_loss": pg_loss.detach().item() * loss_scale_factor,

However I am wondering if these scalings are really needed, since the reduction over the metrics is by mean and not sum?

fabianlim avatar Aug 19 '25 12:08 fabianlim

The loss is scaled so that they are parallelism ignorant. Otherwise, you will find that these losses magnitude is different if you use different sp_size, which should not be the case

vermouth1992 avatar Aug 19 '25 15:08 vermouth1992

The loss is scaled so that they are parallelism ignorant. Otherwise, you will find that these losses magnitude is different if you use different sp_size, which should not be the case

@vermouth1992 yes I agree the loss should be scaled for the backward, but this was not my point. My point is only that for reporting, we shouldnt need to scale it becasue there is already a reduction outside by mean.

The key point you need to scale it for the backward is because multiple forward-backward calls will reduce by sum, but that is not the case for reporting.

fabianlim avatar Aug 19 '25 15:08 fabianlim

The loss is scaled so that they are parallelism ignorant. Otherwise, you will find that these losses magnitude is different if you use different sp_size, which should not be the case

@vermouth1992 yes I agree the loss should be scaled for the backward, but this was not my point. My point is only that for reporting, we shouldnt need to scale it becasue there is already a reduction outside by mean.

The key point you need to scale it for the backward is because multiple forward-backward calls will reduce by sum, but that is not the case for reporting.

I see what you mean. Maybe I was wrong. Could you confirm that those values will not change its magnitude when the ulysses sp size changes? Thanks!

vermouth1992 avatar Aug 20 '25 00:08 vermouth1992

@vermouth1992 its kind of roughly the same


> metrics['actor/pg_loss']
[-0.3054201006889343, 0.0, -0.2829209864139557, -0.16481846570968628, -0.15159031748771667]
> self.ulysses_sequence_parallel_size
1


> metrics['actor/pg_loss']
[1.2069861888885498, -0.24875977635383606, -0.2340698093175888, -0.1880173534154892, 0.0, -0.07240534573793411, 0.0, 0.0, -0.1661381721496582]
> self.ulysses_sequence_parallel_size
2

fabianlim avatar Aug 20 '25 02:08 fabianlim

I think the current implementation has some bugs, we'll fix it soon

hiyouga avatar Aug 20 '25 14:08 hiyouga

Agree. Current metrics looks confusing.

Siyuexi avatar Sep 16 '25 14:09 Siyuexi

Agree

Ber666 avatar Nov 12 '25 10:11 Ber666