`torch.sum` in NPSE
We use the sum across theta-dimensions for NPSE here.
Because of this, we would technically need a different learning rate for different numbers of theta-dimensions. Should we not just use the mean?
If we do this, we will also have to update the code for the `control_variate`.
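A minimal sketch of the scaling argument, not the actual NPSE code: `score_pred` and `score_target` are hypothetical stand-ins for the score network output and its target, but they show how the sum-reduced loss grows with the number of theta-dimensions while the mean-reduced loss does not:

```python
import torch

torch.manual_seed(0)
for dim in (2, 100):
    # Hypothetical stand-ins for the score network output and its target.
    score_pred = torch.randn(64, dim)
    score_target = torch.randn(64, dim)
    per_dim_sq_error = (score_pred - score_target) ** 2
    # Sum over theta-dimensions: the loss (and its gradient) scales with dim.
    loss_sum = per_dim_sq_error.sum(dim=-1).mean()
    # Mean over theta-dimensions: the loss is invariant to dim.
    loss_mean = per_dim_sq_error.mean(dim=-1).mean()
    print(f"dim={dim:3d}  sum-reduced: {loss_sum:.2f}  mean-reduced: {loss_mean:.2f}")
```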
Nothing against it, it's just that the MSE usually uses the sum (because that's the L2 norm).
Not sure why one would need different learning rates; the per-dimension `log_probs` in all the flows are also reduced by a sum, no?
That's true, a flow also sums the log-probs. I think I am fine either way, but I did run into trouble in very high-D parameter spaces (because the gradients add up across dimensions).
The scale of the gradient shouldn't really matter to an Adam optimizer, so I'm not sure that is the issue. I do think taking the mean instead of the sum is good practice, though, as numerical issues are more likely in very high-D problems.
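To illustrate the Adam point, here is a minimal sketch (independent of NPSE) showing that rescaling the loss by a constant factor, which is what switching between mean and sum amounts to, barely changes a single Adam step, since the constant cancels in m_hat / (sqrt(v_hat) + eps):

```python
import torch

torch.manual_seed(0)
for scale in (1.0, 100.0):
    param = torch.nn.Parameter(torch.tensor([1.0, -2.0, 0.5]))
    opt = torch.optim.Adam([param], lr=1e-3)
    # A constant rescaling of the loss, mimicking sum vs. mean reduction.
    loss = scale * (param**2).sum()
    loss.backward()
    # Adam's update is lr * m_hat / (sqrt(v_hat) + eps); a constant factor
    # in the gradient cancels everywhere except in eps.
    opt.step()
    print(f"scale={scale:6.1f}  param after one step: {param.data}")
```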