`torch.sum` in NPSE
We use the sum across theta-dimensions for NPSE here.
Because of this, we would technically need a different learning rate for different numbers of theta-dimensions. Should we not just use the mean?
If we do this, we will also have to update the code for the `control_variate`.
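A minimal sketch of the scaling argument, not the actual NPSE code: `score_pred` and `score_target` are hypothetical stand-ins for the score network output and its target, but they show how the sum-reduced loss grows with the number of theta-dimensions while the mean-reduced loss does not:

```python
import torch

torch.manual_seed(0)
for dim in (2, 100):
    # Hypothetical stand-ins for the score network output and its target.
    score_pred = torch.randn(64, dim)
    score_target = torch.randn(64, dim)
    per_dim_sq_error = (score_pred - score_target) ** 2
    # Sum over theta-dimensions: the loss (and its gradient) scales with dim.
    loss_sum = per_dim_sq_error.sum(dim=-1).mean()
    # Mean over theta-dimensions: the loss is invariant to dim.
    loss_mean = per_dim_sq_error.mean(dim=-1).mean()
    print(f"dim={dim:3d}  sum-reduced: {loss_sum:.2f}  mean-reduced: {loss_mean:.2f}")
```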
Nothing against it, it's just that the MSE usually uses the sum (because that's the L2 norm).
Not sure why one would need different learning rates; the per-dimension `log_probs` in all the flows are also reduced by a sum, no?
That's true, a flow also sums the log-probs. I think I am fine either way, but I did run into trouble in very high-D parameter spaces (because the gradients add up across dimensions).
The scale of the gradient shouldn't really matter to an Adam optimizer, so I'm not sure that is the issue. I do think taking the mean instead of the sum is good practice, though, as numerical issues are more likely in very high-D problems.
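To illustrate the Adam point, here is a minimal sketch (independent of NPSE) showing that rescaling the loss by a constant factor, which is what switching between mean and sum amounts to, barely changes a single Adam step, since the constant cancels in m_hat / (sqrt(v_hat) + eps):

```python
import torch

torch.manual_seed(0)
for scale in (1.0, 100.0):
    param = torch.nn.Parameter(torch.tensor([1.0, -2.0, 0.5]))
    opt = torch.optim.Adam([param], lr=1e-3)
    # A constant rescaling of the loss, mimicking sum vs. mean reduction.
    loss = scale * (param**2).sum()
    loss.backward()
    # Adam's update is lr * m_hat / (sqrt(v_hat) + eps); a constant factor
    # in the gradient cancels everywhere except in eps.
    opt.step()
    print(f"scale={scale:6.1f}  param after one step: {param.data}")
```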