moose icon indicating copy to clipboard operation
moose copied to clipboard

Norm of ADVariableValue near 0 returns hard to understand exception

Open GiudGiud opened this issue 2 years ago • 13 comments

Bug description

Taking the norm of an ADVariableValue near 0 returns the message in the Discussion below. This is not ideal for user-facing output

Steps to reproduce

ADRealVectorValue(0).norm() should suffice to reproduce this.

Steps to fix

More floating point exceptions in moose to catch this Metaphysicl/libmesh exception

Impact

Better errors for users

Discussed in https://github.com/idaholab/moose/discussions/22072

Originally posted by aarograh September 12, 2022 I'm working on updating submodules and MOOSE environment for my app. The update is a small step forward (moose-libmesh version updated from 2022.03.18 to 2022.03.28) but breaks several of my tests with this error:

Floating point exception signaled (floating point divide by zero)!

To track this down, compile in debug mode, then in gdb do:
  break libmesh_handleFPE
  run ...
  bt

Attaching the debugger and following the instructions given by the error indicates that the error occurs at this line:

if (_velocity[_qp].norm() < 1.e-8)

_velocity is a ADVectorVariableValue object. It seems that in the update, the result of the .norm() function call has changed in some way. So I have 2 questions:

  1. I cannot for the life of me determine how to print out the components of _velocity[_qp]. It looks like it's ultimately a templated libMesh object. I figure if I could just inspect the object a bit, I might be able to determine what changed and fix it.
  2. Any ideas what may have changed with the norm function during that update?

GiudGiud avatar Sep 13 '22 02:09 GiudGiud

This is not a libMesh/MetaPhysicL exception. This is a floating point exception which is not an exception in the C++ sense; it is a signal. You cannot catch it. Perhaps you want to add a libMesh exception to TypeVector::norm? And then if so, we can catch those and emit a mooseError?

lindsayad avatar Sep 14 '22 20:09 lindsayad

It's a good point. I think it's a good idea. We have more and more AD users, and these errors are quite sneaky. Earlier today, we got the same with @tanoret with std::pow(some_ad, 1.5)

GiudGiud avatar Sep 17 '22 04:09 GiudGiud

@roystgnr do you think we should put in exception throwing for operations like TypeVector::norm if all coordinates are 0? This kind of thing is so tricky to make decisions on IMO since we would be performing LIBMESH_DIM new checks for every call to norm. But that's probably premature optimization thinking without any profiling to back up the concern

lindsayad avatar Sep 17 '22 17:09 lindsayad

What's the behavior of taking the norm of a non-AD RealVectorValue? Does it raise an exception or simply returns NaN? I'd prefer NaNs in this case, both for RealVectorValue and ADRealVectorValue.

hugary1995 avatar Sep 19 '22 01:09 hugary1995

Norm of a 0-vector is 0. The derivative is the only thing that is problematic

lindsayad avatar Sep 19 '22 16:09 lindsayad

Sorry, I meant the derivative, which is NaN.

hugary1995 avatar Sep 19 '22 17:09 hugary1995

Reasoning: in tensor_mechanics we decide whether to compute the derivatives based on is_ad, and therefore I would hope we could get consistent behavior. Since for a vector v the hand-coded derivative of its norm w.r.t. itself is v/v.norm(), I would hope the dual numbers of v.norm() to be NaNs as well.

hugary1995 avatar Sep 19 '22 17:09 hugary1995

We enable the following floating point "exceptions" by default in dbg mode:

void enableFPE(bool on)
{
#if !defined(LIBMESH_HAVE_FEENABLEEXCEPT) && defined(LIBMESH_HAVE_XMMINTRIN_H)
  static int flags = 0;
#endif

  if (on)
    {
#ifdef LIBMESH_HAVE_FEENABLEEXCEPT
      feenableexcept(FE_DIVBYZERO | FE_INVALID);

So both the AD and non-AD code (if it is actually doing v/v.norm()) should raise floating point "exceptions" in dbg mode and according to IEEE indeed should result in NaNs for other compilation modes

lindsayad avatar Sep 19 '22 18:09 lindsayad

do you think we should put in exception throwing for operations like TypeVector::norm if all coordinates are 0?

Thinking about it, I can't avoid the conclusion that at some point we ought to have two different behavioral options here:

In one category of use cases, it's entirely reasonable at the low level to return (0, NaN) for the std::norm(0) (value,derivatives). Calculations which then proceed to throw away the derivatives and only use the norm will work fine (albeit slower than they should, but that's what turning SIGFPE on and off helps us debug), and calculations which attempt to use the derivatives too will correctly propagate NaN rather than returning something that might be garbage. If I'm trying to use these derivatives to calculate a sensitivity solve or an adjoint solve, then the fact that I don't actually possess a two-side derivative and I can't calculate a precise sensitivity or adjoint solution is extremely important.

However, that's not what we do for e.g. std::abs(DualNumber) - where we return f'=0 for x=0 - and in the second category of use cases, which IIRC is pretty much all MOOSE code right now, that is clearly the right behavior. If we're just getting a Jacobian to use for a primal solve, we don't actually care if it's an exact Jacobian, we just care that it's close enough to a linearization for us to get as good a quasi-Newton step as we can get, and returning something that's "in the middle of" the set of one-sided derivatives is about as good an approximation as we can get for that purpose, and returning NaN would be something in between an annoying mess (if a kernel has to manually test for NaN around any operation that might not have well-defined derivatives) and an utter disaster (if user code is stuck running with SIGFPE enabled and so doesn't get the chance to test for NaN).

MOOSE in dbg mode is the "utter disaster" case, isn't it? Because enabling SIGFPE and treating it as always an error is a great way to hunt bugs? So we do need to do something about that.

I don't see exception-throwing as the solution, though. If we're not going to catch that exception in user code then we might as well just let the SIGFPE kill the program. And if we are going to catch the exception in user code, we might as well just replace the norm(v) call there with something along the lines of (raw_value(v) == 0 ? ADReal(0) : norm(v)) (with the typing fixed up, and encapsulated as regularized_norm(ADReal) or something. That'll be faster for everybody if we only do those tests where we expect to need them rather than in every single norm() call everywhere, and it should even be faster for the regularized_norm users if we just invoke a test-and-jump rather than of the whole C++ exception-throwing machinery.

The only way I could see an exception as being preferable is if we have cases where norm() is getting called at a low level, and the code calling it doesn't know whether norm(0) should be regularized or should scream-and-die, but code further up the stack does know. ... which doesn't sound entirely implausible, now that I think about it. Do we have cases like that?

roystgnr avatar Sep 19 '22 19:09 roystgnr

Isn't the limit if the derivative inifinity? Why not just use that instead of NaN?

dschwen avatar Sep 19 '22 19:09 dschwen

norm((x,y,z)) in 3D is a cone surface in 4D ... it's easy to picture in 2D, with a 3D cone. If you look at the directional derivative at 0, d(norm)/d(direction) will always be 1, not infinity at all, but because there's no gradient vector which has a dot product of 1 with every possible direction there's no well-defined gradient. The "average" gradient over all direction vectors is 0, same as with the two one-sided derivatives of abs() (which is the same as norm() in 1D).

roystgnr avatar Sep 19 '22 19:09 roystgnr

D'oh, yeah, never mind me. Tip of the cone is definitely not a well defined derivative.

dschwen avatar Sep 19 '22 20:09 dschwen

The only way I could see an exception as being preferable is if we have cases where norm() is getting called at a low level, and the code calling it doesn't know whether norm(0) should be regularized or should scream-and-die, but code further up the stack does know. ... which doesn't sound entirely implausible, now that I think about it. Do we have cases like that?

I feel like most calls to norm are in user code, or at least at the MOOSE framework level and above.

we might as well just replace the norm(v) call there with something along the lines of (raw_value(v) == 0 ? ADReal(0) : norm(v)) (with the typing fixed up, and encapsulated as regularized_norm(ADReal) or something.

I like that

lindsayad avatar Sep 19 '22 23:09 lindsayad