Modelica-Buildings Library (master): Iteration variable is NaN at initialization due to poor causalization of if-equations with many equations per branch
Description
An iteration variable becomes NaN at initialization and the solver then tries to solve with NaN. This leads to continuous errors and warnings in the log file.
The models are listed below:
Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer
Buildings.Fluid.DXSystems.Cooling.AirSource.Examples.SpaceCooling
Simulating these models in OpenModelica leads to the following errors/warnings:
1) Homotopy solver total pivot: Matrix (nearly) singular at initialization.
2) residualFunc1409: Iteration variable xloc[3] is nan.
3) residualFunc1409 failed at time=11880000. For more information please use -lv LOG_NLS.
4) While solving non-linear system an assertion failed during initialization.
5) solver will try to handle division by zero at time 11880000: 1.0 - varSpeDX.watVapEva.XEvaWetBulOut
division leads to inf or nan at time 1.188e+07, (a=1) / (b=0), where divisor b is: 1.0 - varSpeDX.watVapEva.XEvaWetBulOut
6) *** Warning: DX coil performance curves at nominal conditions are outside of bounds. Capacity as a function of temperature is outside of bounds 0.9 to 1.1. The value of the curve fit is 1.21996 Check the coefficients of the function sta[1].perCur.capFunT.
7) Homotopy solver total pivot: Matrix (nearly) singular at initialization.
8) The following assertion has been violated during initialization at time 11880000.000000 ((varSpeDX.watVapEva.nomVal.gamma <= varSpeDX.watVapEva.gammaMax)) --> "*** Warning: In DX coil model, gamma is too large for these coil conditions. Instead of gamma = 1.5, a value of 0.693093, which corresponds to a mass transfer effectiveness of 0.8, will be used. Coil nominal performance data are: nomVal.m_flow_nominal = 82.8363 dX_nominal = XEvaOut_nominal-XEvaWetBulOut_nominal = 0.0088221 - 0.00965846 = -0.000836363 QLat_flow_nominal = -200000 "
9) Homotopy solver total pivot: Matrix (nearly) singular at initialization.
...
The log report for these models is given below: (Source: https://libraries.openmodelica.org/branches/master/Buildings_latest/files/Buildings_latest_Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer.sim)
LOG_NLS | error | residualFunc1409: Iteration variable xloc[3] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_STDOUT | warning | While solving non-linear system an assertion failed during initialization.
| | | | | The non-linear solver tries to solve the problem that could take some time.
| | | | | It could help to provide better start-values for the iteration variables.
| | | | | For more information simulate with -lv LOG_NLS_V
LOG_NLS | error | residualFunc1409: Iteration variable xloc[2] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_NLS | error | residualFunc1409: Iteration variable xloc[3] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_NLS | error | residualFunc1409: Iteration variable xloc[3] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_NLS | error | residualFunc1409: Iteration variable xloc[3] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_ASSERT | warning | [/home/hudson/saved_omc/libraries/.openmodelica/libraries/Buildings 10.0.0-master/Fluid/DXSystems/Cooling/BaseClasses/Functions/warnIfPerformanceOutOfBounds.mo:18:3-24:36:writable]
| | | | *** Warning: DX coil performance curves at nominal conditions are outside of bounds.
| | | | Capacity as a function of temperature is outside of bounds 0.9 to 1.1.
| | | | The value of the curve fit is 1.21996
| | | | Check the coefficients of the function sta[1].perCur.capFunT.
LOG_ASSERT | warning | [/home/hudson/saved_omc/libraries/.openmodelica/libraries/Buildings 10.0.0-master/Fluid/DXSystems/Cooling/BaseClasses/Evaporation.mo:185:3-194:27:writable]
| | | | The following assertion has been violated during initialization at time 11880000.000000
| | | | ((varSpeDX.watVapEva.nomVal.gamma <= varSpeDX.watVapEva.gammaMax)) --> "*** Warning: In DX coil model, gamma is too large for these coil conditions.
| | | | Instead of gamma = 1.5, a value of 0.693093, which
| | | | corresponds to a mass transfer effectiveness of 0.8, will be used.
| | | | Coil nominal performance data are:
| | | | nomVal.m_flow_nominal = 82.8363
| | | | dX_nominal = XEvaOut_nominal-XEvaWetBulOut_nominal = 0.0088221 - 0.00965846 = -0.000836363
| | | | QLat_flow_nominal = -200000
...
The list of warnings reported in the simulation test is given below: (Source: https://libraries.openmodelica.org/branches/master/Buildings_latest/files/Buildings_latest_Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer.err)
1) The model contains alias variables with redundant start and/or conflicting nominal values.
2) Assuming fixed start value for the following 1 variables: varSpeDX.watVapEva.off:DISCRETE(fixed = true protected = true ) "Signal, true when component is off" type: Boolean
...
Additional Information
Probably related to https://github.com/OpenModelica/OpenModelica/issues/9539 and https://github.com/OpenModelica/OpenModelica/issues/8184
Steps to Reproduce
Load the Modelica-Buildings master library (Source: https://github.com/lbl-srg/modelica-buildings/). Simulate the following models separately:
- Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer
- Buildings.Fluid.DXSystems.Cooling.AirSource.Examples.SpaceCooling
Expected Behavior
The initial system should be solved without NaN errors.
Version and OS
- OpenModelica Version: OpenModelica 1.22.0~dev-250-g48ea6c4
- OS: Ubuntu 20.04 focal
- Versions of used Modelica libraries: Modelica 4.0.0, Buildings master (10.0.0)
After a first analysis, the following points can be highlighted:
- both models simulate in OMEdit under Win11 but fail in the CI, probably because the simulation starts slowly and the flag -abortSlowSimulation is set in the CI.
Below are the simulation statistics from OMEdit:
Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer
timer
0.0372366s reading init.xml
0.010129s reading info.xml
0.0011193s [ 0.0%] pre-initialization
0.0552991s [ 0.3%] initialization
0.0009275s [ 0.0%] steps
0.107128s [ 0.7%] solver (excl. callbacks)
0.005612s [ 0.0%] creating output-file
0.952345s [ 5.8%] event-handling
0.0043503s [ 0.0%] overhead
15.2268s [ 93.1%] simulation
16.3536s [100.0%] total
events
64 state events
0 time events
solver: dassl
20592 steps taken
39800 calls of functionODE
15137 evaluations of jacobian
661 error test failures
4323 convergence test failures
13.5827s time of jacobian evaluation
The simulation finished successfully.
Buildings.Fluid.DXSystems.Cooling.AirSource.Examples.SpaceCooling
timer
0.0679077s reading init.xml
0.0166237s reading info.xml
0.0014891s [ 0.0%] pre-initialization
0.0737792s [ 0.4%] initialization
0.0033364s [ 0.0%] steps
0.419421s [ 2.3%] solver (excl. callbacks)
0.0587206s [ 0.3%] creating output-file
4.04322s [ 22.4%] event-handling
0.0412453s [ 0.2%] overhead
13.4221s [ 74.3%] simulation
18.0633s [100.0%] total
events
2356 state events
0 time events
solver: dassl
125770 steps taken
162537 calls of functionODE
33845 evaluations of jacobian
3122 error test failures
0 convergence test failures
9.05001s time of jacobian evaluation
The simulation finished successfully.
- the messages emitted during the simulation highlight that something is close to a singularity, especially at simulation start;
- by reducing the simulation interval and the output interval, the number of messages decreases (or they disappear) and the simulation is faster;
- using the flag LOG_NLS_V, I have not found any macroscopic issue in the solver steps;
- in the warning messages the variable xloc[...] often appears; it seems to be generated by the compiler. @casella, can you help me with this xloc variable? Is it generated by the compiler, and for what?
OpenModelica v1.23.0-dev-413-gfc210a9629 (64-bit), Win11 Pro 64-bit, Buildings master branch (SHA eb4ca9682a5dce538e35e582ba40b7e1738a6a3d)
Thanks for doing the investigation @AndreaBartolini!
xloc is the local name for the array of iteration variables in an algebraic loop inside our generated C code. We do a NaN check at runtime, currently for iteration variables of algebraic loops only. Perhaps we should try to print the actual name of the variable instead of some cryptic index. :sweat_smile:
It is possible that some previous computation led to the NaN but no check was done before entering the algebraic loop computation.
I further checked the SpaceCooling model and I think I have a good lead.
The model has three Boolean variables mulStaDX.watVapEva.off, sinSpeDX.watVapEva.off, and varSpeDX.watVapEva.off, which switch a large number of times (see the 2356 state events reported above), which, I guess, doesn't help the simulation run too well.
At the end of the day, if I run the simulation without -abortSlowSimulation, the number of steps taken by DASSL in Dymola and OMC is very similar and the number of events (2356) is actually the same, so OMC is not faring that badly. BTW, I tried to increase the number of intervals from the default 500 to 5000 and to 50000, but I didn't see any dramatic change, as long as the simulation interval is the same. So, it seems that there is really only some issue in solving some of the algebraic loops in the system.
Also, I'm not sure whether -abortSlowSimulation would abort the simulation anyway, given the large number of discrete events, even if we fixed this issue; we have to investigate this separately, and maybe increase the threshold of that parameter to a larger value. @AnHeuermann, could you direct me to the point in the source code where -abortSlowSimulation is taken into account?
Going back to the singularities, I noticed two fishy things going on. First, we get a large number of runtime warnings like
solver will try to handle division by zero at time 15811296.768: 1.0 - mulStaDX.watVapEva.XEvaWetBulOut
division leads to inf or nan at time 1.58113e+07, (a=1) / (b=0), where divisor b is: 1.0 - mulStaDX.watVapEva.XEvaWetBulOut
which is kind of weird, since mulStaDX.watVapEva.XEvaWetBulOut takes values between 0 and 0.018.
There are also many other warnings such as:
residualFunc3166: Iteration variable xloc[3] is nan.
residualFunc3166 failed at time=15811200.0034178. For more information please use -lv LOG_NLS.
that involve system 3166. I checked it with the Declarative Debugger; it turns out it is a system of 7 nonlinear equations and 7 iteration variables (basically, no tearing). Notice that the unknowns include mulStaDX.watVapEva.XEvaWetBulOut, and that the residuals include equations such as:
(residual) if not mulStaDX.watVapEva.off then mulStaDX.watVapEva.mTotWat_flow - mulStaDX.dxCoi.mWat_flow else (-273.15 + mulStaDX.watVapEva.TEvaWetBulOut) * (1006.0 * (1.0 - mulStaDX.vol.dynBal.medium.Xi[1]) + 1860.0 * mulStaDX.watVapEva.XiSatRefOut + 4184.0 * (mulStaDX.vol.dynBal.medium.Xi[1] - mulStaDX.watVapEva.XiSatRefOut)) - ((-273.15 + mulStaDX.dxCoi.TEvaIn) * (1006.0 * (1.0 - mulStaDX.vol.dynBal.medium.Xi[1]) + 1860.0 * mulStaDX.vol.dynBal.medium.Xi[1]) + 2.5010145e6 * (mulStaDX.vol.dynBal.medium.Xi[1] - mulStaDX.watVapEva.XiSatRefOut)) = 0
or
(residual) if not mulStaDX.watVapEva.off then mulStaDX.watVapEva.dX else mulStaDX.watVapEva.XiSatRefOut - (1.0 - mulStaDX.vol.dynBal.medium.Xi[1]) * mulStaDX.watVapEva.XEvaWetBulOut / (1.0 - mulStaDX.watVapEva.XEvaWetBulOut) = 0
which looks like the one that causes the division by zero. My guess is that this system is the cause of all those warnings. Dymola reports the following figures about strong components:
Sizes of linear systems of equations: {5, 5, 5}
Sizes after manipulation of the linear systems: {0, 0, 0}
Sizes of nonlinear systems of equations: {3, 3, 1, 3, 2, 3, 3, 1, 3, 2, 3, 4, 1, 3, 2, 3}
Sizes after manipulation of the nonlinear systems: {1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1}
where "manipulation" basically means tearing. Note that there is no strong component of size 7.
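For context, tearing takes an algebraic loop, picks a small subset of iteration variables, and computes the remaining unknowns explicitly from them, so the nonlinear solver only iterates on the chosen few. A toy Modelica sketch (hypothetical variables, unrelated to the Buildings models) of a size-3 loop that tears down to a single iteration variable:
model TearingSketch "Toy algebraic loop: three unknowns, torn to one iteration variable"
  Real x(start = 1) "iteration variable chosen by tearing";
  Real y "torn variable, explicit once x is known";
  Real z "torn variable, explicit once x and y are known";
equation
  y = 2*x + 1;          // solvable explicitly for y
  z = x*y;              // solvable explicitly for z
  0 = z + exp(x) - 10;  // residual equation, driven to zero by iterating on x
end TearingSketch;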
I dug further into Dymola's dsmodel.mof file, and this is how Dymola computes mulStaDX.watVapEva.XEvaWetBulOut:
// Nonlinear system of equations
// Tag: simulation.nonlinear[11]
// It depends on the following parameters:
// mulStaDX.watVapEva.nomVal.p_nominal
// It depends on the following timevarying variables:
// mulStaDX.TIn
// mulStaDX.vol.dynBal.medium.Xi[1]
// mulStaDX.watVapEva.on
// Start values for iteration variables of non-linear system of 1 equations:
// mulStaDX.watVapEva.TEvaWetBulOut(start = 288.15)
algorithm // Torn part
mulStaDX.watVapEva.XEvaWetBulOut := (if mulStaDX.watVapEva.on then 0 else
smooth(99, 0.621964713077499/(mulStaDX.watVapEva.nomVal.p_nominal/smooth(99,
611.657*exp(17.2799-4102.99/(mulStaDX.watVapEva.TEvaWetBulOut-35.719)))-
0.378035286922501)));
mulStaDX.watVapEva.XiSatRefOut := (if mulStaDX.watVapEva.on then 0 else (1-
mulStaDX.vol.dynBal.medium.Xi[1])*mulStaDX.watVapEva.XEvaWetBulOut/(1-
mulStaDX.watVapEva.XEvaWetBulOut));
equation // Residual equations
0 = (if mulStaDX.watVapEva.on then mulStaDX.watVapEva.TEvaWetBulOut-293.15
else (mulStaDX.watVapEva.TEvaWetBulOut-273.15)*(1006.0*(1-
mulStaDX.vol.dynBal.medium.Xi[1])+1860.0*mulStaDX.watVapEva.XiSatRefOut+
4184.0*(mulStaDX.vol.dynBal.medium.Xi[1]-mulStaDX.watVapEva.XiSatRefOut))-
((mulStaDX.TIn-273.15)*(1006.0*(1-mulStaDX.vol.dynBal.medium.Xi[1])+1860.0
*mulStaDX.vol.dynBal.medium.Xi[1])+2501014.5*(mulStaDX.vol.dynBal.medium.Xi[1]
-mulStaDX.watVapEva.XiSatRefOut)));
so it manages to isolate a system with three unknowns mulStaDX.watVapEva.TEvaWetBulOut, mulStaDX.watVapEva.XEvaWetBulOut, and mulStaDX.watVapEva.XiSatRefOut instead of seven, and then to tear it down to one iteration variable mulStaDX.watVapEva.TEvaWetBulOut. This is much better than what the OMC backend can manage.
The original equations are found in Buildings.Fluid.DXSystems.Cooling.BaseClasses.Evaporation:
if on then
dX = 0;
mEva_flow = 0;
mTotWat_flow = mWat_flow;
der(m) = -mWat_flow;
TEvaWetBulOut = 293.15;
XEvaWetBulOut = 0;
XiSatRefOut = 0;
else
XiSatRefOut=(1-XEvaOut)*XEvaWetBulOut/(1-XEvaWetBulOut);
XEvaWetBulOut = Buildings.Utilities.Psychrometrics.Functions.X_pSatpphi(
pSat = Buildings.Utilities.Psychrometrics.Functions.saturationPressureLiquid(TEvaWetBulOut),
p = nomVal.p_nominal,
phi = 1);
(TEvaWetBulOut-T_ref) * (
(1-XEvaOut) * cpAir_nominal +
XiSatRefOut * cpSte_nominal +
(XEvaOut-XiSatRefOut) * cpWatLiq_nominal) =
(TEvaOut-T_ref) * (
(1-XEvaOut) * cpAir_nominal +
XEvaOut * cpSte_nominal) +
(XEvaOut-XiSatRefOut) * Buildings.Utilities.Psychrometrics.Constants.h_fg;
dX = smooth(1, noEvent(
Buildings.Utilities.Math.Functions.spliceFunction(
pos=XEvaOut,
neg=XEvaWetBulOut,
x=abs(mAir_flow)-nomVal.m_flow_nominal/2,
deltax=nomVal.m_flow_nominal/3)))
- XEvaWetBulOut;
mEva_flow = -dX * smooth(1, noEvent(
Buildings.Utilities.Math.Functions.spliceFunction(
pos=if abs(mAir_flow) > mAir_flow_small/3 then
abs(mAir_flow) * (1-Modelica.Math.exp(-K2*m*abs(mAir_flow)^(-0.2))) else 0,
neg=K2*mAir_flow_small^(-0.2)*m*mAir_flow^2,
x=abs(mAir_flow)- 2*mAir_flow_small/3,
deltax=2*mAir_flow_small/6)));
der(m) = -mEva_flow;
mTotWat_flow = mWat_flow + mEva_flow;
end if;
So, it seems that the backend is not smart enough in how it couples the various equations in the two branches, so it ends up generating a single strong component. The first question to @phannebohm is: do you think this could be improved?
The second question is: how can the division by (1-XEvaWetBulOut) become a division by zero? This is really odd to me, since XEvaWetBulOut is always much smaller than 1. The only explanation I can find is that XEvaWetBulOut is declared as:
protected Real mulStaDX.watVapEva.XEvaWetBulOut(quantity = "MassFraction", unit = "1", min = 0.0, max = 1.0);
so it may be that during iterations XEvaWetBulOut goes above 1.0 and is then limited by the solver to its maximum value 1.0. Is this feature implemented in OMC's solvers?
If that is the case, then the problems could be easily fixed by putting a more reasonable max value on that variable, say 0.05: at room temperature, it is impossible that the concentration of water vapour in air gets higher than a few percentage points, and it most definitely can never reach 100%.
Hmm, I added max = 0.05 to the declaration of XEvaWetBulOut but I still get divisions by zero. I turned on -lv=LOG_NLS_V; it turns out the first divisions by zero are generated by system 1739, which is the same as 3166, but for the lambda0 initial equations.
The Newton solver starts from the start values, does a first step which seems reasonable, as the scaled residuals are below 1e-4, but then after the second step LAPACK fails and the solver gives up. Then, the hybrid solver starts, tries one iteration that gets mEvaFlow = NaN, then it tries changing the initial values "a bit", which means it changes them by their nominal value (which is not really a bit, it's a lot!), hence setting XEvaWetBulOut = 1. Boom.
@AnHeuermann, I guess this should be improved in the hybrid solver code; tentative values outside the min-max interval should of course never be chosen.
I then also changed the nominal values of the concentrations, not only their max attributes, and indeed the division by zero warnings were gone 😃. However, the "xloc[3] is NaN" warnings (xloc[3] being mEva_flow) were still there.
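For reference, here is a minimal sketch of the kind of declaration change tried above; the attribute values are illustrative, and the actual declaration lives in Buildings.Fluid.DXSystems.Cooling.BaseClasses.Evaporation:
// Illustrative only: tightened max and nominal for the wet bulb outlet mass
// fraction, so a solver perturbation of "one nominal value" can no longer
// push the variable towards 1 and blow up the division by (1 - XEvaWetBulOut).
Modelica.Units.SI.MassFraction XEvaWetBulOut(
  min = 0,
  max = 0.05,
  nominal = 0.01) "Water vapour mass fraction at wet bulb conditions at the coil outlet";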
We could go down the rabbit hole of adding min-max-nominal values to all those 7 iteration variables, and then improve the solvers so that they actually make good use of that information. But I'm not sure this is the right way to go, though it wouldn't hurt if the homotopy solver stayed within bounds.
The bottom line is that the system coming from the 7 conditional equations should really be causalized and torn properly, as Dymola does. If we did, we'd get:
- one torn system with 1 iteration variable mulStaDX.watVapEva.TEvaWetBulOut and two torn variables mulStaDX.watVapEva.XEvaWetBulOut and mulStaDX.watVapEva.XiSatRefOut
- one explicitly solved equation to compute der(mulStaDX.watVapEva.m)
- one explicitly solved equation to compute mulStaDX.watVapEva.dX
- one explicitly solved equation to compute mulStaDX.watVapEva.mEva_flow
- one explicitly solved equation to compute mTotWat_flow
This would be way more robust and less troublesome to handle than getting the 7-unknown nonlinear system to converge properly: as I understand it, the behaviour of the torn nonlinear system is very smooth w.r.t. its temperature iteration variable, and there is no need for initial guesses for the remaining 6 unknowns, which are computed explicitly.
@phannebohm the ball is yours 😄
> xloc is the local name for the array of iteration variables in an algebraic loop inside our generated C code. We do a NaN check at runtime, currently for iteration variables of algebraic loops only. It is possible that some previous computation led to the NaN but no check was done before entering the algebraic loop computation.
As I understand it, the NaN was generated while computing the residuals during the Newton iterations.
> The Newton solver starts from the start values, does a first step which seems reasonable, as the scaled residuals are below 1e-4, but then after the second step LAPACK fails and the solver gives up. Then, the hybrid solver starts, tries one iteration that gets mEvaFlow = NaN, then it tries changing the initial values "a bit", which means it changes them by their nominal value (which is not really a bit, it's a lot!), hence setting XEvaWetBulOut = 1. Boom.
> We could go down the rabbit hole of adding min-max-nominal values to all those 7 iteration variables, and then improve the solvers so that they actually make good use of that information. But I'm not sure this is the right way to go, though it wouldn't hurt if the homotopy solver stayed within bounds.
This behaviour where the nominal values get selected for initial values and cause singularities is a very interesting observation! We also sometimes get weird initialisation problems that we don't understand. Next time I'll add this trick to the toolbox, and see if tweaking the nominal values helps!
I guess what I'm saying is: improving the solver behaviour in this respect still seems worthwhile! :-)
@bilderbuchi see #12193.
> So, it seems that the backend is not smart enough in how it couples the various equations in the two branches, so it ends up generating a single strong component. The first question to @phannebohm is: do you think this could be improved?
AFAIK, the backend doesn't even try to split the IF-equation currently; the whole IF-equation becomes a fully connected subgraph for the matching, i.e., "every variable occurs in every equation" in the eyes of the compiler.
This is something @kabdelhak and I talked about for the new backend. Splitting the IF-equation into single equations would make the adjacency much more sparse, and we could even solve some parts explicitly, as in SpaceCooling.
If we knew which equations belong together across all IF-branches, then we could easily split the IF-equation into size-1 equations. There might be some heuristics to get this right some of the time. But if we are wrong, it would probably be difficult to change that decision after causalization.
Another approach might be to have a kind of tearing that, after the IF-equation is identified as part of an algebraic loop, tries to find a sorting for each branch and then perhaps split parts of the IF-equation.
Not sure when we will have time to work on this, but hopefully it's sooner rather than later.
I guess some simple heuristics could already do wonders. If you have this pattern
if <cond> then
v1 = <expr1>;
v2 = <expr2>;
v3 = <expr3>;
<other equations1>
else
v3 = <expr6>;
v1 = <expr4>;
<other equations2>
v2 = <expr5>;
end if;
then you could replace it with
v1 = if <cond> then <expr1> else <expr4>;
v2 = if <cond> then <expr2> else <expr5>;
v3 = if <cond> then <expr3> else <expr6>;
if <cond> then
<other equations1>
else
<other equations2>
end if;
In fact, in models such as the one mentioned in this ticket, the second form is what you would actually want to write, but you use conditional equations as shorthand.
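Applied to the Evaporation equations quoted earlier, the rewrite would look roughly like this (a hand-written sketch of the target form, not the output of any tool; the implicit wet bulb energy balance is kept as a single residual equation, matching the torn structure Dymola finds):
// Each unknown is now paired with a single equation, valid in both branches.
XEvaWetBulOut = if on then 0 else
  Buildings.Utilities.Psychrometrics.Functions.X_pSatpphi(
    pSat = Buildings.Utilities.Psychrometrics.Functions.saturationPressureLiquid(TEvaWetBulOut),
    p = nomVal.p_nominal,
    phi = 1);
XiSatRefOut = if on then 0 else (1 - XEvaOut)*XEvaWetBulOut/(1 - XEvaWetBulOut);
// Implicit energy balance at the wet bulb point, as one residual equation:
0 = if on then TEvaWetBulOut - 293.15 else
  (TEvaWetBulOut - T_ref)*((1 - XEvaOut)*cpAir_nominal
    + XiSatRefOut*cpSte_nominal + (XEvaOut - XiSatRefOut)*cpWatLiq_nominal)
  - ((TEvaOut - T_ref)*((1 - XEvaOut)*cpAir_nominal + XEvaOut*cpSte_nominal)
    + (XEvaOut - XiSatRefOut)*Buildings.Utilities.Psychrometrics.Constants.h_fg);
dX = if on then 0 else smooth(1, noEvent(
  Buildings.Utilities.Math.Functions.spliceFunction(
    pos = XEvaOut,
    neg = XEvaWetBulOut,
    x = abs(mAir_flow) - nomVal.m_flow_nominal/2,
    deltax = nomVal.m_flow_nominal/3))) - XEvaWetBulOut;
mEva_flow = if on then 0 else -dX*smooth(1, noEvent(
  Buildings.Utilities.Math.Functions.spliceFunction(
    pos = if abs(mAir_flow) > mAir_flow_small/3 then
      abs(mAir_flow)*(1 - Modelica.Math.exp(-K2*m*abs(mAir_flow)^(-0.2))) else 0,
    neg = K2*mAir_flow_small^(-0.2)*m*mAir_flow^2,
    x = abs(mAir_flow) - 2*mAir_flow_small/3,
    deltax = 2*mAir_flow_small/6)));
der(m) = if on then -mWat_flow else -mEva_flow;
mTotWat_flow = if on then mWat_flow else mWat_flow + mEva_flow;
With this form, the matching sees one equation per variable; only TEvaWetBulOut (through the residual), XEvaWetBulOut, and XiSatRefOut remain coupled, which is exactly the size-3 system (torn down to 1 iteration variable) that Dymola reports.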