Modelica-Buildings Library (master): Iteration variable is NaN at initialization due to poor causalization of if-equations with many equations per branch
Description
An iteration variable becomes NaN at initialization and the solver then tries to solve with NaN. This leads to continuous errors and warnings in the log file.
The models are listed below:
Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer
Buildings.Fluid.DXSystems.Cooling.AirSource.Examples.SpaceCooling
Simulating these models in OpenModelica leads to the following errors/warnings:
1) Homotopy solver total pivot: Matrix (nearly) singular at initialization.
2) residualFunc1409: Iteration variable xloc[3] is nan.
3) residualFunc1409 failed at time=11880000. For more information please use -lv LOG_NLS.
4) While solving non-linear system an assertion failed during initialization.
5) solver will try to handle division by zero at time 11880000: 1.0 - varSpeDX.watVapEva.XEvaWetBulOut
division leads to inf or nan at time 1.188e+07, (a=1) / (b=0), where divisor b is: 1.0 - varSpeDX.watVapEva.XEvaWetBulOut
6) *** Warning: DX coil performance curves at nominal conditions are outside of bounds. Capacity as a function of temperature is outside of bounds 0.9 to 1.1. The value of the curve fit is 1.21996 Check the coefficients of the function sta[1].perCur.capFunT.
7) Homotopy solver total pivot: Matrix (nearly) singular at initialization.
8) The following assertion has been violated during initialization at time 11880000.000000 ((varSpeDX.watVapEva.nomVal.gamma <= varSpeDX.watVapEva.gammaMax)) --> "*** Warning: In DX coil model, gamma is too large for these coil conditions. Instead of gamma = 1.5, a value of 0.693093, which corresponds to a mass transfer effectiveness of 0.8, will be used. Coil nominal performance data are: nomVal.m_flow_nominal = 82.8363 dX_nominal = XEvaOut_nominal-XEvaWetBulOut_nominal = 0.0088221 - 0.00965846 = -0.000836363 QLat_flow_nominal = -200000 "
9) Homotopy solver total pivot: Matrix (nearly) singular at initialization.
...
The log report for these models is given below: (Source: https://libraries.openmodelica.org/branches/master/Buildings_latest/files/Buildings_latest_Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer.sim)
LOG_NLS | error | residualFunc1409: Iteration variable xloc[3] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_STDOUT | warning | While solving non-linear system an assertion failed during initialization.
| | | | | The non-linear solver tries to solve the problem that could take some time.
| | | | | It could help to provide better start-values for the iteration variables.
| | | | | For more information simulate with -lv LOG_NLS_V
LOG_NLS | error | residualFunc1409: Iteration variable xloc[2] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_NLS | error | residualFunc1409: Iteration variable xloc[3] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_NLS | error | residualFunc1409: Iteration variable xloc[3] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_NLS | error | residualFunc1409: Iteration variable xloc[3] is nan.
LOG_ASSERT | debug | residualFunc1409 failed at time=11880000.
| | | | For more information please use -lv LOG_NLS.
LOG_ASSERT | warning | [/home/hudson/saved_omc/libraries/.openmodelica/libraries/Buildings 10.0.0-master/Fluid/DXSystems/Cooling/BaseClasses/Functions/warnIfPerformanceOutOfBounds.mo:18:3-24:36:writable]
| | | | *** Warning: DX coil performance curves at nominal conditions are outside of bounds.
| | | | Capacity as a function of temperature is outside of bounds 0.9 to 1.1.
| | | | The value of the curve fit is 1.21996
| | | | Check the coefficients of the function sta[1].perCur.capFunT.
LOG_ASSERT | warning | [/home/hudson/saved_omc/libraries/.openmodelica/libraries/Buildings 10.0.0-master/Fluid/DXSystems/Cooling/BaseClasses/Evaporation.mo:185:3-194:27:writable]
| | | | The following assertion has been violated during initialization at time 11880000.000000
| | | | ((varSpeDX.watVapEva.nomVal.gamma <= varSpeDX.watVapEva.gammaMax)) --> "*** Warning: In DX coil model, gamma is too large for these coil conditions.
| | | | Instead of gamma = 1.5, a value of 0.693093, which
| | | | corresponds to a mass transfer effectiveness of 0.8, will be used.
| | | | Coil nominal performance data are:
| | | | nomVal.m_flow_nominal = 82.8363
| | | | dX_nominal = XEvaOut_nominal-XEvaWetBulOut_nominal = 0.0088221 - 0.00965846 = -0.000836363
| | | | QLat_flow_nominal = -200000
...
The list of warnings reported in the simulation test is given below: (Source: https://libraries.openmodelica.org/branches/master/Buildings_latest/files/Buildings_latest_Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer.err)
1) The model contains alias variables with redundant start and/or conflicting nominal values.
2) Assuming fixed start value for the following 1 variables: varSpeDX.watVapEva.off:DISCRETE(fixed = true protected = true ) "Signal, true when component is off" type: Boolean
...
Additional Information
Probably related to https://github.com/OpenModelica/OpenModelica/issues/9539 and https://github.com/OpenModelica/OpenModelica/issues/8184
Steps to Reproduce
Load the Modelica-Buildings master library (Source: https://github.com/lbl-srg/modelica-buildings/). Simulate the following models separately:
- Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer
- Buildings.Fluid.DXSystems.Cooling.AirSource.Examples.SpaceCooling
Expected Behavior
The initial system should be solved without NaN errors.
Version and OS
- OpenModelica Version: OpenModelica 1.22.0~dev-250-g48ea6c4
- OS: Ubuntu 20.04 focal
- Versions of used Modelica libraries: Modelica 4.0.0, Buildings master (10.0.0)
After a first analysis, the following points can be highlighted:
- both models simulate in OMEdit under Win11 but fail in the CI, probably because the simulation starts slowly and the flag -abortSlowSimulation is set in the CI.
Below are the simulation statistics from OMEdit:
Buildings.Applications.DataCenters.DXCooled.Examples.DXCooledAirsideEconomizer
timer
0.0372366s reading init.xml
0.010129s reading info.xml
0.0011193s [ 0.0%] pre-initialization
0.0552991s [ 0.3%] initialization
0.0009275s [ 0.0%] steps
0.107128s [ 0.7%] solver (excl. callbacks)
0.005612s [ 0.0%] creating output-file
0.952345s [ 5.8%] event-handling
0.0043503s [ 0.0%] overhead
15.2268s [ 93.1%] simulation
16.3536s [100.0%] total
events
64 state events
0 time events
solver: dassl
20592 steps taken
39800 calls of functionODE
15137 evaluations of jacobian
661 error test failures
4323 convergence test failures
13.5827s time of jacobian evaluation
The simulation finished successfully.
Buildings.Fluid.DXSystems.Cooling.AirSource.Examples.SpaceCooling
timer
0.0679077s reading init.xml
0.0166237s reading info.xml
0.0014891s [ 0.0%] pre-initialization
0.0737792s [ 0.4%] initialization
0.0033364s [ 0.0%] steps
0.419421s [ 2.3%] solver (excl. callbacks)
0.0587206s [ 0.3%] creating output-file
4.04322s [ 22.4%] event-handling
0.0412453s [ 0.2%] overhead
13.4221s [ 74.3%] simulation
18.0633s [100.0%] total
events
2356 state events
0 time events
solver: dassl
125770 steps taken
162537 calls of functionODE
33845 evaluations of jacobian
3122 error test failures
0 convergence test failures
9.05001s time of jacobian evaluation
The simulation finished successfully.
- the messages emitted during the simulation highlight that something is close to a singularity, especially at simulation start;
- by reducing the simulation interval and the output interval, the number of messages decreases (or they disappear) and the simulation is faster;
- using the flag LOG_NLS_V, I have not found any macroscopic issue in the solver steps;
- in the warning messages the variable xloc[...] often appears; it seems to be generated by the compiler. @casella, can you help me with this xloc variable? Is it generated by the compiler, and for what?
OpenModelica v1.23.0-dev-413-gfc210a9629 (64-bit), Win11 Pro 64-bit, Buildings master branch (SHA eb4ca9682a5dce538e35e582ba40b7e1738a6a3d)
Thanks for doing the investigation @AndreaBartolini!
xloc is the local name for the array of iteration variables in an algebraic loop inside our generated C code. We do a NaN check at runtime, currently for iteration variables of algebraic loops only. Perhaps we should try to print the actual name of the variable instead of some cryptic index. :sweat_smile:
It is possible that some previous computation led to the NaN but no check was done before entering the algebraic loop computation.
I further checked the SpaceCooling model and I think I have a good lead.
The model has three Boolean variables mulStaDX.watVapEva.off, sinSpeDX.watVapEva.off, and varSpeDX.watVapEva.off, which switch a large number of times (see the 2356 state events reported above), which, I guess, doesn't help the simulation run too well.
At the end of the day, if I run the simulation without -abortSlowSimulation, the number of steps taken by DASSL in Dymola and OMC is very similar and the number of events (2356) is actually the same, so OMC is not faring that badly. BTW, I tried to increase the number of intervals from the default 500 to 5000 and to 50000, but I didn't see any dramatic change, as long as the simulation interval is the same. So, it seems that there is really only some issue in solving some of the algebraic loops in the system.
Also, I'm not sure whether -abortSlowSimulation would abort the simulation anyway, given the large number of discrete events, even if we fixed this issue; we have to investigate this separately, and maybe increase the threshold of that parameter to a larger value. @AnHeuermann, could you direct me to the point in the source code where -abortSlowSimulation is taken into account?
Going back to the singularities, I noticed two fishy things going on. First, we get a large number of runtime warnings like
solver will try to handle division by zero at time 15811296.768: 1.0 - mulStaDX.watVapEva.XEvaWetBulOut
division leads to inf or nan at time 1.58113e+07, (a=1) / (b=0), where divisor b is: 1.0 - mulStaDX.watVapEva.XEvaWetBulOut
which is kind of weird, since mulStaDX.watVapEva.XEvaWetBulOut takes values between 0 and 0.018.
There are also many other warnings such as:
residualFunc3166: Iteration variable xloc[3] is nan.
residualFunc3166 failed at time=15811200.0034178. For more information please use -lv LOG_NLS.
that involve system 3166. I checked it with the Declarative Debugger; it turns out it is a system of 7 nonlinear equations and 7 iteration variables (basically, no tearing). Notice that the unknowns include mulStaDX.watVapEva.XEvaWetBulOut, and that the residuals include equations such as:
(residual) if not mulStaDX.watVapEva.off then mulStaDX.watVapEva.mTotWat_flow - mulStaDX.dxCoi.mWat_flow else (-273.15 + mulStaDX.watVapEva.TEvaWetBulOut) * (1006.0 * (1.0 - mulStaDX.vol.dynBal.medium.Xi[1]) + 1860.0 * mulStaDX.watVapEva.XiSatRefOut + 4184.0 * (mulStaDX.vol.dynBal.medium.Xi[1] - mulStaDX.watVapEva.XiSatRefOut)) - ((-273.15 + mulStaDX.dxCoi.TEvaIn) * (1006.0 * (1.0 - mulStaDX.vol.dynBal.medium.Xi[1]) + 1860.0 * mulStaDX.vol.dynBal.medium.Xi[1]) + 2.5010145e6 * (mulStaDX.vol.dynBal.medium.Xi[1] - mulStaDX.watVapEva.XiSatRefOut)) = 0
or
(residual) if not mulStaDX.watVapEva.off then mulStaDX.watVapEva.dX else mulStaDX.watVapEva.XiSatRefOut - (1.0 - mulStaDX.vol.dynBal.medium.Xi[1]) * mulStaDX.watVapEva.XEvaWetBulOut / (1.0 - mulStaDX.watVapEva.XEvaWetBulOut) = 0
which looks like the one that causes the division by zero. My guess is that this system is the cause of all those warnings. Dymola reports the following figures about strong components:
Sizes of linear systems of equations: {5, 5, 5}
Sizes after manipulation of the linear systems: {0, 0, 0}
Sizes of nonlinear systems of equations: {3, 3, 1, 3, 2, 3, 3, 1, 3, 2, 3, 4, 1, 3, 2, 3}
Sizes after manipulation of the nonlinear systems: {1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1}
where "manipulation" basically means tearing. Note that there is no strong component of size 7.
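For context, tearing takes an algebraic loop, picks a small subset of iteration variables, and computes the remaining unknowns explicitly from them, so the nonlinear solver only iterates on the chosen few. A toy Modelica sketch (hypothetical variables, unrelated to the Buildings models) of a size-3 loop that tears down to a single iteration variable:
model TearingSketch "Toy algebraic loop: three unknowns, torn to one iteration variable"
  Real x(start = 1) "iteration variable chosen by tearing";
  Real y "torn variable, explicit once x is known";
  Real z "torn variable, explicit once x and y are known";
equation
  y = 2*x + 1;          // solvable explicitly for y
  z = x*y;              // solvable explicitly for z
  0 = z + exp(x) - 10;  // residual equation, driven to zero by iterating on x
end TearingSketch;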
I dug further into Dymola's dsmodel.mof file, and this is how Dymola computes mulStaDX.watVapEva.XEvaWetBulOut:
// Nonlinear system of equations
// Tag: simulation.nonlinear[11]
// It depends on the following parameters:
// mulStaDX.watVapEva.nomVal.p_nominal
// It depends on the following timevarying variables:
// mulStaDX.TIn
// mulStaDX.vol.dynBal.medium.Xi[1]
// mulStaDX.watVapEva.on
// Start values for iteration variables of non-linear system of 1 equations:
// mulStaDX.watVapEva.TEvaWetBulOut(start = 288.15)
algorithm // Torn part
mulStaDX.watVapEva.XEvaWetBulOut := (if mulStaDX.watVapEva.on then 0 else
smooth(99, 0.621964713077499/(mulStaDX.watVapEva.nomVal.p_nominal/smooth(99,
611.657*exp(17.2799-4102.99/(mulStaDX.watVapEva.TEvaWetBulOut-35.719)))-
0.378035286922501)));
mulStaDX.watVapEva.XiSatRefOut := (if mulStaDX.watVapEva.on then 0 else (1-
mulStaDX.vol.dynBal.medium.Xi[1])*mulStaDX.watVapEva.XEvaWetBulOut/(1-
mulStaDX.watVapEva.XEvaWetBulOut));
equation // Residual equations
0 = (if mulStaDX.watVapEva.on then mulStaDX.watVapEva.TEvaWetBulOut-293.15
else (mulStaDX.watVapEva.TEvaWetBulOut-273.15)*(1006.0*(1-
mulStaDX.vol.dynBal.medium.Xi[1])+1860.0*mulStaDX.watVapEva.XiSatRefOut+
4184.0*(mulStaDX.vol.dynBal.medium.Xi[1]-mulStaDX.watVapEva.XiSatRefOut))-
((mulStaDX.TIn-273.15)*(1006.0*(1-mulStaDX.vol.dynBal.medium.Xi[1])+1860.0
*mulStaDX.vol.dynBal.medium.Xi[1])+2501014.5*(mulStaDX.vol.dynBal.medium.Xi[1]
-mulStaDX.watVapEva.XiSatRefOut)));
so it manages to isolate a system with three unknowns mulStaDX.watVapEva.TEvaWetBulOut, mulStaDX.watVapEva.XEvaWetBulOut, and mulStaDX.watVapEva.XiSatRefOut instead of seven, and then to tear it down to one iteration variable mulStaDX.watVapEva.TEvaWetBulOut. This is much better than what the OMC backend can manage.
The original equations are found in Buildings.Fluid.DXSystems.Cooling.BaseClasses.Evaporation:
if on then
dX = 0;
mEva_flow = 0;
mTotWat_flow = mWat_flow;
der(m) = -mWat_flow;
TEvaWetBulOut = 293.15;
XEvaWetBulOut = 0;
XiSatRefOut = 0;
else
XiSatRefOut=(1-XEvaOut)*XEvaWetBulOut/(1-XEvaWetBulOut);
XEvaWetBulOut = Buildings.Utilities.Psychrometrics.Functions.X_pSatpphi(
pSat = Buildings.Utilities.Psychrometrics.Functions.saturationPressureLiquid(TEvaWetBulOut),
p = nomVal.p_nominal,
phi = 1);
(TEvaWetBulOut-T_ref) * (
(1-XEvaOut) * cpAir_nominal +
XiSatRefOut * cpSte_nominal +
(XEvaOut-XiSatRefOut) * cpWatLiq_nominal) =
(TEvaOut-T_ref) * (
(1-XEvaOut) * cpAir_nominal +
XEvaOut * cpSte_nominal) +
(XEvaOut-XiSatRefOut) * Buildings.Utilities.Psychrometrics.Constants.h_fg;
dX = smooth(1, noEvent(
Buildings.Utilities.Math.Functions.spliceFunction(
pos=XEvaOut,
neg=XEvaWetBulOut,
x=abs(mAir_flow)-nomVal.m_flow_nominal/2,
deltax=nomVal.m_flow_nominal/3)))
- XEvaWetBulOut;
mEva_flow = -dX * smooth(1, noEvent(
Buildings.Utilities.Math.Functions.spliceFunction(
pos=if abs(mAir_flow) > mAir_flow_small/3 then
abs(mAir_flow) * (1-Modelica.Math.exp(-K2*m*abs(mAir_flow)^(-0.2))) else 0,
neg=K2*mAir_flow_small^(-0.2)*m*mAir_flow^2,
x=abs(mAir_flow)- 2*mAir_flow_small/3,
deltax=2*mAir_flow_small/6)));
der(m) = -mEva_flow;
mTotWat_flow = mWat_flow + mEva_flow;
end if;
So, it seems that the backend is not smart enough in how it couples the various equations in the two branches, so it ends up generating a single strong component. The first question to @phannebohm is: do you think this could be improved?
The second question is: how can the division by (1-XEvaWetBulOut) become a division by zero? This is really odd to me, since XEvaWetBulOut is always much smaller than 1. The only explanation I can find is that XEvaWetBulOut is declared as:
protected Real mulStaDX.watVapEva.XEvaWetBulOut(quantity = "MassFraction", unit = "1", min = 0.0, max = 1.0);
so it may be that during iterations XEvaWetBulOut goes above 1.0 and is then limited by the solver to its maximum value 1.0. Is this feature implemented in OMC's solvers?
If that is the case, then the problems could be easily fixed by putting a more reasonable max value on that variable, say 0.05: at room temperature, it is impossible that the concentration of water vapour in air gets higher than a few percentage points, and it most definitely can never reach 100%.
Hmm, I added max = 0.05 to the declaration of XEvaWetBulOut but I still get divisions by zero. I turned on -lv=LOG_NLS_V; it turns out the first divisions by zero are generated by system 1739, which is the same as 3166, but for the lambda0 initial equations.
The Newton solver starts from the start values, does a first step which seems reasonable, as the scaled residuals are below 1e-4, but then after the second step LAPACK fails and the solver gives up. Then, the hybrid solver starts, tries one iteration that gets mEvaFlow = NaN, then it tries changing the initial values "a bit", which means it changes them by their nominal value (which is not really a bit, it's a lot!), hence setting XEvaWetBulOut = 1. Boom.
@AnHeuermann, I guess this should be improved in the hybrid solver code; tentative values outside the min-max interval should of course never be chosen.
I then also changed the nominal values of the concentrations, not only their max attributes, and indeed the division by zero warnings were gone 😃. However, the "xloc[3] is NaN" warnings (xloc[3] being mEva_flow) were still there.
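For reference, here is a minimal sketch of the kind of declaration change tried above; the attribute values are illustrative, and the actual declaration lives in Buildings.Fluid.DXSystems.Cooling.BaseClasses.Evaporation:
// Illustrative only: tightened max and nominal for the wet bulb outlet mass
// fraction, so a solver perturbation of "one nominal value" can no longer
// push the variable towards 1 and blow up the division by (1 - XEvaWetBulOut).
Modelica.Units.SI.MassFraction XEvaWetBulOut(
  min = 0,
  max = 0.05,
  nominal = 0.01) "Water vapour mass fraction at wet bulb conditions at the coil outlet";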
We could go down the rabbit hole of adding min-max-nominal values to all those 7 iteration variables, and then improve the solvers so that they actually make good use of that information. But I'm not sure this is the right way to go, though it wouldn't hurt if the homotopy solver stayed within bounds.
The bottom line is that the system coming from the 7 conditional equations should really be causalized and torn properly, as Dymola does. If we did, we'd get:
- one torn system with 1 iteration variable mulStaDX.watVapEva.TEvaWetBulOut and two torn variables mulStaDX.watVapEva.XEvaWetBulOut and mulStaDX.watVapEva.XiSatRefOut
- one explicitly solved equation to compute der(mulStaDX.watVapEva.m)
- one explicitly solved equation to compute mulStaDX.watVapEva.dX
- one explicitly solved equation to compute mulStaDX.watVapEva.mEva_flow
- one explicitly solved equation to compute mTotWat_flow
This would be way more robust and less troublesome to handle than getting the 7-unknown nonlinear system to converge properly: as I understand it, the behaviour of the torn nonlinear system is very smooth w.r.t. its temperature iteration variable, and there is no need for initial guesses for the remaining 6 unknowns, which are computed explicitly.
@phannebohm the ball is yours 😄
> xloc is the local name for the array of iteration variables in an algebraic loop inside our generated C code. We do a NaN check at runtime, currently for iteration variables of algebraic loops only. It is possible that some previous computation led to the NaN but no check was done before entering the algebraic loop computation.
As I understand it, the NaN was generated while computing the residuals during the Newton iterations.
> The Newton solver starts from the start values, does a first step which seems reasonable, as the scaled residuals are below 1e-4, but then after the second step LAPACK fails and the solver gives up. Then, the hybrid solver starts, tries one iteration that gets mEvaFlow = NaN, then it tries changing the initial values "a bit", which means it changes them by their nominal value (which is not really a bit, it's a lot!), hence setting XEvaWetBulOut = 1. Boom.
> We could go down the rabbit hole of adding min-max-nominal values to all those 7 iteration variables, and then improve the solvers so that they actually make good use of that information. But I'm not sure this is the right way to go, though it wouldn't hurt if the homotopy solver stayed within bounds.
This behaviour where the nominal values get selected for initial values and cause singularities is a very interesting observation! We also sometimes get weird initialisation problems that we don't understand. Next time I'll add this trick to the toolbox, and see if tweaking the nominal values helps!
I guess what I'm saying is: improving the solver behaviour in this respect still seems worthwhile! :-)
@bilderbuchi see #12193.
> So, it seems that the backend is not smart enough in how it couples the various equations in the two branches, so it ends up generating a single strong component. The first question to @phannebohm is: do you think this could be improved?
AFAIK, the backend doesn't even try to split the IF-equation currently; the whole IF-equation becomes a fully connected subgraph for the matching, i.e., "every variable occurs in every equation" in the eyes of the compiler.
This is something @kabdelhak and I talked about for the new backend. Splitting the IF-equation into single equations would make the adjacency much more sparse, and we could even solve some parts explicitly, as in SpaceCooling.
If we knew which equations belong together across all IF-branches, then we could easily split the IF-equation into size-1 equations. There might be some heuristics to get this right some of the time. But if we are wrong, it would probably be difficult to change that decision after causalization.
Another approach might be to have a kind of tearing that, after the IF-equation is identified as part of an algebraic loop, tries to find a sorting for each branch and then perhaps split parts of the IF-equation.
Not sure when we will have time to work on this, but hopefully it's sooner rather than later.
I guess some simple heuristics could already do wonders. If you have this pattern
if <cond> then
v1 = <expr1>;
v2 = <expr2>;
v3 = <expr3>;
<other equations1>
else
v3 = <expr6>;
v1 = <expr4>;
<other equations2>
v2 = <expr5>;
end if;
then you could replace it with
v1 = if <cond> then <expr1> else <expr4>;
v2 = if <cond> then <expr2> else <expr5>;
v3 = if <cond> then <expr3> else <expr6>;
if <cond> then
<other equations1>
else
<other equations2>
end if;
In fact, in models such as the one mentioned in this ticket, the second form is what you would actually want to write, but you use conditional equations as shorthand.
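Applied to the Evaporation equations quoted earlier, the rewrite would look roughly like this (a hand-written sketch of the target form, not the output of any tool; the implicit wet bulb energy balance is kept as a single residual equation, matching the torn structure Dymola finds):
// Each unknown is now paired with a single equation, valid in both branches.
XEvaWetBulOut = if on then 0 else
  Buildings.Utilities.Psychrometrics.Functions.X_pSatpphi(
    pSat = Buildings.Utilities.Psychrometrics.Functions.saturationPressureLiquid(TEvaWetBulOut),
    p = nomVal.p_nominal,
    phi = 1);
XiSatRefOut = if on then 0 else (1 - XEvaOut)*XEvaWetBulOut/(1 - XEvaWetBulOut);
// Implicit energy balance at the wet bulb point, as one residual equation:
0 = if on then TEvaWetBulOut - 293.15 else
  (TEvaWetBulOut - T_ref)*((1 - XEvaOut)*cpAir_nominal
    + XiSatRefOut*cpSte_nominal + (XEvaOut - XiSatRefOut)*cpWatLiq_nominal)
  - ((TEvaOut - T_ref)*((1 - XEvaOut)*cpAir_nominal + XEvaOut*cpSte_nominal)
    + (XEvaOut - XiSatRefOut)*Buildings.Utilities.Psychrometrics.Constants.h_fg);
dX = if on then 0 else smooth(1, noEvent(
  Buildings.Utilities.Math.Functions.spliceFunction(
    pos = XEvaOut,
    neg = XEvaWetBulOut,
    x = abs(mAir_flow) - nomVal.m_flow_nominal/2,
    deltax = nomVal.m_flow_nominal/3))) - XEvaWetBulOut;
mEva_flow = if on then 0 else -dX*smooth(1, noEvent(
  Buildings.Utilities.Math.Functions.spliceFunction(
    pos = if abs(mAir_flow) > mAir_flow_small/3 then
      abs(mAir_flow)*(1 - Modelica.Math.exp(-K2*m*abs(mAir_flow)^(-0.2))) else 0,
    neg = K2*mAir_flow_small^(-0.2)*m*mAir_flow^2,
    x = abs(mAir_flow) - 2*mAir_flow_small/3,
    deltax = 2*mAir_flow_small/6)));
der(m) = if on then -mWat_flow else -mEva_flow;
mTotWat_flow = if on then mWat_flow else mWat_flow + mEva_flow;
With this form, the matching sees one equation per variable; only TEvaWetBulOut (through the residual), XEvaWetBulOut, and XiSatRefOut remain coupled, which is exactly the size-3 system (torn down to 1 iteration variable) that Dymola reports.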