[DEFECT] Throwing exception when running in parallel if Code fails.
Thank you for the defect report
- [X] I am using the latest version of RAVEN.
- [X] I have read the Wiki.
- [ ] I have created a minimum, reproducible example that demonstrates the defect.
Defect Description
A RAVEN-runs-RAVEN input fails with this error when run in parallel:
Traceback (most recent call last):
File "/home/user/raven/raven_framework.py", line 26, in <module>
sys.exit(main(True))
File "/home/user/raven/ravenframework/Driver.py", line 203, in main
raven()
File "/home/user/raven/ravenframework/Driver.py", line 156, in raven
simulation.run()
File "/home/user/raven/ravenframework/Simulation.py", line 892, in run
self.executeStep(stepInputDict, stepInstance)
File "/home/user/raven/ravenframework/Simulation.py", line 825, in executeStep
stepInstance.takeAstep(stepInputDict)
File "/home/user/raven/ravenframework/Steps/Step.py", line 317, in takeAstep
self._localTakeAstepRun(inDictionary)
File "/home/user/raven/ravenframework/Steps/MultiRun.py", line 178, in _localTakeAstepRun
myLambda([finishedJob,outputs[outIndex]])
File "/home/user/raven/ravenframework/Steps/MultiRun.py", line 109, in <lambda>
self._outputCollectionLambda.append( (lambda x: inDictionary['Model'].collectOutput(x[0],x[1]), outIndex) )
File "/home/user/raven/ravenframework/Models/Code.py", line 774, in collectOutput
self._replaceVariablesNamesWithAliasSystem(evaluation, 'input',True)
File "/home/user/raven/ravenframework/Models/Model.py", line 302, in _replaceVariablesNamesWithAliasSystem
found = sampledVars.pop(whichVar,[notFound])
AttributeError: 'Error' object has no attribute 'pop'
The underlying Pyomo solve had failed:
WARNING: Loading a SolverResults object with a warning status into
model.name="unknown";
- termination condition: infeasible
- message from solver: <undefined>
DEBUGG ... solve was unsuccessful!
DEBUGG ... status: warning
DEBUGG ... termination: infeasible
Traceback (most recent call last):
File "/home/user/raven/ravenframework/Models/EnsembleModel.py", line 718, in __advanceModel
evaluation = modelToExecute['Instance'].evaluateSample.original_function(modelToExecute['Instance'], origInputList, samplerType, inputKwargs)
File "/home/user/raven/ravenframework/Models/ExternalModel.py", line 324, in evaluateSample
result,instSelf = self._externalRun(inRun,)
File "/home/user/raven/ravenframework/Models/ExternalModel.py", line 266, in _externalRun
self.sim.run(externalSelf, InputDict)
File "/home/user/raven/plugins/HERON/src/DispatchManager.py", line 706, in run
dispatch, metrics = runner.run(raven_vars)
File "/home/user/raven/plugins/HERON/src/DispatchManager.py", line 206, in run
all_dispatch, metrics = self._do_dispatch(meta, all_structure, project_life, interp_years, segs, seg_type)
File "/home/user/raven/plugins/HERON/src/DispatchManager.py", line 282, in _do_dispatch
dispatch = self._dispatcher.dispatch(self._case, self._components, self._sources, meta)
File "/home/user/raven/plugins/HERON/src/dispatch/pyomo_dispatch.py", line 195, in dispatch
initial_levels, meta)
File "/home/user/raven/plugins/HERON/src/dispatch/pyomo_dispatch.py", line 318, in dispatch_window
raise RuntimeError
RuntimeError
Steps to Reproduce
Run a sufficiently complicated HERON input in which some runs fail.
Note that the same input works when run in serial.
Expected Behavior
If it works in serial, it should work in parallel.
Screenshots and Input Files
No response
OS
Linux
OS Version
No response
Dependency Manager
CONDA
For Change Control Board: Issue Review
- [ ] Is it tagged with a type: defect or task?
- [ ] Is it tagged with a priority: critical, normal or minor?
- [ ] If it will impact requirements or requirements tests, is it tagged with requirements?
- [ ] If it is a defect, can it cause wrong results for users? If so an email needs to be sent to the users.
- [ ] Is a rationale provided? (Such as explaining why the improvement is needed or why current code is wrong.)
For Change Control Board: Issue Closure
- [ ] If the issue is a defect, is the defect fixed?
- [ ] If the issue is a defect, is the defect tested for in the regression test system? (If not explain why not.)
- [ ] If the issue can impact users, has an email to the users group been written (the email should specify if the defect impacts stable or master)?
- [ ] If the issue is a defect, does it impact the latest release branch? If yes, is there any issue tagged with release (create if needed)?
- [ ] If the issue is being closed without a pull request, has an explanation of why it is being closed been provided?
Possibly this could be fixed by changing the code in Model.py line 302 from:
found = sampledVars.pop(whichVar,[notFound])
to
if hasattr(sampledVars, 'pop'):
  found = sampledVars.pop(whichVar,[notFound])
else:
  found = [notFound]
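A self-contained sketch of that guard, runnable outside RAVEN. The `Error` class, the `NOT_FOUND` sentinel, and the `pop_sampled` helper are stand-ins invented for illustration; they are not RAVEN's actual names. A guard like this would only stop the `AttributeError`; the failed evaluation would still reach output collection.

```python
class Error:
  """Stand-in for the object a failed job returns (hypothetical, not RAVEN's class)."""

NOT_FOUND = object()  # sentinel playing the role of `notFound`

def pop_sampled(sampledVars, whichVar):
  """Pop whichVar from sampledVars, tolerating objects that have no pop method."""
  if hasattr(sampledVars, 'pop'):
    return sampledVars.pop(whichVar, [NOT_FOUND])
  # A failed evaluation (for example an Error instance) has nothing to pop.
  return [NOT_FOUND]

ok = pop_sampled({'x': [1.0]}, 'x')        # variable present in a normal dict
missing = pop_sampled({'x': [1.0]}, 'y')   # variable absent from a normal dict
failed = pop_sampled(Error(), 'x')         # evaluation is an Error, not a dict
```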
I think the heart of this problem is that finishedJob.getReturnCode() is not evaluating to a nonzero value for a failed code run. See MultiRun around line 176: if finishedJob.getReturnCode() == 0:
It's possible that some part of the EnsembleModel that HERON is running is either (1) not correctly flagging the job as a failed run, or (2) not handling the apparent failure. I'm not totally sure which should be happening, but surely half and half is not correct.
Hm, I wonder about this code in InternalRunner:
def getEvaluation(self):
  """
    Method to return the results of the function evaluation associated with
    this Runner
    @ In, None
    @ Out, returnValue, object or Error, whatever the method that this
      instance is executing returns, or if the job failed, will return an
      Error
  """
  if self.isDone():
    self._collectRunnerResponse()
    if self.runReturn is None:
      self.returnCode = -1
      return Error()
    return self.runReturn
  else:
    return Error()
It can return Error without setting returnCode to -1. I'll try changing that.
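A minimal sketch of that change, assuming the fix is simply to set `returnCode` on the not-done path as well. `ToyRunner` and its constructor arguments are hypothetical stand-ins; the real InternalRunner has more state and also calls `_collectRunnerResponse()`, which is omitted here.

```python
class Error:
  """Stand-in for the failure marker (hypothetical, not RAVEN's class)."""

class ToyRunner:
  """Toy model of a runner; not the real InternalRunner."""
  def __init__(self, done, runReturn=None):
    self._done = done
    self.runReturn = runReturn
    self.returnCode = 0

  def isDone(self):
    return self._done

  def getReturnCode(self):
    return self.returnCode

  def getEvaluation(self):
    """Return the run result, or Error() with returnCode set to -1 on every failure path."""
    if self.isDone() and self.runReturn is not None:
      return self.runReturn
    # Both failure paths (not done, or done with no result) now flag the job.
    self.returnCode = -1
    return Error()

notDone = ToyRunner(done=False)
notDoneResult = notDone.getEvaluation()
good = ToyRunner(done=True, runReturn={'x': 1.0})
goodResult = good.getEvaluation()
```

Whether a job that has not finished yet should be flagged as failed is a separate question that this sketch does not settle.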
I have managed to get this error even with the fix from #1899, so that was not the full solution.
Added:
diff --git a/ravenframework/Models/Code.py b/ravenframework/Models/Code.py
index 3527d27f6..c6a4000fc 100644
--- a/ravenframework/Models/Code.py
+++ b/ravenframework/Models/Code.py
@@ -770,6 +770,9 @@ class Code(Model):
       @ Out, None
     """
     evaluation = finishedJob.getEvaluation()
+    if not hasattr(evaluation, 'pop'):
+      print("evaluation", evaluation, "job", finishedJob, finishedJob.getReturnCode())
+
     self._replaceVariablesNamesWithAliasSystem(evaluation, 'input',True)
     # in the event a batch is run, the evaluations will be a dict as {'RAVEN_isBatch':True, 'realizations': [...]}
and got:
evaluation job <ravenframework.Runners.DistributedMemoryRunner.DistributedMemoryRunner object at 0x145fe072d190> -1
Traceback (most recent call last):
File "/home/fred/raven/raven_framework.py", line 26, in <module>
sys.exit(main(True))
File "/home/fred/raven/ravenframework/Driver.py", line 203, in main
raven()
File "/home/fred/raven/ravenframework/Driver.py", line 156, in raven
simulation.run()
File "/home/fred/raven/ravenframework/Simulation.py", line 892, in run
self.executeStep(stepInputDict, stepInstance)
File "/home/fred/raven/ravenframework/Simulation.py", line 825, in executeStep
stepInstance.takeAstep(stepInputDict)
File "/home/fred/raven/ravenframework/Steps/Step.py", line 317, in takeAstep
self._localTakeAstepRun(inDictionary)
File "/home/fred/raven/ravenframework/Steps/MultiRun.py", line 178, in _localTakeAstepRun
myLambda([finishedJob,outputs[outIndex]])
File "/home/fred/raven/ravenframework/Steps/MultiRun.py", line 109, in <lambda>
self._outputCollectionLambda.append( (lambda x: inDictionary['Model'].collectOutput(x[0],x[1]), outIndex) )
File "/home/fred/raven/ravenframework/Models/Code.py", line 777, in collectOutput
self._replaceVariablesNamesWithAliasSystem(evaluation, 'input',True)
File "/home/fred/raven/ravenframework/Models/Model.py", line 302, in _replaceVariablesNamesWithAliasSystem
found = sampledVars.pop(whichVar,[notFound])
AttributeError: 'Error' object has no attribute 'pop'
So output collection is still running even though returnCode is -1 by the time the job is processed.
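One possible reading of that output, sketched as a toy model and not confirmed against RAVEN: in the getEvaluation code quoted earlier in this thread, returnCode is set to -1 only inside getEvaluation itself. If that also holds for the runner used here, a check made before getEvaluation is called would see a different value than the debug print, which runs after it. `ToyJob` is a hypothetical stand-in, not the DistributedMemoryRunner.

```python
class Error:
  """Stand-in for the failure marker (hypothetical, not RAVEN's class)."""

class ToyJob:
  """Toy job whose run produced no result, i.e. it failed."""
  def __init__(self):
    self.runReturn = None
    self.returnCode = 0

  def getReturnCode(self):
    return self.returnCode

  def getEvaluation(self):
    if self.runReturn is None:
      # Side effect: the failure is only recorded here.
      self.returnCode = -1
      return Error()
    return self.runReturn

job = ToyJob()
codeBefore = job.getReturnCode()  # what an earlier `== 0` check would see
evaluation = job.getEvaluation()  # flags the failure as a side effect
codeAfter = job.getReturnCode()   # what a print placed after getEvaluation sees
```

If this ordering is what is happening, testing the type of the evaluation may be a more reliable failure signal than the return code alone.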