[DEFECT] Throwing exception when running in parallel if Code fails.

Open joshua-cogliati-inl opened this issue 2 years ago • 5 comments

Thank you for the defect report

Defect Description

In a RAVEN-runs-RAVEN input, the following error is raised when running in parallel:

Traceback (most recent call last):
  File "/home/user/raven/raven_framework.py", line 26, in <module>
    sys.exit(main(True))
  File "/home/user/raven/ravenframework/Driver.py", line 203, in main
    raven()
  File "/home/user/raven/ravenframework/Driver.py", line 156, in raven
    simulation.run()
  File "/home/user/raven/ravenframework/Simulation.py", line 892, in run
    self.executeStep(stepInputDict, stepInstance)
  File "/home/user/raven/ravenframework/Simulation.py", line 825, in executeStep
    stepInstance.takeAstep(stepInputDict)
  File "/home/user/raven/ravenframework/Steps/Step.py", line 317, in takeAstep
    self._localTakeAstepRun(inDictionary)
  File "/home/user/raven/ravenframework/Steps/MultiRun.py", line 178, in _localTakeAstepRun
    myLambda([finishedJob,outputs[outIndex]])
  File "/home/user/raven/ravenframework/Steps/MultiRun.py", line 109, in <lambda>
    self._outputCollectionLambda.append( (lambda x: inDictionary['Model'].collectOutput(x[0],x[1]), outIndex) )
  File "/home/user/raven/ravenframework/Models/Code.py", line 774, in collectOutput
    self._replaceVariablesNamesWithAliasSystem(evaluation, 'input',True)
  File "/home/user/raven/ravenframework/Models/Model.py", line 302, in _replaceVariablesNamesWithAliasSystem
    found = sampledVars.pop(whichVar,[notFound])
AttributeError: 'Error' object has no attribute 'pop'

The underlying pyomo solve had failed:

WARNING: Loading a SolverResults object with a warning status into
    model.name="unknown";
      - termination condition: infeasible
      - message from solver: <undefined>
DEBUGG ... solve was unsuccessful!
DEBUGG ... status: warning
DEBUGG ... termination: infeasible
Traceback (most recent call last):
          File "/home/user/raven/ravenframework/Models/EnsembleModel.py", line 718, in __advanceModel
            evaluation = modelToExecute['Instance'].evaluateSample.original_function(modelToExecute['Instance'], origInputList, samplerType, inputKwargs)
          File "/home/user/raven/ravenframework/Models/ExternalModel.py", line 324, in evaluateSample
            result,instSelf = self._externalRun(inRun,)
          File "/home/user/raven/ravenframework/Models/ExternalModel.py", line 266, in _externalRun
            self.sim.run(externalSelf, InputDict)
          File "/home/user/raven/plugins/HERON/src/DispatchManager.py", line 706, in run
            dispatch, metrics = runner.run(raven_vars)
          File "/home/user/raven/plugins/HERON/src/DispatchManager.py", line 206, in run
            all_dispatch, metrics = self._do_dispatch(meta, all_structure, project_life, interp_years, segs, seg_type)
          File "/home/user/raven/plugins/HERON/src/DispatchManager.py", line 282, in _do_dispatch
            dispatch = self._dispatcher.dispatch(self._case, self._components, self._sources, meta)
          File "/home/user/raven/plugins/HERON/src/dispatch/pyomo_dispatch.py", line 195, in dispatch
            initial_levels, meta)
          File "/home/user/raven/plugins/HERON/src/dispatch/pyomo_dispatch.py", line 318, in dispatch_window
            raise RuntimeError
        RuntimeError

Steps to Reproduce

Run a sufficiently complicated HERON input that has some failures.

Note that the same input would have worked if it had been run in serial.

Expected Behavior

If it works in serial, it should work in parallel.

Screenshots and Input Files

No response

OS

Linux

OS Version

No response

Dependency Manager

CONDA

For Change Control Board: Issue Review

  • [ ] Is it tagged with a type: defect or task?
  • [ ] Is it tagged with a priority: critical, normal or minor?
  • [ ] If it will impact requirements or requirements tests, is it tagged with requirements?
  • [ ] If it is a defect, can it cause wrong results for users? If so an email needs to be sent to the users.
  • [ ] Is a rationale provided? (Such as explaining why the improvement is needed or why current code is wrong.)

For Change Control Board: Issue Closure

  • [ ] If the issue is a defect, is the defect fixed?
  • [ ] If the issue is a defect, is the defect tested for in the regression test system? (If not explain why not.)
  • [ ] If the issue can impact users, has an email to the users group been written (the email should specify if the defect impacts stable or master)?
  • [ ] If the issue is a defect, does it impact the latest release branch? If yes, is there any issue tagged with release (create if needed)?
  • [ ] If the issue is being closed without a pull request, has an explanation of why it is being closed been provided?

joshua-cogliati-inl avatar Jul 19 '22 20:07 joshua-cogliati-inl

Possibly this could be fixed by changing the code in Model.py line 302 from:

          found = sampledVars.pop(whichVar,[notFound])

to

          if hasattr(sampledVars, 'pop'):
            found = sampledVars.pop(whichVar,[notFound])
          else:
            found = [notFound]
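
A minimal, self-contained sketch of how that guard behaves. Error here is an empty stand-in class and pop_alias is a hypothetical helper, neither is RAVEN code; notFound only mirrors the sentinel name used in the snippet above.

```python
class Error:
  """Stand-in for the failure object a runner returns (assumption, not RAVEN's class)."""

notFound = object()  # sentinel, mirroring the name used in Model.py

def pop_alias(sampledVars, whichVar):
  """Hypothetical helper: pop whichVar if the container supports pop, else report not found."""
  if hasattr(sampledVars, 'pop'):
    return sampledVars.pop(whichVar, [notFound])
  # An Error object (or anything else without pop) falls through to "not found"
  return [notFound]

ok = pop_alias({'x': [1.0]}, 'x')       # variable present in a normal dict
missing = pop_alias({'x': [1.0]}, 'y')  # variable absent from a normal dict
failed = pop_alias(Error(), 'x')        # failed job: no AttributeError is raised
```

A guard like this would only stop the AttributeError; the failed job would still reach the output collection step.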

joshua-cogliati-inl avatar Jul 19 '22 20:07 joshua-cogliati-inl

I think the heart of this problem is that finishedJob.getReturnCode() is not evaluating to 1 for a failed code run. See MultiRun around line 176, if finishedJob.getReturnCode() == 0:.

It's possible that some part of the EnsembleModel that HERON is running is either (1) not correctly flagging the job as a failed run, or (2) not handling the apparent failure. I'm not totally sure which should be happening, but surely half and half is not correct.

PaulTalbot-INL avatar Jul 19 '22 20:07 PaulTalbot-INL

Hm, I wonder about this code in InternalRunner:

  def getEvaluation(self):
    """
      Method to return the results of the function evaluation associated with
      this Runner
      @ In, None
      @ Out, returnValue, object or Error, whatever the method that this
        instance is executing returns, or if the job failed, will return an
        Error
    """
    if self.isDone():
      self._collectRunnerResponse()
      if self.runReturn is None:
        self.returnCode = -1
        return Error()
      return self.runReturn
    else:
      return Error()

It can return Error without setting returnCode to -1. I'll try changing that.
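
A sketch of the change being considered, assuming the intent is that every path returning Error also marks the job as failed. SketchRunner and its done flag are hypothetical stand-ins for the real runner, and the call to _collectRunnerResponse is omitted; only getEvaluation, isDone, runReturn, returnCode and Error are names taken from the snippet above.

```python
class Error:
  """Stand-in for the failure marker (assumption)."""

class SketchRunner:
  """Hypothetical, stripped-down runner used only to illustrate the control flow."""
  def __init__(self, done, runReturn=None):
    self.done = done
    self.runReturn = runReturn
    self.returnCode = 0

  def isDone(self):
    return self.done

  def getEvaluation(self):
    """Return the result, or Error with returnCode set to -1 on every failure path."""
    if self.isDone() and self.runReturn is not None:
      return self.runReturn
    # Both "not done yet" and "done without a result" now flag the failure
    self.returnCode = -1
    return Error()

notDone = SketchRunner(done=False)
result = notDone.getEvaluation()
good = SketchRunner(done=True, runReturn={'x': 1})
goodResult = good.getEvaluation()
```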

joshua-cogliati-inl avatar Jul 19 '22 21:07 joshua-cogliati-inl

I have managed to get this error even with the fix from #1899, so that was not the full solution.

joshua-cogliati-inl avatar Jul 20 '22 18:07 joshua-cogliati-inl

Added:

diff --git a/ravenframework/Models/Code.py b/ravenframework/Models/Code.py
index 3527d27f6..c6a4000fc 100644
--- a/ravenframework/Models/Code.py
+++ b/ravenframework/Models/Code.py
@@ -770,6 +770,9 @@ class Code(Model):
       @ Out, None
     """
     evaluation = finishedJob.getEvaluation()
+    if not hasattr(evaluation, 'pop'):
+      print("evaluation", evaluation, "job", finishedJob, finishedJob.getReturnCode())
+
 
     self._replaceVariablesNamesWithAliasSystem(evaluation, 'input',True)
     # in the event a batch is run, the evaluations will be a dict as {'RAVEN_isBatch':True, 'realizations': [...]}

and got:

evaluation  job <ravenframework.Runners.DistributedMemoryRunner.DistributedMemoryRunner object at 0x145fe072d190> -1
Traceback (most recent call last):
  File "/home/fred/raven/raven_framework.py", line 26, in <module>
    sys.exit(main(True))
  File "/home/fred/raven/ravenframework/Driver.py", line 203, in main
    raven()
  File "/home/fred/raven/ravenframework/Driver.py", line 156, in raven
    simulation.run()
    simulation.run()
  File "/home/fred/raven/ravenframework/Simulation.py", line 892, in run
    self.executeStep(stepInputDict, stepInstance)
  File "/home/fred/raven/ravenframework/Simulation.py", line 825, in executeStep
    stepInstance.takeAstep(stepInputDict)
  File "/home/fred/raven/ravenframework/Steps/Step.py", line 317, in takeAstep
    self._localTakeAstepRun(inDictionary)
  File "/home/fred/raven/ravenframework/Steps/MultiRun.py", line 178, in _localTakeAstepRun
    myLambda([finishedJob,outputs[outIndex]])
  File "/home/fred/raven/ravenframework/Steps/MultiRun.py", line 109, in <lambda>
    self._outputCollectionLambda.append( (lambda x: inDictionary['Model'].collectOutput(x[0],x[1]), outIndex) )
  File "/home/fred/raven/ravenframework/Models/Code.py", line 777, in collectOutput
    self._replaceVariablesNamesWithAliasSystem(evaluation, 'input',True)
  File "/home/fred/raven/ravenframework/Models/Model.py", line 302, in _replaceVariablesNamesWithAliasSystem
    found = sampledVars.pop(whichVar,[notFound])
AttributeError: 'Error' object has no attribute 'pop'

So collectOutput is still being run even though returnCode is -1 by the time the job is processed.
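
One possible way to keep a failed job out of the collection step is to check the evaluation itself as well as the return code. This is only a sketch under assumptions: Error, FakeJob and collect are hypothetical stand-ins, not RAVEN classes or methods; only getEvaluation and getReturnCode are names taken from the diff above.

```python
class Error:
  """Stand-in for the failure marker returned by a runner (assumption)."""

class FakeJob:
  """Hypothetical finished job exposing the two calls seen in the diff above."""
  def __init__(self, evaluation, returnCode):
    self._evaluation = evaluation
    self._returnCode = returnCode

  def getEvaluation(self):
    return self._evaluation

  def getReturnCode(self):
    return self._returnCode

def collect(finishedJob, output):
  """Append the evaluation to output unless the job reports failure either way."""
  evaluation = finishedJob.getEvaluation()
  if isinstance(evaluation, Error) or finishedJob.getReturnCode() != 0:
    return False  # skip failed jobs instead of treating Error as a dict
  output.append(evaluation)
  return True

collected = []
accepted = collect(FakeJob({'x': 1.0}, 0), collected)
rejected = collect(FakeJob(Error(), -1), collected)
```

The sketch calls getEvaluation() before getReturnCode() on purpose: in the InternalRunner snippet quoted earlier in this thread, returnCode is only set to -1 inside getEvaluation(), so reading the return code first could still see a stale value.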

joshua-cogliati-inl avatar Jul 21 '22 13:07 joshua-cogliati-inl