When we retry a workflow, its handling of the `_error` changes depending on resume option

Open to-lz1 opened this issue 4 years ago • 1 comments

Digdag version: 0.9.42

Our workflow defines an `_error` parameter alongside its main tasks.

example:

timezone: Asia/Tokyo

_error:
  echo>: here is _error configuration.

+test:
  +task1:
    sh>: sleep 2; exit 0
  +task2:
    sh>: sleep 2; exit 1

In this example, `task2` fails. Assume we receive a notification from `_error` and then retry the attempt, as we usually do. On retry, whether `_error` is called again depends on which resume option we use.

If we use `--resume`

digdag retry [id] --resume --latest-revision

The workflow fails again, and the `_error` parameter is ignored.

2020-07-04 12:36:30 +0900 [INFO] (XNIO-1 task-50): Starting a new session project id=1 workflow name=sample_1 session_time=2020-07-04T12:21:42+09:00
2020-07-04 12:36:30 +0900 [INFO] (0092@[0:sample]+sample_1+test+task2): sh>: sleep 2; exit 1
2020-07-04 12:36:32 +0900 [ERROR] (0092@[0:sample]+sample_1+test+task2): Task failed with unexpected error: Command failed with code 1
java.lang.RuntimeException: Command failed with code 1
	at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runTask(ShOperatorFactory.java:143)
	at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
	at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:315)
	at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:257)
	at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:137)
	at io.digdag.core.agent.OperatorManager$$Lambda$253/0000000000000000.run(Unknown Source)
	at io.digdag.core.agent.ExtractArchiveWorkspaceManager.withExtractedArchive(ExtractArchiveWorkspaceManager.java:77)
	at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:135)
	at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:119)
	at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:127)
	at io.digdag.core.agent.MultiThreadAgent$$Lambda$249/0000000000000000.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:821)

If we use `--resume-from`

digdag retry [id] --resume-from task2 --latest-revision

The workflow fails again, and the operation defined in the `_error` parameter is called.

2020-07-04 12:40:24 +0900 [INFO] (XNIO-1 task-61): Starting a new session project id=1 workflow name=sample_1 session_time=2020-07-04T12:39:59+09:00
2020-07-04 12:40:24 +0900 [INFO] (0112@[0:sample]+sample_1+test+task2): sh>: sleep 2; exit 1
2020-07-04 12:40:26 +0900 [ERROR] (0112@[0:sample]+sample_1+test+task2): Task failed with unexpected error: Command failed with code 1
java.lang.RuntimeException: Command failed with code 1
	at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runTask(ShOperatorFactory.java:143)
	at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
	at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:315)
	at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:257)
	at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:137)
	at io.digdag.core.agent.OperatorManager$$Lambda$253/0000000000000000.run(Unknown Source)
	at io.digdag.core.agent.ExtractArchiveWorkspaceManager.withExtractedArchive(ExtractArchiveWorkspaceManager.java:77)
	at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:135)
	at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:119)
	at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:127)
	at io.digdag.core.agent.MultiThreadAgent$$Lambda$249/0000000000000000.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:821)
2020-07-04 12:40:26 +0900 [INFO] (0112@[0:sample]+sample_1^failure-alert): type: notify
2020-07-04 12:40:26 +0900 [INFO] (0110@[0:sample]+sample_1^error): echo>: here is _error configuration.
here is _error configuration.
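
To compare the two retries without reading the full logs, a small check like the following might help. It is only a sketch: it assumes the `_error` task always appears with a task name ending in `^error`, as in the log above, and the function name `error_task_ran` and the regular expression are our own, not part of digdag.

```python
import re

# Captures the task-name part of a log line, e.g. "+sample_1^error" from
# "[INFO] (0110@[0:sample]+sample_1^error): ...".
# The pattern is inferred from the logs in this issue (an assumption).
TASK_NAME = re.compile(r"\[(?:INFO|WARN|ERROR)\] \(\d+@\[[^\]]*\]([^)]*)\)")


def error_task_ran(log_text: str) -> bool:
    """Return True if any log line belongs to a task whose name ends with ^error."""
    for line in log_text.splitlines():
        match = TASK_NAME.search(line)
        if match and match.group(1).endswith("^error"):
            return True
    return False
```

Given the saved log text of each retry, this should return `False` for the `--resume` log and `True` for the `--resume-from` log shown above.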

Is this the expected behavior? If so, should we always use `--resume-from` when we want `_error` to send a notification on a retried workflow?

Jul 04 '20 03:07 to-lz1
