digdag
digdag copied to clipboard
When we retry a workflow, its handling of the `_error` changes depending on resume option
Digdag version: 0.9.42
We have the _error
parameter and main tasks in our workflow.
example:
timezone: Asia/Tokyo
_error:
echo>: here is _error configuration.
+test:
+task1:
sh>: sleep 2; exit 0
+task2:
sh>: sleep 2; exit 1
In this example, we have an error in the task2
. And please assume we get some notification from the _error
and usually retry it. However, in retrying the attempt, whether _error
is called again is different depending on our option.
If we use --resume
digdag retry [id] --resume --latest-revision
The workflow failed again, and _error
parameter is ignored.
2020-07-04 12:36:30 +0900 [INFO] (XNIO-1 task-50): Starting a new session project id=1 workflow name=sample_1 session_time=2020-07-04T12:21:42+09:00
2020-07-04 12:36:30 +0900 [INFO] (0092@[0:sample]+sample_1+test+task2): sh>: sleep 2; exit 1
2020-07-04 12:36:32 +0900 [ERROR] (0092@[0:sample]+sample_1+test+task2): Task failed with unexpected error: Command failed with code 1
java.lang.RuntimeException: Command failed with code 1
at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runTask(ShOperatorFactory.java:143)
at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:315)
at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:257)
at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:137)
at io.digdag.core.agent.OperatorManager$$Lambda$253/0000000000000000.run(Unknown Source)
at io.digdag.core.agent.ExtractArchiveWorkspaceManager.withExtractedArchive(ExtractArchiveWorkspaceManager.java:77)
at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:135)
at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:119)
at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:127)
at io.digdag.core.agent.MultiThreadAgent$$Lambda$249/0000000000000000.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:821)
If we use --resume-from
digdag retry [id] --resume-from task2 --latest-revision
The workflow failed again, and operation in _error
parameter is called.
2020-07-04 12:40:24 +0900 [INFO] (XNIO-1 task-61): Starting a new session project id=1 workflow name=sample_1 session_time=2020-07-04T12:39:59+09:00
2020-07-04 12:40:24 +0900 [INFO] (0112@[0:sample]+sample_1+test+task2): sh>: sleep 2; exit 1
2020-07-04 12:40:26 +0900 [ERROR] (0112@[0:sample]+sample_1+test+task2): Task failed with unexpected error: Command failed with code 1
java.lang.RuntimeException: Command failed with code 1
at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runTask(ShOperatorFactory.java:143)
at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:315)
at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:257)
at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:137)
at io.digdag.core.agent.OperatorManager$$Lambda$253/0000000000000000.run(Unknown Source)
at io.digdag.core.agent.ExtractArchiveWorkspaceManager.withExtractedArchive(ExtractArchiveWorkspaceManager.java:77)
at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:135)
at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:119)
at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:127)
at io.digdag.core.agent.MultiThreadAgent$$Lambda$249/0000000000000000.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:821)
2020-07-04 12:40:26 +0900 [INFO] (0112@[0:sample]+sample_1^failure-alert): type: notify
2020-07-04 12:40:26 +0900 [INFO] (0110@[0:sample]+sample_1^error): echo>: here is _error configuration.
here is _error configuration.
Is this expected behavior? And should we always use --resume-from
if we want to receive some notification from the _error
parameter in retrying our workflows?