[bug] set_retry not working
Environment
- How do you deploy Kubeflow Pipelines (KFP)?
- KFP version: 2.0.0
- KFP SDK version: 2.0.1
Steps to reproduce
You can run the following code to reproduce the issue.
from kfp import dsl


@dsl.component
def random_failure_op(exit_codes: str):
    """A component that fails randomly."""
    import random
    import sys

    exit_code = int(random.choice(exit_codes.split(",")))
    print(exit_code)
    sys.exit(exit_code)


@dsl.pipeline(
    name="retry-random-failures",
    description="The pipeline includes two steps which fail randomly. It shows how to use ContainerOp(...).set_retry(...).",
)
def retry_random_failures():
    op1 = random_failure_op(exit_codes="0,1,2,3").set_retry(10)
    op2 = random_failure_op(exit_codes="0,1").set_retry(5)
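To put the repro in perspective, here is a rough sketch (not part of the original report) of how often each task would still end in failure if retries were honored. It assumes every attempt is independent, that only exit code 0 counts as success, and that num_retries=N allows N+1 attempts in total. The helper name prob_all_attempts_fail is hypothetical.

```python
from fractions import Fraction


def prob_all_attempts_fail(exit_codes: str, num_retries: int) -> Fraction:
    """Probability that the first attempt and every retry all exit non-zero.

    Assumes independent attempts and that num_retries=N means N+1 attempts.
    """
    codes = [int(c) for c in exit_codes.split(",")]
    # Fraction of the listed exit codes that signal failure (anything but 0).
    p_fail = Fraction(sum(1 for c in codes if c != 0), len(codes))
    return p_fail ** (num_retries + 1)


if __name__ == "__main__":
    # The two tasks from the repro pipeline.
    for codes, retries in (("0,1,2,3", 10), ("0,1", 5)):
        p = prob_all_attempts_fail(codes, retries)
        print(f"exit_codes={codes!r} retries={retries}: {float(p):.4f}")
```

Under these assumptions, op1 would end in failure roughly 4% of the time and op2 under 2% of the time, compared with 75% and 50% for a single attempt. Seeing tasks fail on most runs is therefore consistent with no retries taking place.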
Expected result
The component should retry on failure, but it never does, not even once. The pipeline spec nevertheless contains the following retry policy:
retryPolicy:
  backoffDuration: 0s
  backoffFactor: 2
  backoffMaxDuration: 3600s
  maxRetryCount: 10
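For anyone who wants to confirm the same thing in their own compiled spec, the following is a minimal sketch that searches a nested dict for retryPolicy entries. It assumes the spec has already been loaded into a dict (for example with PyYAML's yaml.safe_load). The sample_spec layout and the task name are hypothetical stand-ins, and only the retryPolicy values are taken from the report above.

```python
from typing import Any, Dict, Iterator, Tuple


def find_retry_policies(
    node: Any, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (dotted path, policy) for every 'retryPolicy' mapping in a nested spec."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "retryPolicy" and isinstance(value, dict):
                # Report the path of the mapping that owns the policy.
                yield ".".join(path), value
            else:
                yield from find_retry_policies(value, path + (str(key),))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_retry_policies(item, path + (str(index),))


# Hypothetical, heavily trimmed stand-in for a compiled spec. The nesting and
# the task name are assumptions; only the retryPolicy values come from the report.
sample_spec = {
    "root": {
        "dag": {
            "tasks": {
                "random-failure-op": {
                    "retryPolicy": {
                        "backoffDuration": "0s",
                        "backoffFactor": 2,
                        "backoffMaxDuration": "3600s",
                        "maxRetryCount": 10,
                    }
                }
            }
        }
    }
}

if __name__ == "__main__":
    for task_path, policy in find_retry_policies(sample_spec):
        print(task_path, policy["maxRetryCount"])
```

A non-empty result only shows that the SDK wrote the policy into the spec. It says nothing about whether the backend acts on it, which is the behavior in question here.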
Materials and reference
This is the documentation I referred - https://kubeflow-pipelines.readthedocs.io/en/latest/source/dsl.html#kfp.dsl.PipelineTask.set_retry
Labels
/area backend
Impacted by this bug? Give it a 👍.
In KFP SDK 2.0.1, set_retry no longer takes a policy argument.
Old Code
def set_retry(self,
              num_retries: int,
              policy: Optional[str] = None,
              backoff_duration: Optional[str] = None,
              backoff_factor: Optional[float] = None,
              backoff_max_duration: Optional[str] = None):
    """Sets the number of times the task is retried until it's declared
    failed.

    Args:
        num_retries: Number of times to retry on failures.
        policy: Retry policy name.
        backoff_duration: The time interval between retries. Defaults to an
            immediate retry. In case you specify a simple number, the unit
            defaults to seconds. You can also specify a different unit, for
            instance, 2m (2 minutes), 1h (1 hour).
        backoff_factor: The exponential backoff factor applied to
            backoff_duration. For example, if backoff_duration="60"
            (60 seconds) and backoff_factor=2, the first retry will happen
            after 60 seconds, then after 120, 240, and so on.
        backoff_max_duration: The maximum interval that can be reached with
            the backoff strategy.
    """
    if policy is not None and policy not in ALLOWED_RETRY_POLICIES:
        raise ValueError('policy must be one of: %r' %
                         (ALLOWED_RETRY_POLICIES,))

    self.num_retries = num_retries
    self.retry_policy = policy
    self.backoff_factor = backoff_factor
    self.backoff_duration = backoff_duration
    self.backoff_max_duration = backoff_max_duration
    return self
New Code
def set_retry(self,
              num_retries: int,
              backoff_duration: Optional[str] = None,
              backoff_factor: Optional[float] = None,
              backoff_max_duration: Optional[str] = None) -> 'PipelineTask':
    """
    Args:
        num_retries: Number of times to retry on failure.
        backoff_duration: Number of seconds to wait before triggering a
            retry. Defaults to ``'0s'`` (immediate retry).
        backoff_factor: Exponential backoff factor applied to
            ``backoff_duration``. For example, if ``backoff_duration="60"``
            (60 seconds) and ``backoff_factor=2``, the first retry will
            happen after 60 seconds, then again after 120, 240, and so on.
            Defaults to ``2.0``.
        backoff_max_duration: Maximum duration during which the task will be
            retried. Maximum duration is 1 hour (3600s). Defaults to
            ``'3600s'``.

    Returns:
        Self return to allow chained setting calls.
    """
    self._task_spec.retry_policy = structures.RetryPolicy(
        max_retry_count=num_retries,
        backoff_duration=backoff_duration,
        backoff_factor=backoff_factor,
        backoff_max_duration=backoff_max_duration,
    )
    return self
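As an illustration of the backoff arithmetic both docstrings describe, here is a small sketch that turns the set_retry arguments into a list of wait times. It is not KFP code: parse_duration and backoff_schedule are hypothetical helpers, only the s, m and h suffixes are handled, and backoff_max_duration is assumed to cap each individual wait (the reading in the older docstring), which may not match what the backend does.

```python
from typing import List

# Suffixes mentioned in the docstrings above; anything else is out of scope here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value: str) -> float:
    """Parse strings such as '60', '0s', '2m' or '1h' into seconds.

    A bare number is treated as seconds, as the older docstring describes.
    """
    value = value.strip()
    if value and value[-1] in _UNIT_SECONDS:
        return float(value[:-1]) * _UNIT_SECONDS[value[-1]]
    return float(value)


def backoff_schedule(num_retries: int,
                     backoff_duration: str = "0s",
                     backoff_factor: float = 2.0,
                     backoff_max_duration: str = "3600s") -> List[float]:
    """Return the wait in seconds before each retry.

    Assumption: backoff_max_duration caps each individual wait.
    """
    base = parse_duration(backoff_duration)
    cap = parse_duration(backoff_max_duration)
    return [min(base * backoff_factor ** attempt, cap)
            for attempt in range(num_retries)]


if __name__ == "__main__":
    # The docstring example: 60 seconds doubled on every retry.
    print(backoff_schedule(4, backoff_duration="60"))
    # The defaults seen in the reported spec.
    print(backoff_schedule(10))
```

With the defaults from the reported spec (a 0s backoff), every wait is zero, so the retries would be expected to start immediately rather than be delayed.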
FYI @chensun
retry-random-failures-xgqdv-1470953391   0/2   Completed   0   69s
retry-random-failures-xgqdv-2197860257   0/2   Error       0   59s
retry-random-failures-xgqdv-2342167526   0/2   Completed   0   69s
retry-random-failures-xgqdv-3239894960   0/2   Error       0   58s
retry-random-failures-xgqdv-611022813    0/2   Completed   0   79s
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
@dugarsumitcheck24 I'm facing the same problem. Did you manage to fix it?
@reuksv: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen @dugarsumitcheck24 Faced with the same problem, did you manage to fix it?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@reuksv I haven't tried again recently, so I'm not sure whether the newer versions fix it.
The component should retry on failure but it never does not even once. In the pipeline spec I also see the following policy
retryPolicy:
  backoffDuration: 0s
  backoffFactor: 2
  backoffMaxDuration: 3600s
  maxRetryCount: 10
I'm facing the same problem, and I also found the same policy in the generated IR YAML.
Environment
- How did you deploy Kubeflow Pipelines (KFP)? Kubeflow manifests standalone deployment (ref)
- KFP version: 2.2.0
- KFP SDK version: kfp 2.8.0, kfp-kubernetes 1.2.0, kfp-pipeline-spec 0.3.0, kfp-server-api 2.0.5
/reopen
@gregsheremeta: Reopened this issue.
In response to this:
/reopen
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Are you testing this locally? I don't think set_retry works with local executions (see limitations). With kfp==2.7.0, the same code works for me when I submit it to Vertex Pipelines.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
@ntny: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
@ntny you need to update your PR.