pipelines [bug] set_retry not working

Environment

How do you deploy Kubeflow Pipelines (KFP)?
KFP version: 2.0.0
KFP SDK version: 2.0.1

Steps to reproduce

You can run the following code to reproduce the issue.

from kfp import dsl


@dsl.component
def random_failure_op(exit_codes: str):
    """A component that fails randomly."""
    import random
    import sys

    exit_code = int(random.choice(exit_codes.split(",")))
    print(exit_code)
    sys.exit(exit_code)


@dsl.pipeline(
    name="retry-random-failures",
    description="The pipeline includes two steps which fail randomly. It shows how to use ContainerOp(...).set_retry(...).",
)
def retry_random_failures():
    op1 = random_failure_op(exit_codes="0,1,2,3").set_retry(10)
    op2 = random_failure_op(exit_codes="0,1").set_retry(5)

Expected result

The component should retry on failure, but it never does, not even once. In the pipeline spec I also see the following policy:

retryPolicy:
  backoffDuration: 0s
  backoffFactor: 2
  backoffMaxDuration: 3600s
  maxRetryCount: 10
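
To make the expectation concrete, here is a minimal local sketch of the retry behaviour the report expects from set_retry(num_retries): one initial attempt plus up to num_retries further attempts. It does not use KFP at all. The names flaky_step and run_with_retries are hypothetical, and reading maxRetryCount as "1 + num_retries attempts in total" is an assumption, not something the thread confirms about the backend.

```python
import random


def flaky_step(exit_codes: str, rng: random.Random) -> int:
    """Mimic random_failure_op: pick one exit code at random (0 means success)."""
    return int(rng.choice(exit_codes.split(",")))


def run_with_retries(exit_codes: str, num_retries: int, rng: random.Random):
    """Run flaky_step once, then retry up to num_retries times on failure.

    Returns a (succeeded, attempts_made) tuple.
    Assumption: total attempts = 1 + num_retries.
    """
    attempts = 0
    for _ in range(1 + num_retries):
        attempts += 1
        if flaky_step(exit_codes, rng) == 0:
            return True, attempts
    return False, attempts


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so repeated runs behave the same
    ok, attempts = run_with_retries("0,1,2,3", num_retries=10, rng=rng)
    print(f"succeeded={ok} after {attempts} attempt(s)")
```

If the backend honoured the policy in this way, a task that fails on its first attempt would show more than one attempt in the run, which is what the report says never happens.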

Materials and reference

This is the documentation I referred to: https://kubeflow-pipelines.readthedocs.io/en/latest/source/dsl.html#kfp.dsl.PipelineTask.set_retry

Labels

/area backend

Impacted by this bug? Give it a 👍.

Aug 31 '23 12:08 dugarsumitcheck24

In KFP 2.0.1, the policy argument of set_retry is no longer present.

Old Code

    def set_retry(self,
                  num_retries: int,
                  policy: Optional[str] = None,
                  backoff_duration: Optional[str] = None,
                  backoff_factor: Optional[float] = None,
                  backoff_max_duration: Optional[str] = None):
        """Sets the number of times the task is retried until it's declared
        failed.

        Args:
          num_retries: Number of times to retry on failures.
          policy: Retry policy name.
          backoff_duration: The time interval between retries. Defaults to an
            immediate retry. In case you specify a simple number, the unit
            defaults to seconds. You can also specify a different unit, for
            instance, 2m (2 minutes), 1h (1 hour).
          backoff_factor: The exponential backoff factor applied to
            backoff_duration. For example, if backoff_duration="60"
            (60 seconds) and backoff_factor=2, the first retry will happen
            after 60 seconds, then after 120, 240, and so on.
          backoff_max_duration: The maximum interval that can be reached with
            the backoff strategy.
        """
        if policy is not None and policy not in ALLOWED_RETRY_POLICIES:
            raise ValueError('policy must be one of: %r' %
                             (ALLOWED_RETRY_POLICIES,))

        self.num_retries = num_retries
        self.retry_policy = policy
        self.backoff_factor = backoff_factor
        self.backoff_duration = backoff_duration
        self.backoff_max_duration = backoff_max_duration
        return self

New Code

    def set_retry(self,
                  num_retries: int,
                  backoff_duration: Optional[str] = None,
                  backoff_factor: Optional[float] = None,
                  backoff_max_duration: Optional[str] = None) -> 'PipelineTask':
        """Sets task retry parameters.

        Args:
            num_retries : Number of times to retry on failure.
            backoff_duration: Number of seconds to wait before triggering a retry. Defaults to ``'0s'`` (immediate retry).
            backoff_factor: Exponential backoff factor applied to ``backoff_duration``. For example, if ``backoff_duration="60"`` (60 seconds) and ``backoff_factor=2``, the first retry will happen after 60 seconds, then again after 120, 240, and so on. Defaults to ``2.0``.
            backoff_max_duration: Maximum duration during which the task will be retried. Maximum duration is 1 hour (3600s). Defaults to ``'3600s'``.

        Returns:
            Self return to allow chained setting calls.
        """
        self._task_spec.retry_policy = structures.RetryPolicy(
            max_retry_count=num_retries,
            backoff_duration=backoff_duration,
            backoff_factor=backoff_factor,
            backoff_max_duration=backoff_max_duration,
        )
        return self
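
The docstrings above describe an exponential backoff. As a rough illustration of the arithmetic only, the sketch below computes the wait before each retry from backoff_duration, backoff_factor and num_retries. The helper name backoff_schedule is hypothetical, and durations are plain seconds rather than strings such as '60s'. Treating backoff_max_duration as a cap on each individual wait follows the old docstring; the new docstring words it differently, so the backend may behave otherwise.

```python
from typing import List


def backoff_schedule(num_retries: int,
                     backoff_duration: float = 0.0,
                     backoff_factor: float = 2.0,
                     backoff_max_duration: float = 3600.0) -> List[float]:
    """Return the wait in seconds before each retry (hypothetical helper).

    Assumption: backoff_max_duration caps each individual wait.
    """
    waits = []
    for retry_index in range(num_retries):
        wait = backoff_duration * (backoff_factor ** retry_index)
        waits.append(min(wait, backoff_max_duration))
    return waits


if __name__ == "__main__":
    # The docstring's example: a 60 second base wait with factor 2.
    print(backoff_schedule(4, backoff_duration=60))  # [60.0, 120.0, 240.0, 480.0]
    # The defaults seen in the compiled spec (0s base) mean immediate retries.
    print(backoff_schedule(3))  # [0.0, 0.0, 0.0]
```

With the default backoff_duration of 0s, every computed wait is zero regardless of the factor, so under these assumptions the policy shown in the issue asks for immediate retries.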

FYI @chensun

Sep 02 '23 12:09 ketangangal

retry-random-failures-xgqdv-1470953391   0/2   Completed   0   69s
retry-random-failures-xgqdv-2197860257   0/2   Error       0   59s
retry-random-failures-xgqdv-2342167526   0/2   Completed   0   69s
retry-random-failures-xgqdv-3239894960   0/2   Error       0   58s
retry-random-failures-xgqdv-611022813    0/2   Completed   0   79s

Sep 02 '23 13:09 ketangangal

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Dec 02 '23 07:12 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Mar 01 '24 07:03 github-actions[bot]

@dugarsumitcheck24 Faced with the same problem, did you manage to fix it?

Jul 25 '24 06:07 reuksv

@reuksv: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen @dugarsumitcheck24 Faced with the same problem, did you manage to fix it?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jul 25 '24 06:07 google-oss-prow[bot]

@dugarsumitcheck24 Faced with the same problem, did you manage to fix it?

@reuksv I haven't tried again recently, so I'm not sure whether the newer versions fix it.

Jul 25 '24 07:07 dugarsumit

The component should retry on failure, but it never does, not even once. In the pipeline spec I also see the following policy:
retryPolicy:
  backoffDuration: 0s
  backoffFactor: 2
  backoffMaxDuration: 3600s
  maxRetryCount: 10

I faced the same problem and also found the same policy in the generated IR YAML.

Environment

How did you deploy Kubeflow Pipelines (KFP)? Kubeflow manifests standalone deployment (ref)
KFP version: 2.2.0
KFP SDK version: kfp 2.8.0, kfp-kubernetes 1.2.0, kfp-pipeline-spec 0.3.0, kfp-server-api 2.0.5
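
One way to double-check what the compiler emitted is to search the parsed pipeline spec for every retryPolicy entry. The sketch below assumes the IR YAML has already been loaded into nested dicts and lists (for example with a YAML parser). find_retry_policies is a hypothetical helper, and sample_spec is a hand-written stand-in: only the retryPolicy values come from this thread, while the surrounding keys are illustrative and not a real compiled pipeline.

```python
from typing import Any, Iterator, Tuple


def find_retry_policies(node: Any, path: str = "") -> Iterator[Tuple[str, dict]]:
    """Yield (path, policy) for every 'retryPolicy' key in a nested structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else str(key)
            if key == "retryPolicy" and isinstance(value, dict):
                yield child_path, value
            else:
                yield from find_retry_policies(value, child_path)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_retry_policies(item, f"{path}[{index}]")


# Hand-written stand-in for a parsed spec (illustrative layout only).
sample_spec = {
    "tasks": {
        "random-failure-op": {
            "retryPolicy": {
                "backoffDuration": "0s",
                "backoffFactor": 2,
                "backoffMaxDuration": "3600s",
                "maxRetryCount": 10,
            }
        }
    }
}

if __name__ == "__main__":
    for location, policy in find_retry_policies(sample_spec):
        print(location, "->", policy["maxRetryCount"])
```

Finding the policy this way only confirms that the SDK wrote it into the spec; it says nothing about whether the backend acts on it, which is the open question in this issue.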

Aug 01 '24 06:08 JasonNS1425

/reopen

Aug 04 '24 19:08 gregsheremeta

@gregsheremeta: Reopened this issue.

In response to this:

/reopen

Aug 04 '24 19:08 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Oct 05 '24 07:10 github-actions[bot]

/remove-lifecycle stale

Oct 07 '24 17:10 HumairAK

Are you testing this locally? I don't think set_retry works with local execution (see limitations). With kfp==2.7.0, it works for me when I submit the same code to Vertex AI Pipelines.

Oct 15 '24 12:10 ianbenlolo

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Dec 15 '24 07:12 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Jan 06 '25 07:01 github-actions[bot]

/reopen

Feb 25 '25 19:02 ntny

@ntny: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Feb 25 '25 19:02 google-oss-prow[bot]

@ntny you need to update your PR.

Feb 28 '25 08:02 juliusvonkohout