
[bug] {{workflow.status}} and {{workflow.failures}} Behavior in KFP v2 dsl.ExitHandler versus v1

Open tom-pavz opened this issue 1 year ago • 10 comments

Environment

  • How do you deploy Kubeflow Pipelines (KFP)? deployKF
  • KFP version: 2.1.0
  • KFP SDK version: 2.7.0

Steps to reproduce

We use dsl.ExitHandler plus Argo Workflow's Workflow Variables to automatically report failed pipeline runs to Slack. This was working well in KFP v1, but it no longer works in v2.

I have a simple hello-world pipeline with the following fail component wrapped in a dsl.ExitHandler whose exit_task is post_msg_to_slack_on_pipeline_fail.

from kfp.dsl import ExitHandler, component, pipeline


@component
def hello_world():
    print("hello world")


@component
def fail():
    import sys

    sys.exit(1)


@component
def post_msg_to_slack_on_pipeline_fail():
    import json

    # In KFP v1, Argo substitutes these workflow variables at runtime.
    if "{{workflow.status}}" == "Failed":
        json_object = json.loads('{{workflow.failures}}'[1:-1])
        json_formatted_str = json.dumps(json_object, indent=2)
        # send alert to slack


@pipeline
def pipeline():
    post_msg_to_slack_on_pipeline_fail_task = post_msg_to_slack_on_pipeline_fail()

    with ExitHandler(exit_task=post_msg_to_slack_on_pipeline_fail_task):
        hello_task = hello_world()
        fail_task = fail()
        fail_task.after(hello_task)
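
For completeness, a pipeline like this can be compiled with the standard v2 compiler before uploading (the output path below is illustrative):

from kfp import compiler

# Compile the pipeline above to an IR YAML that can be uploaded to KFP.
compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.yaml")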

Starting in KFP v2, at the time the exit_task runs, {{workflow.status}} is substituted with Running, and {{workflow.failures}} is not substituted at all, so the literal string {{workflow.failures}} is printed instead. It seems that in KFP v2 the failure of the task does not propagate up to the pipeline itself by the time the exit_task is running. In addition, I notice that the pipeline itself has an "Executed Successfully" status in the UI (see the screenshots of the DAG and sub-DAG below), even though one of its tasks failed, which does not seem correct to me. I also notice that in the UI the exit-handler-1 sub-DAG is stuck in "Running" status, which also seems incorrect.

Expected result

The expected result, and the behavior we get with KFP v1, is that at the time the exit_task runs, {{workflow.status}} is substituted with Failed and {{workflow.failures}} is substituted with a JSON document describing the failed tasks of the pipeline, which we can then send to Slack.

Materials and reference

Screenshots of the pipeline and its DAGs in the UI: [Screenshot 2024-06-17 at 12:40:38 PM] [Screenshot 2024-06-17 at 12:40:51 PM]

I also attempted to use dsl.PipelineTaskFinalStatus in a dsl.ExitHandler as detailed here, but when trying to run the pipeline I get the following error.

{"error":"Internal error: compiling DAG \"root\": failed to add implicit deps: task final status not supported yet\nFailed to compile job

I also saw this issue filed about the same thing, but it was never really addressed, I suppose because it was created in the wrong repo.


Impacted by this bug? Give it a 👍.

tom-pavz avatar Jun 17 '24 16:06 tom-pavz

This version of KFP came from a fork of the distribution, so I'm not sure how much it was in sync with the actual KFP 2.1.0. We also released KFP 2.2.0, which is where we upgraded to Argo 3.4.x (which should, at least in part, resolve the issue in question). Please test with the latest KFP community release and let us know if the issue is fixed.

rimolive avatar Jun 17 '24 23:06 rimolive

@rimolive for reference, the fork is functionally identical to Kubeflow Pipelines 2.1.0; the only changes relate to object store authentication for KFP v1.

The first issue raised by @tom-pavz may have been fixed in Kubeflow Pipelines 2.2.0, but that seems unlikely given that I don't see any changes to the ExitHandler in 2.2.0.

Because @tom-pavz gave a very simple example, it should be straightforward to check on a vanilla Kubeflow Pipelines 2.2.0; perhaps @tom-pavz could even do that.


However, the second issue (using dsl.PipelineTaskFinalStatus in a dsl.ExitHandler) is definitely still not supported in 2.2.0; I can see the exact error string on this line:

  • https://github.com/kubeflow/pipelines/blob/2.2.0/backend/src/v2/compiler/argocompiler/dag.go#L484

thesuperzapper avatar Jun 18 '24 01:06 thesuperzapper

@rimolive @thesuperzapper Thanks for the responses! I am not super confident about how to test an upgrade to 2.2.0 on my cluster without potentially introducing other unrelated issues. However, if no changes were made to the ExitHandler from 2.1.0 to 2.2.0, can't we assume that the strange behavior I reported is still happening?

As for dsl.PipelineTaskFinalStatus, I'm pretty surprised that this is broken in current KFP versions, as it has been part of the KFP v2 documentation for so long. How has no one else noticed that it is just not supported at all?

tom-pavz avatar Jun 18 '24 14:06 tom-pavz

@tom-pavz, regarding testing, I was thinking you could just deploy the raw manifests (not deployKF) on a testing cluster (even a local one) to confirm whether the issue is still present.

Although, since Kubeflow 1.9.0 is not technically out, there is no way to use KFP 2.2.0 from the manifests yet (except with the release candidates).

But yes, it is very likely that both issues still exist in KFP 2.2.0.

thesuperzapper avatar Jun 18 '24 16:06 thesuperzapper

Hi @rimolive, we use the latest KFP version (2.2.0) and the latest SDK (2.7.0), and I can confirm that the issue exists. The ExitHandler somehow masks failures and the KFP UI reports SUCCESS for failed Runs, so users have to check the execution status of each Run's tasks themselves.
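
A rough sketch of what that per-task check can look like via the SDK client (the host, run id, and exact attribute names below are assumptions based on kfp / kfp-server-api 2.x and may need adjusting):

from kfp import Client

# Hypothetical host and run id; replace with your own values.
client = Client(host="http://ml-pipeline-ui.kubeflow:8080")
run = client.get_run(run_id="<run-id>")

# Even when the Run itself reports SUCCEEDED, individual tasks may have
# failed, so inspect the per-task states in the run details.
task_details = run.run_details.task_details or []
failed_tasks = [t.display_name for t in task_details if t.state == "FAILED"]

if failed_tasks:
    print("Run contained failed tasks:", failed_tasks)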

roytman avatar Jun 19 '24 07:06 roytman

Is the KFP community planning on fixing this in a future release now that it has been confirmed a couple of ways that it is indeed an issue? Seems like a regression in the ExitHandler functionality from v1 to v2.

Also, I think https://github.com/kubeflow/kubeflow/issues/7422 is definitely a real issue, albeit created in the wrong place. Should another issue be created in this repo for it, or should we use this issue for both of these issues?

tom-pavz avatar Jul 22 '24 20:07 tom-pavz

/assign HumairAK

HumairAK avatar Aug 01 '24 22:08 HumairAK

Okay, dug into this a bit.

tldr: we should try to use Argo Lifecycle Hooks here instead of having the exit_task be another sibling task to the ExitHandler.

Terminology:

  • ExitHandler: consists of a DAG of a group of tasks (the task group) and an exit_task
  • exit_task: the task that is invoked after the group of tasks exits

There seem to be 2 user concerns here:

  1. Incorrect run status reporting for Runs using ExitHandler: the pipeline's overall run status inaccurately reports Succeeded when using an ExitHandler that includes a failing task but a succeeding exit_task.
  2. Unable to detect ExitHandler status in exit_task: there is no way to access the status of the failed/errored tasks in the exit handler's task group. This should be satisfied by PipelineTaskFinalStatus, but that currently does not work in KFP.

I think it makes sense to keep both in one issue, however, as they are very much related.

1. Incorrect run status reporting for Runs using ExitHandler

In KFP v2 we do not use Argo's onExit for the KFP ExitHandler. Based on the OP, it seems we used to do this in KFP v1. During Argo compilation in KFP v2, using an ExitHandler produces another DAG, with the exit_task compiled as a task that depends on this ExitHandler DAG.

The following illustrates what the exit_task's driver task's depends field looks like:

...
- arguments:
    parameters: ...
  depends: exit-handler-1.Succeeded || exit-handler-1.Skipped || exit-handler-1.Failed || exit-handler-1.Errored
  name: post-msg-driver # exit_task
  template: system-container-driver
...

As you can see, the depends field causes this task to execute as soon as the exit-handler DAG finishes, regardless of how it finished. In Argo today, this results in the workflow succeeding if the remaining downstream tasks (post-msg in this case) succeed.

In the future this may be configurable.

There is a workaround suggested in the linked thread, but it requires us to match the exit status of the exit_task with that of the tasks in the exit handler's task group. I don't think we want to go this route.

Suggested Solutions:

I can see 2 ways we could resolve this.

One is to go back to using onExit. On today's KFP community meeting call, @chensun suggested that one of the reasons we did not go with Argo's onExit in v2 was that Argo didn't support multiple exit handlers. I suspect another reason might have been that it also didn't support passing parameters. This has since changed: Argo has introduced a more robust exit handler via a mechanism known as Lifecycle Hooks. Hooks should resolve most if not all of these concerns, so my suggestion is that we first attempt to use this approach. Note that hooks can be implemented at the Argo template level.

If we can't use hooks, another option is to adjust the pipeline server logic to be more "intelligent" in how we evaluate the Pipeline Run's condition; currently we just look at the Argo Workflow's phase field (i.e. here).

2. Unable to detect ExitHandler status in exit_task:

The variables {{workflow.status}} and {{workflow.failures}} are specific to Argo's exit handlers.

We no longer want Argo implementation details bleeding into pipelines, as pipelines ought to be engine-agnostic, so we should provide better alternatives to the Argo exit handler variables. We should instead work on supporting dsl.PipelineTaskFinalStatus.

Suggested Solutions: Once again we might be able to leverage Argo lifecycle hooks as an ExitHandler; then we can populate PipelineTaskFinalStatus internally with some combination of {{workflow.status}}, {{workflow.failures}}, and task output parameters, without requiring the user to do this at the SDK stage.

If we don't go with hooks, then we could maybe leverage task.*.status and some combination of task input/output parameters to fill out the PipelineTaskFinalStatus fields.

My suggested path forward is to first attempt to use a lifecycle hook here instead of treating the exit_task as yet another task within the parent DAG. This should allow us to resolve both concerns.

@chensun / @zijianjoy let me know if anything obvious stands out to you as a reason we should not try to use hooks here.

HumairAK avatar Aug 14 '24 21:08 HumairAK

Link to KFP community meeting where this issue was discussed

gregsheremeta avatar Aug 15 '24 10:08 gregsheremeta

tldr: we should try to use Argo Lifecycle Hooks here instead of having the exit_task be another sibling task to the ExitHandler.

Thanks for the great writeup above. This approach makes sense to me.

gregsheremeta avatar Aug 15 '24 10:08 gregsheremeta

We encounter the same issue here, which is blocking our move from KFP v1 to v2.

KFP version: 2.2.0
Python packages: kfp 2.8.0, kfp-kubernetes 1.2.0, kfp-pipeline-spec 0.3.0, kfp-server-api 2.0.5

It would be great if this were solved!
I see that this issue was added to the KFP 2.4.0 milestone, which is due by January 15, 2025. When is this release expected to be available?

MarkTNO avatar Nov 04 '24 10:11 MarkTNO

@MarkTNO how can we find out whether a KFP pipeline run was successful in v1?

vishal-MLE avatar Nov 17 '24 14:11 vishal-MLE

@vishal-MLE you mean via the Python KFP SDK?
Like so: if "{{workflow.status}}" == "Succeeded":
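
A minimal v1-style sketch of that check (the component name is illustrative; note that Argo only substitutes {{workflow.status}} in the exit handler's task):

from kfp.components import create_component_from_func


def check_run_status():
    # Argo substitutes {{workflow.status}} into the container spec at runtime,
    # so this literal string compares against the real workflow phase
    # (only when this component is used as the exit handler's task).
    if "{{workflow.status}}" == "Succeeded":
        print("run succeeded")
    else:
        print("run did not succeed")


check_run_status_op = create_component_from_func(check_run_status)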

MarkTNO avatar Nov 18 '24 15:11 MarkTNO

There seem to be 2 user concerns here:

  1. Incorrect run status reporting for Runs using ExitHandler: the pipeline's overall run status inaccurately reports Succeeded when using an ExitHandler that includes a failing task but a succeeding exit_task.

Created a separate issue for this concern: https://github.com/kubeflow/pipelines/issues/11405

DharmitD avatar Nov 25 '24 17:11 DharmitD

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 11 '25 07:03 github-actions[bot]

Just commenting so the issue does not go stale! I do see some activity on this from early this year; wondering what version of Kubeflow / KFP we can expect fixes for these in?

tom-pavz avatar Mar 11 '25 15:03 tom-pavz