flyte icon indicating copy to clipboard operation
flyte copied to clipboard

[BUG] Task node overrides not handled in flyteidl are broken

Open fg91 opened this issue 2 years ago • 27 comments

Describe the bug

Let's take container_image as an example for a task node override:

@task
def foo() -> str:
    return "foo"


@workflow
def wf():
    foo().with_overrides(
        container_image="{{.image.default.fqn}}:" + "9410d0f0ed4a25577ab35a79bd3eb1119d8627d59c7a8fb947d42ca9fb46a61c"
    )

If one runs this workflow with pyflyte run, everything works as expected. If, however, one uses pyflyte register and runs it manually from the UI, the default image is used.


@workflow
def wf():
    foo().with_overrides(container_image="ubuntu:foo")
    foo().with_overrides(container_image="ubuntu:bar")

Another problem is that even if one uses pyflyte run, in this example both task nodes will use the image "ubuntu:bar".

Expected behavior

The reason for this behaviour is that only resource/extended resource overrides are treated as workflow-level task-node overrides which are registered separately from the task themselves and then overridden in the backend.

Other overrides such as container_image override the task-level metadata before the registration process. These can only be used once as "the last override wins" and appear to not work with pyflyte register.

The proper fix is to treat all overrides as task node overrides in flyteidl instead of overriding task-level metadata before registration.


  • [x] Container image
    • https://github.com/flyteorg/flyte/pull/4858
    • https://github.com/flyteorg/flytekit/pull/2176
  • [ ] ...

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

fg91 avatar Dec 07 '23 11:12 fg91

🤖

Hello @fg91,

I'm Dosu, an AI here to assist you with your issues and questions about flyte. While we're waiting for a human maintainer to respond, I'm here to provide preliminary help and guide you through the process. Let's work together to solve this bug!

Thank you for reporting this issue. I found a similar issue in the Flyte repository: [BUG] container_image override does not work with map_task. This issue was addressed in the following pull requests: Override the container image of map task and Allow contributors to self-assign issues with GitHub workflow.

However, I wasn't able to find specific details on how pyflyte run and pyflyte register handle the container_image override differently in the codebase. The underlying mechanism that allows container_image to be overridden in the Flyte codebase involves modifying the Image field of the CoreContainer struct in the flyteadmin package. This change will not affect running containers, only new ones that are created after the change.

Here is an example of how you might override the container_image:

container := flyteadmin.CoreContainer{
    Image: "new_image:tag",
    // other fields...
}

In this example, the container_image is overridden to be "new_image:tag". You can find more details in the flyteidl/gen/pb-go/flyteidl/service/flyteadmin/model_core_container.go file.

I hope this information is helpful. If you have more details or context that could help in understanding the issue better, please share.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] avatar Dec 07 '23 11:12 dosubot[bot]

As long as e.g. container_image is not treated as a task node override, one can work around the issues described above as follows:

@task
def foo() -> str:
    return "foo"


@dynamic(cache=True, cache_version="1")
def override_image(image: str) -> str:
    return foo().with_overrides(container_image=image)


@workflow
def wf():
    for image in ["ubuntu:foo", "ubuntu:bar", "ubuntu:baz"]:
        override_image(image=image)

This allows overriding the image multiple times and this also works with pyflyte register.

fg91 avatar Dec 08 '23 14:12 fg91

Follow-up issues

  • Add support overriding task config

pingsutw avatar Mar 05 '24 00:03 pingsutw

We're currently seeing this/similar with even basic image overrides in 1.11.0 (successful in 1.10.7):

from flytekit import task, workflow

@task(container_image='my-repo/flyte-evotorch:0.3.0')
def my_task():
    print("I'm a task!")

@workflow
def wf():
    my_task().with_overrides(container_image='my-repo/flyte-evotorch:0.0.1-xxxx')

Running via

pyflyte --verbose run --project ABC --remote wf

The task retains the @task(container_image=xxx) rather than using the .with_overrides specification

neilisaur avatar Apr 11 '24 21:04 neilisaur

We are also seeing this issue. Overriding container_image with a map_task worked in 1.10.7 but the override is not working in 1.11.0.

ai-rnatour avatar Apr 29 '24 18:04 ai-rnatour

We're looking into this.

@ai-rnatour , can you confirm which version of Flyte you're running? Are you using the helm charts or using specific versions of the docker images?

eapolinario avatar Apr 30 '24 00:04 eapolinario

@neilisaur, did you upgrade the backend server to 1.11 as well?

pingsutw avatar Apr 30 '24 05:04 pingsutw

Just tried for normal tasks as well as map tasks and can confirm that it works if both flytekit and the helm chart use version 1.11 upwards.

The reason for it seemingly breaking when upgrading flytekit from 1.10.x to 1.11 without upgrading the backend is that with flytekit 1.10.x, the overriding happens purely on the flytekit side with the limitation mentioned above that only the last override is honoured:

@workflow
def wf():
    foo().with_overrides(container_image="ubuntu:foo")
    foo().with_overrides(container_image="ubuntu:bar")  # wins

From flytekit 1.11 onwards, the container image override is properly sent to flyteadmin and overridden in the backend. If you upgrade flytekit but not the backend, the overriding does not happen in its broken form in flytekit but also not in the backend.

fg91 avatar Apr 30 '24 07:04 fg91

@pingsutw We're seeing this on 1.11.0 backend + 1.11.0 flytekit as well.

neilisaur avatar Apr 30 '24 14:04 neilisaur

@pingsutw We're seeing this on 1.11.0 backend + 1.11.0 flytekit as well.

This is peculiar 🤔 I see the following:

~ pip show flytekit
Name: flytekit
Version: 1.11.0
~ kubectl -n flyte get pod flytepropeller-6869c46cc6-llx69 -o yaml | grep image
    image: cr.flyte.org/flyteorg/flytepropeller-release:v1.11.0

Example workflow with map task (slightly adapted) from the docs:

@task
def detect_anomalies(data_point: int) -> bool:
    threshold=1
    return data_point > threshold

@workflow
def my_workflow() -> None:
    map_task(detect_anomalies)(data_point=[1, 2, 3]).with_overrides(container_image="ubuntu:foo")
~ kubectl get pod f92c1212999c245dabc7-n0-0-0 -o yaml | grep image
    image: ubuntu:foo

fg91 avatar Apr 30 '24 16:04 fg91

@neilisaur , @ai-rnatour , how are you verifying that the image is not being overridden?

eapolinario avatar Apr 30 '24 22:04 eapolinario

Just to expand on ^, if you're looking at the Task Template (in the flyteconsole task details tab) then you'll see the wrong info, as the container image (and other overridden fields) are only present at the workflow spec level.

In other words, in the backend we override the container image, but not in the task template. One could argue that this is misleading (and indeed is), but there's no regression in the functionality. We should open a separate gh issue to track this flyteconsole way of visualizing overrides.

eapolinario avatar Apr 30 '24 22:04 eapolinario

Before 1.11.0, the container image override wasn't actually treated properly as an override applied in the backend but the task template was overridden in flytekit before registration. Because of that, the override image was shown in the flyteconsole task details tab, potentially causing the confusion.

fg91 avatar Apr 30 '24 22:04 fg91

Just to expand on ^, if you're looking at the Task Template (in the flyteconsole task details tab) then you'll see the wrong info, as the container image (and other overridden fields) are only present at the workflow spec level.

In other words, in the backend we override the container image, but not in the task template. One could argue that this is misleading (and indeed is), but there's no regression in the functionality. We should open a separate gh issue to track this flyteconsole way of visualizing overrides.

@eapolinario In my case I checked the image on the pod using kubectl - suspected it was possible that the UI was bugged. I also know it is the wrong image because we use custom images for certain tasks and the runtime image didn't include any of the expected binaries.

ai-rnatour avatar Apr 30 '24 22:04 ai-rnatour

@ai-rnatour , can you confirm which version of propeller you're running? We should have been more explicit about this breaking change in flytekit. This change only works if the backend is also running >=1.11.0.

eapolinario avatar Apr 30 '24 22:04 eapolinario

@ai-rnatour , can you confirm which version of propeller you're running? We should have been more explicit about this breaking change in flytekit. This change only works if the backend is also running >=1.11.0.

I believe our propeller version is 1.10.7 based on the helm chart we're pulling in terraform

ai-rnatour avatar Apr 30 '24 23:04 ai-rnatour

That explains it. I'm going to amend the Flyte+flytekit 1.11.0 release notes. @ai-rnatour , can you update to a newer version of propeller and try again?

eapolinario avatar May 01 '24 23:05 eapolinario

I updated the release notes of both flytekit and Flyte 1.11.0 to mention the breaking change very prominently. Sorry for the confusion, @ai-rnatour !

I'm closing this issue. Let me know if this needs to be discussed again or more widely.

eapolinario avatar May 06 '24 18:05 eapolinario

Just to expand on ^, if you're looking at the Task Template (in the flyteconsole task details tab) then you'll see the wrong info, as the container image (and other overridden fields) are only present at the workflow spec level.

In other words, in the backend we override the container image, but not in the task template. One could argue that this is misleading (and indeed is), but there's no regression in the functionality. We should open a separate gh issue to track this flyteconsole way of visualizing overrides.

Sorry for the late reply, but just wanted to say that this seems to be what our user was seeing: incorrect info in flyteconsole, but the correct image on the pod on the cluster. Thank you!

neilisaur avatar May 08 '24 14:05 neilisaur

Unfortunately this issue isn't closed yet as the task_config override is still broken:

@task(
    task_config=Elastic(nnodes=1, nproc_per_node=2)
)
def foo():
    ...

@dynamic
def subwf():
    foo().with_overrides(task_config=Elastic(nnodes=2, nproc_per_node=2))

@workflow
def wf():
    subwf()

This example runs in a single pod as opposed to what the override states.


  • I tested all other overrides apart from aliases (no documentation for how this override is supposed to be used).
  • All others work apart from the cache, cache_version, ... ones which appear to not be implemented in the backend yet.

fg91 avatar May 23 '24 14:05 fg91