[BUG] Task node overrides not handled in flyteidl are broken
Describe the bug
Let's take container_image as an example of a task node override:
@task
def foo() -> str:
    return "foo"

@workflow
def wf():
    foo().with_overrides(
        container_image="{{.image.default.fqn}}:" + "9410d0f0ed4a25577ab35a79bd3eb1119d8627d59c7a8fb947d42ca9fb46a61c"
    )
If one runs this workflow with pyflyte run, everything works as expected. If, however, one uses pyflyte register and runs it manually from the UI, the default image is used.
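For reference, a minimal sketch of the two invocation paths being compared (the file path example.py is a placeholder, not taken from this issue):

pyflyte run --remote example.py wf    # the container_image override is honoured
pyflyte register example.py           # launching wf from the UI afterwards falls back to the default image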
@workflow
def wf():
    foo().with_overrides(container_image="ubuntu:foo")
    foo().with_overrides(container_image="ubuntu:bar")
Another problem is that even if one uses pyflyte run, in this example both task nodes will use the image "ubuntu:bar".
Expected behavior
The reason for this behaviour is that only resource/extended-resource overrides are treated as workflow-level task-node overrides, which are registered separately from the tasks themselves and then applied in the backend.
Other overrides such as container_image modify the task-level metadata before registration. They can therefore only be used once ("the last override wins") and appear not to work with pyflyte register.
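For contrast, a minimal sketch of an override that already takes the node-level path today; the resource values are illustrative only:

from flytekit import Resources, task, workflow

@task
def foo() -> str:
    return "foo"

@workflow
def wf():
    # Resource requests are stored as task-node overrides in the workflow spec,
    # so each invocation keeps its own values and pyflyte register honours them.
    foo().with_overrides(requests=Resources(cpu="1", mem="500Mi"))
    foo().with_overrides(requests=Resources(cpu="2", mem="1Gi"))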
The proper fix is to treat all overrides as task-node overrides in flyteidl instead of overriding task-level metadata before registration.
- [x] Container image
- https://github.com/flyteorg/flyte/pull/4858
- https://github.com/flyteorg/flytekit/pull/2176
- [ ] ...
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
🤖
Hello @fg91,
I'm Dosu, an AI here to assist you with your issues and questions about flyte. While we're waiting for a human maintainer to respond, I'm here to provide preliminary help and guide you through the process. Let's work together to solve this bug!
Thank you for reporting this issue. I found a similar issue in the Flyte repository: [BUG] container_image override does not work with map_task. This issue was addressed in the following pull requests: Override the container image of map task and Allow contributors to self-assign issues with GitHub workflow.
However, I wasn't able to find specific details on how pyflyte run and pyflyte register handle the container_image override differently in the codebase. The underlying mechanism that allows container_image to be overridden in the Flyte codebase involves modifying the Image field of the CoreContainer struct in the flyteadmin package. This change will not affect running containers, only new ones that are created after the change.
Here is an example of how you might override the container_image:
container := flyteadmin.CoreContainer{
    Image: "new_image:tag",
    // other fields...
}
In this example, the container_image is overridden to be "new_image:tag". You can find more details in the flyteidl/gen/pb-go/flyteidl/service/flyteadmin/model_core_container.go file.
I hope this information is helpful. If you have more details or context that could help in understanding the issue better, please share.
Sources
- [BUG] container_image override does not work with map_task
- flyteidl/gen/pb-go/flyteidl/service/flyteadmin/model_core_container.go
As long as e.g. container_image is not treated as a task node override, one can work around the issues described above as follows:
@task
def foo() -> str:
    return "foo"

@dynamic(cache=True, cache_version="1")
def override_image(image: str) -> str:
    return foo().with_overrides(container_image=image)

@workflow
def wf():
    for image in ["ubuntu:foo", "ubuntu:bar", "ubuntu:baz"]:
        override_image(image=image)
This allows overriding the image multiple times, and it also works with pyflyte register.
Follow-up issues
- Add support for overriding task config
We're currently seeing this (or something similar) with even basic image overrides in 1.11.0 (this worked in 1.10.7):
from flytekit import task, workflow

@task(container_image='my-repo/flyte-evotorch:0.3.0')
def my_task():
    print("I'm a task!")

@workflow
def wf():
    my_task().with_overrides(container_image='my-repo/flyte-evotorch:0.0.1-xxxx')
Running via
pyflyte --verbose run --project ABC --remote wf
The task retains the @task(container_image=xxx) image rather than using the .with_overrides specification.
We are also seeing this issue. Overriding container_image with a map_task worked in 1.10.7 but the override is not working in 1.11.0.
We're looking into this.
@ai-rnatour, can you confirm which version of Flyte you're running? Are you using the helm charts or specific versions of the docker images?
@neilisaur, did you upgrade the backend server to 1.11 as well?
Just tried this for normal tasks as well as map tasks and can confirm that it works if both flytekit and the helm chart use version 1.11 or higher.
The reason it seemingly breaks when upgrading flytekit from 1.10.x to 1.11 without upgrading the backend is that, with flytekit 1.10.x, the overriding happens purely on the flytekit side, with the limitation mentioned above that only the last override is honoured:
@workflow
def wf():
    foo().with_overrides(container_image="ubuntu:foo")
    foo().with_overrides(container_image="ubuntu:bar")  # wins
From flytekit 1.11 onwards, the container image override is properly sent to flyteadmin and applied in the backend. If you upgrade flytekit but not the backend, the override no longer happens (in its broken form) in flytekit, but it also doesn't happen in the backend.
@pingsutw We're seeing this on 1.11.0 backend + 1.11.0 flytekit as well.
This is peculiar 🤔 I see the following:
~ pip show flytekit
Name: flytekit
Version: 1.11.0
~ kubectl -n flyte get pod flytepropeller-6869c46cc6-llx69 -o yaml | grep image
image: cr.flyte.org/flyteorg/flytepropeller-release:v1.11.0
Example workflow with map task (slightly adapted) from the docs:
from flytekit import map_task, task, workflow

@task
def detect_anomalies(data_point: int) -> bool:
    threshold = 1
    return data_point > threshold

@workflow
def my_workflow() -> None:
    map_task(detect_anomalies)(data_point=[1, 2, 3]).with_overrides(container_image="ubuntu:foo")
~ kubectl get pod f92c1212999c245dabc7-n0-0-0 -o yaml | grep image
image: ubuntu:foo
@neilisaur, @ai-rnatour, how are you verifying that the image is not being overridden?
Just to expand on ^, if you're looking at the Task Template (in the flyteconsole task details tab) then you'll see the wrong info, as the container image (and other overridden fields) are only present at the workflow spec level.
In other words, in the backend we override the container image, but not in the task template. One could argue that this is misleading (and indeed it is), but there's no regression in functionality. We should open a separate gh issue to track this flyteconsole way of visualizing overrides.
Before 1.11.0, the container image override wasn't actually treated as an override applied in the backend; instead, the task template was overridden in flytekit before registration. Because of that, the overridden image was shown in the flyteconsole task details tab, potentially causing the confusion.
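A rough way to see where the override actually lives from the CLI; the project, domain, workflow/task names, and version below are placeholders, and the exact flags may differ per flytectl version:

flytectl get workflows -p flytesnacks -d development my_workflow --version v1 -o yaml | grep -i image    # override appears in the workflow spec
flytectl get tasks -p flytesnacks -d development detect_anomalies --version v1 -o yaml | grep -i image   # task template still shows the registered image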
@eapolinario In my case I checked the image on the pod using kubectl - suspected it was possible that the UI was bugged. I also know it is the wrong image because we use custom images for certain tasks and the runtime image didn't include any of the expected binaries.
@ai-rnatour, can you confirm which version of propeller you're running? We should have been more explicit about this breaking change in flytekit. This change only works if the backend is also running >=1.11.0.
I believe our propeller version is 1.10.7 based on the helm chart we're pulling in terraform
That explains it. I'm going to amend the Flyte + flytekit 1.11.0 release notes. @ai-rnatour, can you update to a newer version of propeller and try again?
I updated the release notes of both flytekit and Flyte 1.11.0 to mention the breaking change very prominently. Sorry for the confusion, @ai-rnatour!
I'm closing this issue. Let me know if this needs to be discussed again or more widely.
Sorry for the late reply, but just wanted to say that this seems to be what our user was seeing: incorrect info in flyteconsole, but the correct image on the pod on the cluster. Thank you!
Unfortunately this issue isn't closed yet as the task_config override is still broken:
from flytekit import dynamic, task, workflow
from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(nnodes=1, nproc_per_node=2)
)
def foo():
    ...

@dynamic
def subwf():
    foo().with_overrides(task_config=Elastic(nnodes=2, nproc_per_node=2))

@workflow
def wf():
    subwf()
This example runs in a single pod as opposed to what the override states.
- I tested all other overrides apart from aliases (there is no documentation for how this override is supposed to be used).
- All others work apart from the cache, cache_version, ... ones, which appear to not be implemented in the backend yet (see the sketch below).
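For reference, a minimal sketch of the kind of cache override that did not take effect in these tests; the task and the values are illustrative assumptions, not taken from this thread:

@workflow
def wf():
    # Accepted by flytekit's with_overrides, but reportedly not yet applied by the backend:
    foo().with_overrides(cache=True, cache_version="2")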