
[BUG] Flyte array plugin fails with "object has been modified"

Open pablocasares opened this issue 4 months ago • 3 comments

Describe the bug

Flyte array plugin tasks fail, apparently because the pod is modified externally, so the pod information held by FlytePropeller no longer matches the actual object in the cluster.

The error is:

Workflow[ingestion-pipeline:production:ingestion_pipeline.ingestion.ingestion_workflow] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: failed at Node[n1]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [k8s-array]: Operation cannot be fulfilled on pods "kac2sxex6kvectdvx3vk-n3-0-n1-0-1195": the object has been modified; please apply your changes to the latest version and try again

The system appears to retry 50 times, but I think the updated pod is never fetched from the cluster again. If so, every retry hits the same conflict and retrying 50 times will not help.

Maybe there's a missing Pod.Get() around these lines: https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/array/k8s/subtask.go#L108-L141

resourceToFinalize seems to always be an empty skeleton because no Get operation is performed on the actual pod.
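To illustrate why I think the retries cannot succeed, here is a minimal, stdlib-only sketch of optimistic concurrency. It is not Flyte's code and does not use the Kubernetes client: `FakeCluster`, `Pod`, `ClearFinalizersStale`, `ClearFinalizersWithRefresh`, and the finalizer name are all hypothetical stand-ins, and the sketch assumes the failure is a plain resource-version conflict on Update.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the API server's conflict response.
var ErrConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// Pod is a hypothetical, stripped-down stand-in for a Kubernetes pod.
type Pod struct {
	Name            string
	ResourceVersion int
	Finalizers      []string
}

// FakeCluster is an in-memory stand-in for the API server (assumption, not a real client).
type FakeCluster struct {
	pods map[string]Pod
}

func NewFakeCluster() *FakeCluster {
	return &FakeCluster{pods: map[string]Pod{}}
}

// Put writes a pod unconditionally, as an external controller might, and bumps its version.
func (c *FakeCluster) Put(p Pod) {
	p.ResourceVersion = c.pods[p.Name].ResourceVersion + 1
	p.Finalizers = append([]string(nil), p.Finalizers...)
	c.pods[p.Name] = p
}

// Get returns a copy of the latest stored pod.
func (c *FakeCluster) Get(name string) (Pod, error) {
	p, ok := c.pods[name]
	if !ok {
		return Pod{}, fmt.Errorf("pod %q not found", name)
	}
	p.Finalizers = append([]string(nil), p.Finalizers...)
	return p, nil
}

// Update succeeds only when the caller's ResourceVersion matches the stored one.
func (c *FakeCluster) Update(p Pod) error {
	cur, ok := c.pods[p.Name]
	if !ok {
		return fmt.Errorf("pod %q not found", p.Name)
	}
	if cur.ResourceVersion != p.ResourceVersion {
		return ErrConflict
	}
	p.ResourceVersion++
	p.Finalizers = append([]string(nil), p.Finalizers...)
	c.pods[p.Name] = p
	return nil
}

// ClearFinalizersStale retries Update with the same cached object and never re-reads it.
func ClearFinalizersStale(c *FakeCluster, cached Pod, attempts int) error {
	cached.Finalizers = nil
	var err error
	for i := 0; i < attempts; i++ {
		if err = c.Update(cached); !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return err
}

// ClearFinalizersWithRefresh re-reads the pod before every attempt.
func ClearFinalizersWithRefresh(c *FakeCluster, name string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		var latest Pod
		if latest, err = c.Get(name); err != nil {
			return err
		}
		latest.Finalizers = nil
		if err = c.Update(latest); !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return err
}
```

In this sketch, once something else has written the pod after it was cached, `ClearFinalizersStale` returns `ErrConflict` no matter how many attempts it gets, while `ClearFinalizersWithRefresh` succeeds on its first attempt unless another write lands between its Get and Update. With the real client, the analogous pattern would be a Get inside the retried function, for example with `retry.RetryOnConflict` from `k8s.io/client-go/util/retry`. Whether that fits the plugin's code path is something I have not verified.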

Expected behavior

If the pod changes externally, the plugin should detect that and refresh the object between retries.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

pablocasares, Apr 23 '24 14:04

The issue you're encountering with the Flyte array plugin, where tasks fail after the pod is modified externally, relates to how FlytePropeller handles Kubernetes resources. This problem typically arises when the same Kubernetes resource is updated concurrently, for example by different controllers or by manual intervention.

When a pod is modified externally, the version stored in FlytePropeller becomes outdated, which leads to conflicts. The error message indicates that the system tried to apply changes based on an outdated version of the pod, which Kubernetes rejects to prevent unintended overwrites.

A possible solution is to fetch the latest pod state from the Kubernetes API before each retry, so that FlytePropeller operates on the current state of the pod. This could be done by adding a Pod.Get() operation, as you suggested, to refresh the pod's state before attempting retries or updates.

This issue is not explicitly covered in the provided data sources, so neither the implementation details nor this behavior in Flyte's codebase can be verified from them. However, the general behavior of Kubernetes and the error pattern you described match common problems in systems that update Kubernetes resources under concurrent modification.

Would you like more detailed guidance on how to potentially implement this fix or further investigate this issue in the Flyte codebase?


runllm[bot], Apr 23 '24 14:04

Maybe we should enable finalizers.

kumare3, Apr 24 '24 03:04

We are already setting https://github.com/flyteorg/flyte/blob/876999ca9c64025315b52d2d5b04195c530fa3df/flytepropeller/propeller-config.yaml#L48 in our setup.

Would the get() operation I mentioned help there?

pablocasares, Apr 25 '24 12:04