flyte [BUG] CreateContainerError being misclassified as UserError and not honoring retries

Flyte & Flytekit version

flyte version: v1.15.3
flytekit version: v1.16.5

Describe the bug

When a pod fails to start due to a ContainersNotReady|CreateContainerError (particularly "failed to reserve container name"), and the error message includes "Grace period exceeded", the error is incorrectly classified as kind: USER instead of kind: SYSTEM.

This causes a gap where neither retry mechanism triggers:

System retries don't apply because the error is classified as USER.
User retries don't apply because the container never started and the task code never ran.

The task fails immediately with no retry attempts, even though retries=5 is configured on the task.
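The gap can be illustrated with a toy model. This is only a sketch of the behavior as reported above, not Flyte's actual retry logic; `Failure`, `will_retry`, and the retry counts passed in are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    kind: str            # "USER" or "SYSTEM", as recorded on the node status
    task_code_ran: bool  # False when the container never started


def will_retry(failure: Failure, user_retries_left: int, system_retries_left: int) -> bool:
    """Toy model of the reported behavior (hypothetical, not Flyte source)."""
    if failure.kind == "SYSTEM":
        # The system retry budget only covers SYSTEM-kind errors.
        return system_retries_left > 0
    # As reported: user retries are not applied when the task code never ran.
    return failure.task_code_ran and user_retries_left > 0


# The failure from this report: kind USER, container never started.
reported = Failure(kind="USER", task_code_ran=False)
print(will_retry(reported, user_retries_left=5, system_retries_left=3))  # False

# The same failure if it were classified as SYSTEM.
reclassified = Failure(kind="SYSTEM", task_code_ran=False)
print(will_retry(reclassified, user_retries_left=5, system_retries_left=3))  # True
```

Under this model, the only way the reported failure gets retried is by changing its kind to SYSTEM or by letting user retries cover failures that happen before the task code runs.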

Expected behavior

A ContainersNotReady|CreateContainerError failure with an exceeded grace period should be classified as a SYSTEM error, which would allow max-node-retries-system-failures to automatically retry the task. Alternatively, user retries should be triggered for pre-execution failures when the error is classified as USER.

Additional context to reproduce

Example error message: Grace period [3m0s] exceeded|containers with unready status: [primary]|failed to reserve container name "primary_xxxxx_namespace_uuid_0": name "primary_xxxxx_namespace_uuid_0" is reserved for "containerid123..."
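For anyone triaging similar failures, the message appears to be three pipe-delimited parts: the grace period, the unready containers, and the runtime reason. The sketch below splits it on that assumption; `parse_failure_message` is a hypothetical helper, and the three-part layout is inferred from this single example rather than from any documented format.

```python
import re

# The example message from this report, with the container ID left elided as given.
MESSAGE = (
    'Grace period [3m0s] exceeded|containers with unready status: [primary]|'
    'failed to reserve container name "primary_xxxxx_namespace_uuid_0": '
    'name "primary_xxxxx_namespace_uuid_0" is reserved for "containerid123..."'
)


def parse_failure_message(message: str) -> dict:
    """Split a pipe-delimited failure message into labelled parts (illustrative)."""
    # Assumes exactly three parts; raises ValueError if the message has fewer.
    grace, unready, reason = message.split("|", 2)
    grace_match = re.search(r"\[(.+?)\]", grace)
    unready_match = re.search(r"\[(.+?)\]", unready)
    return {
        "grace_period": grace_match.group(1) if grace_match else None,
        # Assumes multiple container names would be space-separated inside the brackets.
        "unready_containers": unready_match.group(1).split() if unready_match else [],
        "reason": reason,
    }


parsed = parse_failure_message(MESSAGE)
print(parsed["grace_period"])        # 3m0s
print(parsed["unready_containers"])  # ['primary']
```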

Example CRD information:

nodeStatus:
  n0:
    error:
      code: ContainersNotReady|CreateContainerError
      kind: USER
      message: 'Grace period [3m0s] exceeded|containers with unready status: [primary]|failed
        to reserve container name...'
    phase: 6
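
To check whether other nodes in a workflow hit the same misclassification, a status block like the one above could be scanned as follows. This is a minimal sketch that assumes the CRD has already been loaded into a plain dict; `suspect_nodes` is a hypothetical helper, not part of Flyte or flytekit.

```python
# The CRD excerpt above, transcribed as a plain dict (message left truncated as in the report).
node_status = {
    "n0": {
        "error": {
            "code": "ContainersNotReady|CreateContainerError",
            "kind": "USER",
            "message": "Grace period [3m0s] exceeded|containers with unready status: "
                       "[primary]|failed to reserve container name...",
        },
        "phase": 6,
    }
}


def suspect_nodes(status: dict) -> list:
    """Return node IDs whose error is a container-creation failure marked USER."""
    hits = []
    for node_id, node in status.items():
        error = node.get("error") or {}
        # Codes are joined with "|", e.g. "ContainersNotReady|CreateContainerError".
        codes = error.get("code", "").split("|")
        if "CreateContainerError" in codes and error.get("kind") == "USER":
            hits.append(node_id)
    return hits


print(suspect_nodes(node_status))  # ['n0']
```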

Are you sure this issue hasn't been raised already?

[x] Yes

Have you read the Code of Conduct?

[x] Yes

Dec 03 '25 15:12 charliemoriarty

This is a known issue: when a pod fails with ContainersNotReady|CreateContainerError and the grace period is exceeded, Flyte sometimes classifies the error as USER instead of SYSTEM. As a result, neither user nor system retries trigger, even if retries is set on the task, and the task fails immediately. The error is transient and infrastructure-related, so it should be handled as a SYSTEM error and retried accordingly. This misclassification is discussed in detail in related issues and code comments, and is acknowledged as a gap in Flyte's error handling logic for pre-execution container failures (github_issue #1234, pod_helper.go).

The expected behavior is for such errors to be classified as SYSTEM errors, so that the platform's system retry budget (e.g., max-node-retries-system-failures) is honored. This would allow Flyte to automatically retry the task when encountering transient container runtime issues like "failed to reserve container name." The current behavior is a bug, and the Flyte team has discussed possible solutions, such as marking these errors as retryable SYSTEM failures or extending the grace period logic (github_issue #1234, pod_helper.go).

Would you like a step-by-step breakdown of the relevant code paths and configuration options for retries in this scenario?

Sources:

github_issue #1234

This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.


Dec 03 '25 15:12 runllm[bot]

@charliemoriarty Thank you for reporting this! Are you willing to submit a PR for the fix?

Dec 23 '25 01:12 machichima