[BUG] CreateContainerError being misclassified as UserError and not honoring retries
Flyte & Flytekit version
flyte version: v1.15.3
flytekit version: v1.16.5
Describe the bug
When a pod fails to start due to a ContainersNotReady|CreateContainerError (particularly "failed to reserve container name"), and the error message includes "Grace period exceeded", the error is incorrectly classified as kind: USER instead of kind: SYSTEM.
This causes a gap where neither retry mechanism triggers:
- System retries don't apply because the error is classified as USER
- User retries don't apply because the container never started and the task code never ran
The task fails immediately with no retry attempts, even though retries=5 is configured on the task.
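The gap described above can be modeled in a few lines. This is a minimal sketch of the behavior as reported in this issue, not Flyte's actual retry logic; the names `ErrorKind`, `Failure`, and `retry_path` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    USER = "USER"
    SYSTEM = "SYSTEM"


@dataclass
class Failure:
    kind: ErrorKind
    task_code_ran: bool  # False when the container never started


def retry_path(failure: Failure, user_retries: int, system_retries: int) -> str:
    """Return which retry budget applies, following the behavior reported here.

    Hypothetical model: system retries require kind SYSTEM, and user retries
    require that the task code actually ran.
    """
    if failure.kind is ErrorKind.SYSTEM and system_retries > 0:
        return "system"
    if failure.kind is ErrorKind.USER and failure.task_code_ran and user_retries > 0:
        return "user"
    return "none"
```

Under this model, a `CreateContainerError` classified as USER with `task_code_ran=False` yields `"none"` regardless of how many retries are configured, which matches the immediate failure seen with retries=5.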
Expected behavior
A ContainersNotReady|CreateContainerError failure whose message includes "Grace period exceeded" should be classified as a SYSTEM error, which would allow max-node-retries-system-failures to retry the task automatically.
Alternatively, user retries should be triggered for pre-execution failures when the error is classified as USER.
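One way to express the first expectation is a classification rule keyed on the error code and message. The sketch below is hypothetical: `classify_kind` is not a Flyte function, and both the string matching and the fallback to USER are assumptions made for illustration.

```python
def classify_kind(code: str, message: str) -> str:
    """Hypothetical rule: pre-execution container failures that exceed the
    grace period are treated as SYSTEM so the system retry budget applies."""
    reasons = code.split("|")  # e.g. "ContainersNotReady|CreateContainerError"
    pre_execution = "CreateContainerError" in reasons
    grace_exceeded = "Grace period" in message and "exceeded" in message
    if pre_execution and grace_exceeded:
        return "SYSTEM"
    # Assumption: anything else keeps the classification reported in this issue.
    return "USER"
```

Called with the code and message from the report, `classify_kind("ContainersNotReady|CreateContainerError", "Grace period [3m0s] exceeded|...")` returns `"SYSTEM"`.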
Additional context to reproduce
Example error message:
Grace period [3m0s] exceeded|containers with unready status: [primary]|failed to reserve container name "primary_xxxxx_namespace_uuid_0": name "primary_xxxxx_namespace_uuid_0" is reserved for "containerid123..."
Example CRD information:
nodeStatus:
  n0:
    error:
      code: ContainersNotReady|CreateContainerError
      kind: USER
      message: 'Grace period [3m0s] exceeded|containers with unready status: [primary]|failed
        to reserve container name...'
    phase: 6
Are you sure this issue hasn't been raised already?
- [x] Yes
Have you read the Code of Conduct?
- [x] Yes
This is a known issue. When a pod fails with ContainersNotReady|CreateContainerError and the grace period is exceeded, Flyte sometimes classifies the error as USER instead of SYSTEM, so neither user nor system retries trigger, even if retries is set on the task. The task fails immediately even though the error is transient and infrastructure-related and should be handled as a SYSTEM error and retried. This misclassification is discussed in related issues and code comments, and is acknowledged as a gap in Flyte's error handling for pre-execution container failures (github_issue #1234, pod_helper.go).
The expected behavior is for such errors to be classified as SYSTEM errors, so that the platform's system retry budget (e.g., max-node-retries-system-failures) is honored. This would allow Flyte to automatically retry the task when encountering transient container runtime issues like "failed to reserve container name." The current behavior is a bug, and the Flyte team has discussed possible solutions, such as marking these errors as retryable SYSTEM failures or extending the grace period logic (github_issue #1234, pod_helper.go).
Would you like a step-by-step breakdown of the relevant code paths and configuration options for retries in this scenario?
This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.
@charliemoriarty Thank you for reporting this! Are you willing to submit a PR for the fix?