[Feature request] Exit code on failure

Open tonykim-moloco opened this issue 2 years ago • 5 comments

Feature Request

I would like to propose a feature for getting the exit code of the chief pod via TFJob. I know that TFJob already decides internally whether to restart, based on restartPolicy and the exit code. Can we expose that information via TFJob? We currently want to handle restarts externally, and it is hard to get that information from the TFJob CRD.

tonykim-moloco, Feb 01 '23 18:02

@tonykim-moloco What info do you need via TFJob? Can you elaborate?

johnugeorge, May 17 '23 11:05

TFJob currently exposes only a wrapped status: Succeeded | Created | Running | Failed. However, in failure cases, I would like to know the pod's exit code to understand why it failed. For example, if a worker pod crashed due to OOM (exit code 137), I would like to look up the TFJob and see that the exit code was 137 on one of the workers.
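
Until something like this is exposed on the TFJob itself, the exit code can be read from the pods' own status. The following is a minimal sketch, not part of the training operator: it assumes a Pod object already fetched as a plain dict (for example from `kubectl get pod <name> -o json`), and the function name `terminated_exit_codes` and the sample pod are hypothetical. The fields it reads (`status.containerStatuses[].state.terminated.exitCode` and `lastState.terminated`) are part of the standard Kubernetes Pod status.

```python
from typing import Any, Dict, List


def terminated_exit_codes(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect exit codes of terminated containers from a Pod object given as a dict."""
    results = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        # Check the current state first, then the previous state,
        # which is populated after a container has been restarted.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated:
                results.append({
                    "pod": pod.get("metadata", {}).get("name"),
                    "container": cs.get("name"),
                    "exit_code": terminated.get("exitCode"),
                    "reason": terminated.get("reason"),
                })
                break
    return results


# Hypothetical worker pod that was OOM-killed (illustrative data only).
example_pod = {
    "metadata": {"name": "my-tfjob-worker-0"},
    "status": {
        "containerStatuses": [
            {
                "name": "tensorflow",
                "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
            }
        ]
    },
}

for entry in terminated_exit_codes(example_pod):
    print(entry)
```

The pods belonging to one job would still have to be found first, for instance by a label selector on the job name; which labels the operator sets depends on its version, so that step is left out here.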

tonykim-moloco, May 17 '23 21:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot], Aug 23 '23 20:08

Probably, we can support this request once we introduce batch/Job.

#1718

tenzen-y, Aug 24 '23 13:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot], Nov 22 '23 20:11