
Add yaml examples for trainer

Open kannon92 opened this issue 7 months ago • 16 comments

Hello,

Your jupyter notebook examples are really handy but I'd like to programmatically create a trainer job from YAML.

Could we also have examples of jobs that work out of the box? I wanted to play around with submitting a train job via kubectl, but I couldn't find any examples of the ClusterTrainingRuntime or TrainJob CRDs.

Kevin

kannon92 avatar Aug 03 '25 19:08 kannon92

Sure, we have a few examples of TrainJob YAML in the operator guides:

  • https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/#new-trainjob-v2
  • https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime/#example-of-clustertrainingruntime

Maybe in the future, we can add more guides (they could be operator guides) for users who are familiar with Kubernetes and YAML. For example, how to configure PodSpecOverrides or set up labels for Kueue/Volcano.

However, since the primary interface for users (e.g. AI Practitioners) should be the Python SDK, I want to avoid having YAML examples under the examples/ directory.

Thoughts @kubeflow/kubeflow-trainer-team @astefanutti @kannon92 @kramaranya ?

andreyvelich avatar Aug 04 '25 15:08 andreyvelich

However, since the primary interface for users (e.g. AI Practitioners) should be the Python SDK, I want to avoid having YAML examples under the examples/ directory.

Agree with @andreyvelich. I find the operator guide is sufficient already for Platform Admins, but we could definitely enhance it with more advanced configurations like Gang-scheduling, PodSpecOverrides and queue integration you mentioned. Maybe update it with complete kubectl workflows -- from applying ClusterTrainingRuntimes to submitting TrainJobs and checking status.

Also, the Pipeline Framework documentation (https://github.com/kubeflow/website/pull/4039) will be really useful for showing Platform Admins how to extend plugins and handle orchestration for different ML frameworks.

kramaranya avatar Aug 04 '25 16:08 kramaranya
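For readers who want the kubectl workflow discussed above, here is a minimal sketch that builds a TrainJob manifest as a Python dict and prints it as JSON, which kubectl accepts in place of YAML. The `apiVersion`, `runtimeRef` and `trainer` field names follow the TrainJob examples in the linked operator guides as best I recall them, so treat them as assumptions and check them against the installed CRD. The runtime name `torch-distributed` and the helper `build_trainjob` are hypothetical; the job name, image and command are taken from the pod description later in this thread. The optional `kueue.x-k8s.io/queue-name` label is the standard Kueue queue label, but whether Kueue manages TrainJob objects depends on the Kueue version in the cluster.

```python
import json


def build_trainjob(name, runtime_name, image, command, num_nodes=1, queue_name=None):
    """Return a TrainJob manifest as a plain dict (hypothetical helper)."""
    metadata = {"name": name}
    if queue_name is not None:
        # Standard Kueue queue label; assumes Kueue is installed and
        # configured to manage this kind of workload.
        metadata["labels"] = {"kueue.x-k8s.io/queue-name": queue_name}
    return {
        # Assumed API group/version for Kubeflow Trainer v2.
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": metadata,
        "spec": {
            # Name of an existing ClusterTrainingRuntime (assumption).
            "runtimeRef": {"name": runtime_name},
            "trainer": {
                "numNodes": num_nodes,
                "image": image,
                "command": list(command),
            },
        },
    }


manifest = build_trainjob(
    name="pytorch-simple",
    runtime_name="torch-distributed",
    image="docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727",
    command=["python3", "/opt/pytorch-mnist/mnist.py", "--epochs=1"],
    num_nodes=2,
)
trainjob_json = json.dumps(manifest, indent=2)
print(trainjob_json)
```

Assuming the script is saved as `trainjob.py` (a hypothetical file name), the output can be piped into `kubectl apply -f -`, and the job status can then be inspected with `kubectl get trainjob pytorch-simple`.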

I didn't think to check those docs for this. I was trying to find an example in the repo..

doh!

kannon92 avatar Aug 04 '25 17:08 kannon92

/close

kannon92 avatar Aug 06 '25 01:08 kannon92

@kannon92: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Aug 06 '25 01:08 google-oss-prow[bot]

Could some samples / examples be added into the manifests directory?

astefanutti avatar Aug 06 '25 07:08 astefanutti

We can talk about it if the content in the operator guides is not sufficient. I would like us to avoid maintaining example images, since they slow down CI and releases.

andreyvelich avatar Aug 06 '25 16:08 andreyvelich

I was just hoping to find simple hello-world examples without training..

kannon92 avatar Aug 06 '25 16:08 kannon92
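A hello-world setup of the kind requested here, with no training code and no GPU, might look like the sketch below. It is not taken from the project's docs: the runtime name `hello-world`, the `busybox` image and the echo command are all hypothetical, and the runtime layout (a JobSet-style `replicatedJobs` entry named `node`, a container named `node`, and the `trainer.kubeflow.org/trainjob-ancestor-step` label) is my recollection of the ClusterTrainingRuntime example in the operator guides, so verify it against the installed CRDs before relying on it. Both objects are wrapped in a `v1` `List` so they can be applied in one step.

```python
import json

# Assumed ClusterTrainingRuntime layout; names and image are hypothetical.
hello_runtime = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "ClusterTrainingRuntime",
    "metadata": {"name": "hello-world"},
    "spec": {
        "mlPolicy": {"numNodes": 1},
        "template": {
            "spec": {
                "replicatedJobs": [
                    {
                        "name": "node",
                        "template": {
                            "metadata": {
                                "labels": {
                                    # Marks this job as the trainer step (assumption).
                                    "trainer.kubeflow.org/trainjob-ancestor-step": "trainer"
                                }
                            },
                            "spec": {
                                "template": {
                                    "spec": {
                                        "containers": [
                                            {
                                                "name": "node",
                                                "image": "busybox:1.36",
                                                "command": ["sh", "-c", "echo hello from TrainJob"],
                                            }
                                        ]
                                    }
                                }
                            },
                        },
                    }
                ]
            }
        },
    },
}

# TrainJob that only references the runtime and overrides nothing.
hello_job = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "metadata": {"name": "hello-world"},
    "spec": {"runtimeRef": {"name": "hello-world"}},
}

# kubectl accepts a v1 List, so both objects go in with a single apply.
bundle = {"apiVersion": "v1", "kind": "List", "items": [hello_runtime, hello_job]}
bundle_json = json.dumps(bundle, indent=2)
print(bundle_json)
```

If the layout matches the installed version, piping this output into `kubectl apply -f -` should create one pod whose log contains the echoed line. ClusterTrainingRuntime is cluster-scoped, so applying it typically needs admin permissions.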

I think we should create a dedicated guide in the operator guides that shows how to use TrainJob YAML. PodSpecOverrides would be a nice candidate, since we need to explain this API.

andreyvelich avatar Aug 06 '25 16:08 andreyvelich
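As a rough illustration of what such a guide could cover, the sketch below adds a PodSpecOverrides entry that mounts a PersistentVolumeClaim into the trainer container. The field names (`podSpecOverrides`, `targetJobs`, `volumes`, `containers`) are my recollection of the v1alpha1 TrainJob API and may be named or shaped differently in other releases, so confirm them with `kubectl explain trainjob.spec` first. The helper `with_pvc_override`, the claim name `user-123-volume`, the volume name `workspace` and the mount path are hypothetical.

```python
import copy
import json


def with_pvc_override(trainjob, claim_name, mount_path, target_job="node", container="node"):
    """Return a copy of a TrainJob dict with a PVC mounted via podSpecOverrides."""
    patched = copy.deepcopy(trainjob)  # leave the caller's manifest untouched
    override = {
        # Which replicated job of the runtime the override applies to (assumption).
        "targetJobs": [{"name": target_job}],
        "volumes": [
            {"name": "workspace", "persistentVolumeClaim": {"claimName": claim_name}}
        ],
        "containers": [
            {
                "name": container,
                "volumeMounts": [{"name": "workspace", "mountPath": mount_path}],
            }
        ],
    }
    patched["spec"].setdefault("podSpecOverrides", []).append(override)
    return patched


base_job = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "metadata": {"name": "pytorch-simple"},
    "spec": {"runtimeRef": {"name": "torch-distributed"}},
}

patched_job = with_pvc_override(base_job, claim_name="user-123-volume", mount_path="/workspace")
patched_json = json.dumps(patched_job, indent=2)
print(patched_json)
```

Working on a deep copy keeps the base manifest reusable, so several variants (different claims, service accounts, or node selectors) can be generated from one template.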

I'm GPU-poor..

kannon92 avatar Aug 06 '25 16:08 kannon92

/reopen

kannon92 avatar Aug 06 '25 16:08 kannon92

@kannon92: Reopened this issue.

In response to this:

/reopen


google-oss-prow[bot] avatar Aug 06 '25 16:08 google-oss-prow[bot]

It's important to provide examples for different training frameworks, as they help newcomers get familiar with how to use the project.

xigang avatar Aug 08 '25 08:08 xigang

Sure! We would like to add more examples as part of: https://github.com/kubeflow/trainer/issues/2040. We already have examples for DeepSpeed, Torch, and MLX.

However, we want our examples to be Python-focused, since AI practitioners can quickly replicate them without understanding kubectl or Kubernetes concepts.

@xigang Would you like to contribute some examples?

andreyvelich avatar Aug 08 '25 16:08 andreyvelich

Hi, I followed the TrainJob (v2) example and tried to test it in a namespace with istio-injection=disabled. All my pods are running, but I do not see any logs or output like I would expect from a PyTorchJob (v1). I exec'd into the node container and tried to run the mnist.py script. It gets stuck at "Using CUDA". Do you have any suggestions?

kubectl describe pod pytorch-simple-node-0-0-7lbsc -n cs-test
Name:             pytorch-simple-node-0-0-7lbsc
Namespace:        cs-test
Priority:         0
Service Account:  default
Node:             node-10-239-0-9/10.239.0.9
Start Time:       Sat, 01 Nov 2025 23:42:44 +0100
Labels:           batch.kubernetes.io/controller-uid=476425e2-e91f-458c-bdbc-c99a2f9401e2
                  batch.kubernetes.io/job-completion-index=0
                  batch.kubernetes.io/job-name=pytorch-simple-node-0
                  controller-uid=476425e2-e91f-458c-bdbc-c99a2f9401e2
                  job-name=pytorch-simple-node-0
                  jobset.sigs.k8s.io/global-replicas=1
                  jobset.sigs.k8s.io/group-name=default
                  jobset.sigs.k8s.io/group-replicas=1
                  jobset.sigs.k8s.io/job-global-index=0
                  jobset.sigs.k8s.io/job-group-index=0
                  jobset.sigs.k8s.io/job-index=0
                  jobset.sigs.k8s.io/job-key=feb9ce412eaec2cc441c0d2105a5b9cdf066c4ba
                  jobset.sigs.k8s.io/jobset-name=pytorch-simple
                  jobset.sigs.k8s.io/jobset-uid=c741565d-75dc-40de-8fab-6517e2bba03f
                  jobset.sigs.k8s.io/replicatedjob-name=node
                  jobset.sigs.k8s.io/replicatedjob-replicas=1
                  jobset.sigs.k8s.io/restart-attempt=0
Annotations:      batch.kubernetes.io/job-completion-index: 0
                  jobset.sigs.k8s.io/global-replicas: 1
                  jobset.sigs.k8s.io/group-name: default
                  jobset.sigs.k8s.io/group-replicas: 1
                  jobset.sigs.k8s.io/job-global-index: 0
                  jobset.sigs.k8s.io/job-group-index: 0
                  jobset.sigs.k8s.io/job-index: 0
                  jobset.sigs.k8s.io/job-key: feb9ce412eaec2cc441c0d2105a5b9cdf066c4ba
                  jobset.sigs.k8s.io/jobset-name: pytorch-simple
                  jobset.sigs.k8s.io/jobset-uid: c741565d-75dc-40de-8fab-6517e2bba03f
                  jobset.sigs.k8s.io/replicatedjob-name: node
                  jobset.sigs.k8s.io/replicatedjob-replicas: 1
                  jobset.sigs.k8s.io/restart-attempt: 0
Status:           Running
IP:               192.168.4.11
IPs:
  IP:           192.168.4.11
  IP:           fc00:1000::469
Controlled By:  Job/pytorch-simple-node-0
Containers:
  node:
    Container ID:  containerd://7f31241b1dc278d366212bf9136b16cbbe5c2be2b3eb01c29c9289e1ffb6bbdc
    Image:         docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
    Image ID:      docker-ecr002.rnd.gic.ericsson.se/kubeflowkatib/pytorch-mnist@sha256:5164399299fc6ceebcdfa0df5b303a2d63c05776188f55a336c5d3514a4e3227
    Port:          29500/TCP
    Host Port:     0/TCP
    Command:
      python3
      /opt/pytorch-mnist/mnist.py
      --epochs=1
    State:          Running
      Started:      Sat, 01 Nov 2025 23:42:45 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      PET_NNODES:            2
      PET_NPROC_PER_NODE:    1
      PET_NODE_RANK:          (v1:metadata.annotations['batch.kubernetes.io/job-completion-index'])
      PET_MASTER_ADDR:       pytorch-simple-node-0-0.pytorch-simple
      PET_MASTER_PORT:       29500
      JOB_COMPLETION_INDEX:   (v1:metadata.labels['batch.kubernetes.io/job-completion-index'])
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stmjj (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-stmjj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  Normal  Scheduled  47s   default-scheduler  Successfully assigned cs-test/pytorch-simple-node-0-0-7lbsc to node-10-239-0-9
  Normal  Pulled     47s   kubelet            Container image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" already present on machine
  Normal  Created    47s   kubelet            Created container: node
  Normal  Started    47s   kubelet            Started container node

[injhlfnlax@node-10-210-152-99 trainer]$
[injhlfnlax@node-10-210-152-99 trainer]$
[injhlfnlax@node-10-210-152-99 trainer]$
[injhlfnlax@node-10-210-152-99 trainer]$ kubectl get pods -n cs-test
NAME                                               READY   STATUS    RESTARTS   AGE
ml-pipeline-ui-artifact-86f7bb664c-x5q9n           2/2     Running   0          26h
ml-pipeline-visualizationserver-5f75df47b9-xwzzr   2/2     Running   0          26h
pytorch-simple-node-0-0-7lbsc                      1/1     Running   0          52s
pytorch-simple-node-0-1-2lp46                      1/1     Running   0          52s
test1-0                                            2/2     Running   0          26h
[injhlfnlax@node-10-210-152-99 trainer]$ kubectl logs pytorch-simple-node-0-0-7lbsc -n cs-test
[injhlfnlax@node-10-210-152-99 trainer]$
[injhlfnlax@node-10-210-152-99 trainer]$ kubectl logs pytorch-simple-node-0-1-2lp46 -n cs-test
[injhlfnlax@node-10-210-152-99 trainer]$

skb888 avatar Nov 01 '25 22:11 skb888

/assign

NarayanaSabari avatar Nov 06 '25 04:11 NarayanaSabari

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Feb 04 '26 05:02 github-actions[bot]