Add YAML examples for trainer
Hello,
Your Jupyter notebook examples are really handy, but I'd like to programmatically create a trainer job from YAML.
Could we also have examples of jobs that work out of the box? I wanted to play around with submitting a train job via kubectl, but I couldn't find any examples of the ClusterTrainingRuntime or TrainJob CRDs.
Kevin
Sure, we have a few examples of TrainJob YAML in the operator guides:
- https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/#new-trainjob-v2
- https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime/#example-of-clustertrainingruntime
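For quick reference, a minimal TrainJob manifest along the lines of those guides might look like the sketch below. The field names follow my reading of the v2 API, and the runtime name `torch-distributed` is an assumption based on the default runtimes; check the linked docs for the authoritative schema.

```yaml
# Minimal TrainJob sketch (v2 API). Assumes a ClusterTrainingRuntime
# named "torch-distributed" (from the default runtimes) is installed.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-train
  namespace: default
spec:
  runtimeRef:
    name: torch-distributed
  trainer:
    numNodes: 2
```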
Maybe in the future we can add more guides (they could be operator guides) for users who are familiar with Kubernetes and YAML, for example how to configure PodSpecOverrides or set up labels for Kueue/Volcano.
Although, since the primary interface for users (e.g. AI Practitioners) should be the Python SDK, I want to avoid having YAML examples under the examples/ directory.
Thoughts @kubeflow/kubeflow-trainer-team @astefanutti @kannon92 @kramaranya ?
> Although, since the primary interface for users (e.g. AI Practitioners) should be the Python SDK, I want to avoid having YAML examples under the examples/ directory.
Agree with @andreyvelich. I find the operator guide already sufficient for Platform Admins, but we could definitely enhance it with more advanced configurations like gang-scheduling, PodSpecOverrides, and the queue integration you mentioned. Maybe update it with complete kubectl workflows, from applying ClusterTrainingRuntimes to submitting TrainJobs and checking their status.
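Such a workflow could be sketched roughly as follows; the file and resource names here are placeholders, not taken from the docs:

```shell
# Hypothetical end-to-end kubectl workflow (all names are placeholders).
kubectl apply -f cluster-training-runtime.yaml   # install a ClusterTrainingRuntime
kubectl apply -f trainjob.yaml                   # submit a TrainJob referencing it
kubectl get trainjobs -n default                 # check TrainJob status
kubectl get pods -n default                      # inspect the underlying pods
kubectl logs -n default <trainjob-pod-name>      # follow training output
```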
Also, the Pipeline framework functionality documentation (https://github.com/kubeflow/website/pull/4039) will be really useful for showing Platform Admins how to extend Plugins and handle orchestration for different ML frameworks.
I didn't think to check those docs for this; I was trying to find an example in the repo.
doh!
/close
@kannon92: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Could some samples / examples be added into the manifests directory?
We can talk about it if the content in the operator guides is not sufficient. I would like us to avoid maintaining example images, since they slow down CI and releases.
I was just hoping to find simple hello-world examples without training.
I think we should create a dedicated operator guide that shows how to use TrainJob YAML. PodSpecOverrides would be a nice candidate, since we need to explain that API.
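As a starting point, a rough sketch of what such a guide could show is below. The field names follow my understanding of the v2 PodSpecOverrides API, and the runtime name, volume, and claim values are purely illustrative; the actual guide should use the authoritative API reference.

```yaml
# Sketch only: overriding the pod spec of the "node" job in a TrainJob.
# Field names per my reading of the v2 API; values are illustrative.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: train-with-overrides
spec:
  runtimeRef:
    name: torch-distributed
  podSpecOverrides:
    - targetJobs:
        - name: node
      volumes:
        - name: dataset
          persistentVolumeClaim:
            claimName: dataset-pvc
      containers:
        - name: node
          volumeMounts:
            - name: dataset
              mountPath: /data
```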
I'm GPU-poor..
/reopen
@kannon92: Reopened this issue.
In response to this:
/reopen
It's important to provide examples for different training frameworks, as they help newcomers get familiar with how to use the project.
Sure! We would like to add more examples as part of: https://github.com/kubeflow/trainer/issues/2040. We already have examples for DeepSpeed, Torch, and MLX.
However, we want our examples to be Python-focused, since AI practitioners can quickly replicate them without understanding kubectl or Kubernetes concepts.
@xigang Would you like to contribute some examples?
Hi, I followed the TrainJob (v2) example and tried to test it in a namespace with istio-injection=disabled. All my pods are running, but I don't see any logs/output as I did with PyTorchJob (v1). I exec'd into the node container and tried to run the mnist.py script; it gets stuck at "Using CUDA". Do you have any suggestions?
kubectl describe pod pytorch-simple-node-0-0-7lbsc -n cs-test
Name: pytorch-simple-node-0-0-7lbsc
Namespace: cs-test
Priority: 0
Service Account: default
Node: node-10-239-0-9/10.239.0.9
Start Time: Sat, 01 Nov 2025 23:42:44 +0100
Labels: batch.kubernetes.io/controller-uid=476425e2-e91f-458c-bdbc-c99a2f9401e2
batch.kubernetes.io/job-completion-index=0
batch.kubernetes.io/job-name=pytorch-simple-node-0
controller-uid=476425e2-e91f-458c-bdbc-c99a2f9401e2
job-name=pytorch-simple-node-0
jobset.sigs.k8s.io/global-replicas=1
jobset.sigs.k8s.io/group-name=default
jobset.sigs.k8s.io/group-replicas=1
jobset.sigs.k8s.io/job-global-index=0
jobset.sigs.k8s.io/job-group-index=0
jobset.sigs.k8s.io/job-index=0
jobset.sigs.k8s.io/job-key=feb9ce412eaec2cc441c0d2105a5b9cdf066c4ba
jobset.sigs.k8s.io/jobset-name=pytorch-simple
jobset.sigs.k8s.io/jobset-uid=c741565d-75dc-40de-8fab-6517e2bba03f
jobset.sigs.k8s.io/replicatedjob-name=node
jobset.sigs.k8s.io/replicatedjob-replicas=1
jobset.sigs.k8s.io/restart-attempt=0
Annotations: batch.kubernetes.io/job-completion-index: 0
jobset.sigs.k8s.io/global-replicas: 1
jobset.sigs.k8s.io/group-name: default
jobset.sigs.k8s.io/group-replicas: 1
jobset.sigs.k8s.io/job-global-index: 0
jobset.sigs.k8s.io/job-group-index: 0
jobset.sigs.k8s.io/job-index: 0
jobset.sigs.k8s.io/job-key: feb9ce412eaec2cc441c0d2105a5b9cdf066c4ba
jobset.sigs.k8s.io/jobset-name: pytorch-simple
jobset.sigs.k8s.io/jobset-uid: c741565d-75dc-40de-8fab-6517e2bba03f
jobset.sigs.k8s.io/replicatedjob-name: node
jobset.sigs.k8s.io/replicatedjob-replicas: 1
jobset.sigs.k8s.io/restart-attempt: 0
Status: Running
IP: 192.168.4.11
IPs:
IP: 192.168.4.11
IP: fc00:1000::469
Controlled By: Job/pytorch-simple-node-0
Containers:
node:
Container ID: containerd://7f31241b1dc278d366212bf9136b16cbbe5c2be2b3eb01c29c9289e1ffb6bbdc
Image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
Image ID: docker-ecr002.rnd.gic.ericsson.se/kubeflowkatib/pytorch-mnist@sha256:5164399299fc6ceebcdfa0df5b303a2d63c05776188f55a336c5d3514a4e3227
Port: 29500/TCP
Host Port: 0/TCP
Command:
python3
/opt/pytorch-mnist/mnist.py
--epochs=1
State: Running
Started: Sat, 01 Nov 2025 23:42:45 +0100
Ready: True
Restart Count: 0
Environment:
PET_NNODES: 2
PET_NPROC_PER_NODE: 1
PET_NODE_RANK: (v1:metadata.annotations['batch.kubernetes.io/job-completion-index'])
PET_MASTER_ADDR: pytorch-simple-node-0-0.pytorch-simple
PET_MASTER_PORT: 29500
JOB_COMPLETION_INDEX: (v1:metadata.labels['batch.kubernetes.io/job-completion-index'])
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stmjj (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-stmjj:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
Events:
  Normal  Scheduled  47s  default-scheduler  Successfully assigned cs-test/pytorch-simple-node-0-0-7lbsc to node-10-239-0-9
  Normal  Pulled     47s  kubelet            Container image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" already present on machine
  Normal  Created    47s  kubelet            Created container: node
  Normal  Started    47s  kubelet            Started container node

[injhlfnlax@node-10-210-152-99 trainer]$ kubectl get pods -n cs-test
NAME                                               READY   STATUS    RESTARTS   AGE
ml-pipeline-ui-artifact-86f7bb664c-x5q9n           2/2     Running   0          26h
ml-pipeline-visualizationserver-5f75df47b9-xwzzr   2/2     Running   0          26h
pytorch-simple-node-0-0-7lbsc                      1/1     Running   0          52s
pytorch-simple-node-0-1-2lp46                      1/1     Running   0          52s
test1-0                                            2/2     Running   0          26h
[injhlfnlax@node-10-210-152-99 trainer]$ kubectl logs pytorch-simple-node-0-0-7lbsc -n cs-test
[injhlfnlax@node-10-210-152-99 trainer]$ kubectl logs pytorch-simple-node-0-1-2lp46 -n cs-test
[injhlfnlax@node-10-210-152-99 trainer]$
/assign
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.