Create model exporter for checkpointing and training output
As we discussed before, as part of the Training V2 APIs we want to design and implement a model exporter sidecar that helps users checkpoint during distributed training and export the trained/fine-tuned model: https://github.com/kubeflow/training-operator/pull/2240#issuecomment-2321081416.
/area storage
/assign
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue blocks:
- https://github.com/kubeflow/trainer/issues/2401 (Need this issue for exporting models)
- https://github.com/kubeflow/trainer/issues/2438 (Direct successor of this issue)
Do you mind if I include this issue as a part of #2401 and resolve it before the GSoC coding period starts? @andreyvelich @saileshd1402
Exporting the model to the Kubeflow Model Registry can be one option, but we should design our exporter to be extensible for various targets. Should we expand the GSoC project scope to initially focus on designing a flexible model exporter, with support for the Kubeflow Model Registry as the first implementation?
cc @akshaychitneni @shravan-achar
Exporting the model to the Kubeflow Model Registry can be one option, but we should design our exporter to be extensible for various targets.
I agree with this. We should support PVC, S3, and Model Registry as its destination.
Should we expand the GSoC project scope to initially focus on designing a flexible model exporter, with support for the Kubeflow Model Registry as the first implementation?
That's a great idea. But we should consider the timeline of the GSoC project: if we extend its scope to implementing the Model Exporter from scratch, this feature would not be completed until this September, and it may take even longer, like many of our GSoC projects last year.
Since it's an important feature both technically and in terms of user experience, it might be better if we implement the model exporter with basic functionality (e.g. PVC, S3) first, and then delegate the rest to GSoC students :)
Please let me know what you think @kubeflow/wg-training-leads @kubeflow/wg-data-leads @saileshd1402 @franciscojavierarceo @juliusvonkohout
As noted in https://github.com/kubeflow/trainer/issues/2438#issuecomment-2659318238:
We should support PVC, S3, and Model Registry as its destination.
Model Registry is a metadata store, and we also just recently merged work that helps orchestrate storage libraries as the prerequisite step to then index/register in Model Registry, using boto3 under the hood for S3 and container tooling under the hood for OCI, per these documentation examples.
Obviously you're not required to use those methods, but hopefully they ease the way for storing a model (or a checkpoint of a model) into storage, which can then be indexed in Model Registry and used for KServe deployment etc :)
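For illustration, a rough sketch of that flow in Python; the bucket, paths, and registry endpoint are placeholders, and the exact model-registry client signature may differ slightly from what is shown here:

```python
# Hypothetical sketch: store a checkpoint in S3 with boto3, then index it in Model Registry.
import boto3
from model_registry import ModelRegistry

# Upload the trained model artifact to blob storage (bucket/key are placeholders).
s3 = boto3.client("s3")
s3.upload_file("/workspace/output/model.safetensors", "my-bucket", "gemma2-2b/model.safetensors")

# Register the stored artifact in Kubeflow Model Registry so it can be discovered later,
# e.g. for a KServe deployment (endpoint below is a placeholder).
registry = ModelRegistry("https://model-registry.example.com", author="kubeflow-trainer")
registry.register_model(
    "gemma2-2b-finetuned",
    "s3://my-bucket/gemma2-2b/model.safetensors",
    version="v1",
    model_format_name="safetensors",
    model_format_version="1",
)
```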
I don't think that integration with Model Registry would be hard. We just need to make sure we deploy the MR control plane and call its API in our sidecar to export checkpoints to the Model Registry (in case we decide on the sidecar container approach).
I meant that we can first support PVC and S3 as the destinations of the Model Exporter, so that users can start using it earlier, and then delegate the work of integrating with Model Registry to GSoC students. Does that sound good to you? @andreyvelich
We should support PVC, S3, and Model Registry as its destination.
@Electronic-Waste @andreyvelich In Model Registry, as @tarilabs mentioned, we are also investigating ways to associate various storage solutions like S3 and OCI with the metadata store. This does not make the Model Registry a proxy for storage per se, but it provides easier associations and collects metadata along the way, so we don't duplicate the information or ask the user for the storage info again during registration. We wanted to progress on this in KF 1.10 as we called out, but we could not deliver on it.
I would like us to explore how that fits in with model exporter if possible.
I would like us to explore how that fits in with model exporter if possible.
@rareddy Do you think this will be available by June 2025?
@Electronic-Waste I think we should design our exporter in a way that it can natively support blob storage, but also write metadata to the Kubeflow MR if the user wants. In the case of Kubeflow MR, the user can bypass our native storage path and use the storage paths that MR supports (e.g. OCI, S3, GCS).
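To make that extensibility concrete, here is a rough sketch of what a pluggable destination interface could look like; all names are hypothetical and not actual Kubeflow Trainer APIs:

```python
# Hypothetical sketch of a pluggable exporter destination; none of these names are real Trainer APIs.
from typing import Protocol

import boto3


class ExportDestination(Protocol):
    def export(self, local_path: str) -> str:
        """Upload the artifact at local_path and return its destination URI."""
        ...


class S3Destination:
    def __init__(self, bucket: str, prefix: str):
        self.bucket, self.prefix = bucket, prefix

    def export(self, local_path: str) -> str:
        key = f"{self.prefix}/{local_path.rsplit('/', 1)[-1]}"
        boto3.client("s3").upload_file(local_path, self.bucket, key)
        return f"s3://{self.bucket}/{key}"


class PVCDestination:
    def __init__(self, mount_path: str):
        self.mount_path = mount_path

    def export(self, local_path: str) -> str:
        import shutil

        return shutil.copy(local_path, self.mount_path)


# A ModelRegistryDestination could follow the same interface, storing the artifact via one of
# the storage paths that MR supports and then writing the metadata record.
```

The point of such an interface is that adding a new target (OCI, GCS, Model Registry, ...) only requires a new implementation, not changes to the training code.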
Can we talk about it on one of the next WG calls?
- Training WG
- Model Registry call.
@andreyvelich Sure. Let's discuss it in the next WG call. Which one do you prefer?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Hi folks! I am actively integrating Kubeflow Trainer into our LLM fine-tuning framework. It would be great to have the exporter as the missing piece. As an initial and naive design, I simply add a job after the node job in the JobSet:
```yaml
    - name: exporter
      dependsOn:
        - name: node
          status: Complete
      template:
        spec:
          template:
            spec:
              containers:
                - name: exporter
                  image: alpine:latest
                  command:
                    - sh
                    - -c
                  args:
                    - # Upload models in `/workspace/output`
                  volumeMounts:
                    - name: initializer
                      mountPath: /workspace
              volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: torchtune-gemma2-2b
```
I think we could separate the handling of model exporting from checkpointing, since integrating checkpointing could be more complex depending on users' desired behaviors. For example, for torchtune, config modifications have to be made when using the checkpoints, as mentioned in the documentation. A possible initial implementation could be to trigger the exporter after the node job fails (although JobSet does not currently support this) or to use a sidecar container. Looking forward to the feedback.
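As a concrete placeholder for the upload step in the snippet above, the exporter container could run something along these lines; the bucket and prefix are made up, and it assumes an image with Python and boto3 rather than plain alpine:

```python
# Hypothetical exporter script: upload everything under /workspace/output to S3.
import os

import boto3

s3 = boto3.client("s3")
output_dir = "/workspace/output"

for root, _, files in os.walk(output_dir):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path, output_dir)
        s3.upload_file(local_path, "my-bucket", f"torchtune-gemma2-2b/{key}")
        print(f"Uploaded {local_path} -> s3://my-bucket/torchtune-gemma2-2b/{key}")
```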
/cc @andreyvelich @tenzen-y @Electronic-Waste
This is great to hear about your integration @rudeigerc!
For the job approach, I guess we should discuss the role of the exporter in Kubeflow Trainer. Will the user still be responsible for saving the model to disk, with our custom job just uploading it to the model registry?
Additionally, we might want to run a sidecar, since users might want to export model checkpoints during training, as you said.
Also, I would love for us to collaborate with the folks who want to start the Kubernetes Checkpoint/Restore WG, since this is a perfect use-case for us to explore: https://groups.google.com/a/kubernetes.io/g/dev/c/Q3qZnCBAtok/m/vAq8qumKCgAJ?utm_medium=email&utm_source=footer
cc @adrianreber @rst0git
@andreyvelich Thank you so much for the heads-up! We are also working on enabling support for transparent checkpointing of distributed training workloads and would be happy to collaborate on this use-case. The following paper describes our previous work on this topic: CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads
@rst0git @andreyvelich We should definitely collaborate, specifically on multi-node support for CRIUgpu / cuda-checkpoint and the overall API integration (kubelet vs. API server).
Another major use case is the support for topology changes between checkpoint and restore, similar to PyTorch distributed checkpointing load-time resharding, or DeepSpeed universal checkpointing to accommodate parallelism changes.
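For reference, a minimal sketch of the PyTorch Distributed Checkpointing (DCP) save/restore pattern that enables that load-time resharding; the path is a placeholder, the functions are assumed to run inside an already-initialized distributed job with an FSDP/DDP-wrapped model, and the exact DCP API depends on the PyTorch version:

```python
# Rough sketch of PyTorch Distributed Checkpointing (DCP) with load-time resharding.
import torch.distributed.checkpoint as dcp


def save_checkpoint(model, path="/workspace/output/ckpt"):
    # Each rank writes its shard; DCP also stores metadata describing the global layout.
    dcp.save({"model": model.state_dict()}, checkpoint_id=path)


def restore_checkpoint(model, path="/workspace/output/ckpt"):
    # On restore, DCP reshards the saved tensors to match the *current* topology and
    # parallelism, which can differ from the one used at save time.
    state = {"model": model.state_dict()}
    dcp.load(state, checkpoint_id=path)
    model.load_state_dict(state["model"])
```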
This is great to hear @rst0git @astefanutti, super interesting! Let's talk about it during one of the upcoming Training WG calls.
cc @shravan-achar @bigsur0 @akshaychitneni
Hi Folks, as a reminder I added this topic to today's Training WG call in 40 minutes. https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit?tab=t.0#heading=h.us3k0u9oc2q4
cc @adrianreber @rst0git
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale