
Create model exporter for checkpointing and training output

Open andreyvelich opened this issue 1 year ago • 13 comments

As we discussed before, as part of the Training V2 APIs we want to design and implement a model exporter sidecar that helps users perform checkpointing during distributed training and export the trained/fine-tuned model: https://github.com/kubeflow/training-operator/pull/2240#issuecomment-2321081416.

/area storage

andreyvelich avatar Aug 30 '24 18:08 andreyvelich

/assign

saileshd1402 avatar Oct 25 '24 12:10 saileshd1402

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jan 23 '25 15:01 github-actions[bot]

/remove-lifecycle stale

andreyvelich avatar Jan 23 '25 15:01 andreyvelich

This issue blocks:

  • https://github.com/kubeflow/trainer/issues/2401 (Need this issue for exporting models)
  • https://github.com/kubeflow/trainer/issues/2438 (Direct successor of this issue)

Do you mind if I include this issue as a part of #2401 and resolve it before the GSoC coding period starts? @andreyvelich @saileshd1402

Electronic-Waste avatar Mar 12 '25 05:03 Electronic-Waste

Exporting the model to the Kubeflow Model Registry can be one option, but we should design our exporter to be extensible for various targets. Should we expand the GSoC project scope to initially focus on designing a flexible model exporter, with support for the Kubeflow Model Registry as the first implementation?

andreyvelich avatar Mar 12 '25 12:03 andreyvelich

cc @akshaychitneni @shravan-achar

andreyvelich avatar Mar 12 '25 12:03 andreyvelich

Exporting the model to the Kubeflow Model Registry can be one option, but we should design our exporter to be extensible for various targets.

I agree with this. We should support PVC, S3, and Model Registry as its destination.

Should we expand the GSoC project scope to initially focus on designing a flexible model exporter, with support for the Kubeflow Model Registry as the first implementation?

That's a great idea, but we should consider the timeline of the GSoC project. If we extend the scope of the GSoC project to implementing the Model Exporter from scratch, this feature is not expected to be completed until this September, and it may take even longer, like many of our GSoC projects last year.

Since it's an important feature both technically and in terms of user experience, it might be better if we implement the model exporter with basic functionality first, like PVC and S3 support (see the rough sketch below), and then delegate the rest to GSoC students :)
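
To make the "extensible destinations" idea concrete, here is a minimal sketch of what a pluggable exporter could look like, assuming a Python-based exporter image; all names here (Destination, PVCDestination, S3Destination, export) are hypothetical and not an existing Kubeflow Trainer API:

import abc
import shutil
from pathlib import Path

import boto3  # assumed to be available in the exporter image


class Destination(abc.ABC):
    """Hypothetical interface that every export target would implement."""

    @abc.abstractmethod
    def upload(self, local_dir: Path) -> str:
        """Upload the model directory and return its final URI."""


class PVCDestination(Destination):
    def __init__(self, mount_path: str):
        self.mount_path = Path(mount_path)

    def upload(self, local_dir: Path) -> str:
        # For a PVC, "uploading" is just copying into the mounted volume.
        target = self.mount_path / local_dir.name
        shutil.copytree(local_dir, target, dirs_exist_ok=True)
        return f"pvc://{target}"


class S3Destination(Destination):
    def __init__(self, bucket: str, prefix: str):
        self.bucket, self.prefix = bucket, prefix
        self.client = boto3.client("s3")

    def upload(self, local_dir: Path) -> str:
        for file in local_dir.rglob("*"):
            if file.is_file():
                key = f"{self.prefix}/{file.relative_to(local_dir)}"
                self.client.upload_file(str(file), self.bucket, key)
        return f"s3://{self.bucket}/{self.prefix}"


def export(model_dir: str, destination: Destination) -> str:
    """Entry point a future exporter job/sidecar could call after training."""
    return destination.upload(Path(model_dir))

A Model Registry destination could then be added later as just another Destination implementation, without changing the exporter entry point.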

Please let me know what you think @kubeflow/wg-training-leads @kubeflow/wg-data-leads @saileshd1402 @franciscojavierarceo @juliusvonkohout

Electronic-Waste avatar Mar 12 '25 13:03 Electronic-Waste

As noted in https://github.com/kubeflow/trainer/issues/2438#issuecomment-2659318238

We should support PVC, S3, and Model Registry as its destination.

Model Registry is a metadata store, and we also recently merged work that helps orchestrate storage libraries as the prerequisite step before indexing/registering in Model Registry: boto3 under the hood for S3 and container tooling under the hood for OCI, per these documentation examples.

Obviously you're not required to use those methods, but hopefully they ease the way for storing a model (or a checkpoint of a model) into storage, which can then be indexed in Model Registry and used for KServe deployment etc. :)
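
For illustration, a rough sketch of that two-step flow (store the artifact, then index it in Model Registry); the local path, bucket, service address, and model names are made up, and the model_registry client calls should be double-checked against the current client documentation:

import boto3
from model_registry import ModelRegistry  # Kubeflow Model Registry Python client

# 1. Store the trained model (or a checkpoint) in blob storage.
s3 = boto3.client("s3")
s3.upload_file(
    "/workspace/output/model.safetensors",   # hypothetical local path
    "my-models-bucket",                      # hypothetical bucket
    "gemma2-2b-finetuned/model.safetensors",
)

# 2. Index/register it in Model Registry so it can later be discovered for KServe deployment etc.
registry = ModelRegistry(
    server_address="https://model-registry-service.kubeflow.svc.cluster.local",
    port=8080,                               # assumed in-cluster service and port
    author="kubeflow-trainer-exporter",
)
registry.register_model(
    "gemma2-2b-finetuned",
    "s3://my-models-bucket/gemma2-2b-finetuned/model.safetensors",
    version="v1",
    model_format_name="safetensors",
    model_format_version="1",
)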

tarilabs avatar Mar 12 '25 13:03 tarilabs

I don't think that integration with the Model Registry would be hard. We just need to make sure we deploy the MR control plane and call its API from our sidecar to export checkpoints to the Model Registry (in case we decide to go with the sidecar container approach).

andreyvelich avatar Mar 12 '25 13:03 andreyvelich

I meant that we could first support PVC and S3 as destinations of the Model Exporter, so that users can start using it earlier. Then we can delegate the work of integrating with the Model Registry to GSoC students. Does that sound good to you? @andreyvelich

Electronic-Waste avatar Mar 12 '25 14:03 Electronic-Waste

We should support PVC, S3, and Model Registry as its destination.

@Electronic-Waste @andreyvelich In Model Registry, as @tarilabs mentioned, we are also investigating ways to associate various storage solutions like S3 and OCI with the metadata store. This does not make the Model Registry a proxy for storage per se, but it provides easier associations and collects metadata along the way, so that we don't duplicate the information or ask the user for the storage info again during registration etc. We wanted to make progress on this in KF 1.10, as we called out, but we could not deliver it.

I would like us to explore how that fits in with model exporter if possible.

rareddy avatar Mar 12 '25 22:03 rareddy

I would like us to explore how that fits in with model exporter if possible.

@rareddy Do you think this will be available by June 2025?

@Electronic-Waste I think we should design our exporter in a way that it can natively support blob storages, but can also write metadata to the Kubeflow MR if the user wants. In the Kubeflow MR case, the user can bypass the storage paths that MR supports (e.g. OCI, S3, GCS). A rough configuration sketch is below.
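
Just to illustrate, a hypothetical exporter configuration could keep blob storage as the artifact destination and treat Model Registry purely as optional metadata (all names below are made up for the sketch):

from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRegistryMetadata:
    """Optional: only write metadata to the Kubeflow MR if the user asks for it."""
    server_address: str
    model_name: str
    version: str


@dataclass
class ExporterConfig:
    """Blob storage holds the artifacts; MR registration is layered on top."""
    destination_uri: str                                     # e.g. "s3://bucket/path" or "pvc:///models"
    model_registry: Optional[ModelRegistryMetadata] = None   # None = skip MR registration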

Can we talk about it on the next WG calls?

  • Training WG
  • Model Registry call.

andreyvelich avatar Mar 13 '25 00:03 andreyvelich

@andreyvelich Sure. Let's discuss it in the next WG call. Which one do you prefer?

Electronic-Waste avatar Mar 13 '25 02:03 Electronic-Waste

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 11 '25 20:06 github-actions[bot]

/remove-lifecycle stale

tenzen-y avatar Jun 12 '25 04:06 tenzen-y

Hi folks! I am actively integrating Kubeflow Trainer into our LLM fine-tuning framework, and it would be great to add the exporter as the missing piece. As an initial and naive design, I simply add a job after the node job in the JobSet:

- name: exporter
  dependsOn:
    - name: node
      status: Complete
  template:
    spec:
      template:
        spec:
          containers:
            - name: exporter
              image: alpine:latest
              command:
                - sh
                - -c
              args:
                - "echo 'TODO: upload models in /workspace/output'" # placeholder; replace with the real upload command
              volumeMounts:
                - name: initializer
                  mountPath: /workspace
          volumes:
            - name: initializer
              persistentVolumeClaim:
                claimName: torchtune-gemma2-2b

I think we could separate the handling of model exporting from checkpointing, since the integration of checkpointing could be more complex depending on users' desired behaviors. For example, for torchtune, config modifications have to be made when using the checkpoints, as mentioned in the documentation. A possible initial implementation could be to trigger the exporter after the node job fails (although JobSet does not currently support this) or to use a sidecar container (see the rough sketch below). Looking forward to your feedback.
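
To illustrate the sidecar alternative, a minimal sketch of a container that periodically ships new files from the shared output volume to S3 while training is still running; the paths, bucket, and sync interval are placeholders:

import time
from pathlib import Path

import boto3

OUTPUT_DIR = Path("/workspace/output")           # volume shared with the trainer container
BUCKET, PREFIX = "my-checkpoints", "gemma2-2b"   # hypothetical destination

s3 = boto3.client("s3")


def sync_checkpoints(uploaded: set) -> None:
    """Upload any checkpoint file that has not been shipped yet."""
    for file in OUTPUT_DIR.rglob("*"):
        if file.is_file() and file not in uploaded:
            key = f"{PREFIX}/{file.relative_to(OUTPUT_DIR)}"
            s3.upload_file(str(file), BUCKET, key)
            uploaded.add(file)


if __name__ == "__main__":
    seen = set()
    while True:  # a real sidecar would also watch for trainer completion and exit
        sync_checkpoints(seen)
        time.sleep(60)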

/cc @andreyvelich @tenzen-y @Electronic-Waste

rudeigerc avatar Jul 12 '25 13:07 rudeigerc

It's great to hear about your integration, @rudeigerc!

For the job approach, I think we should discuss the role of the exporter in Kubeflow Trainer: will the user still be responsible for saving the model to disk, with our custom job just uploading it to the model registry?

Additionally, we might want to run a sidecar, since users might want to export model checkpoints during training, as you said.

Also, I would love for us to collaborate with the folks who want to start a Kubernetes Checkpoint/Restore WG, since this is a perfect use case for us to explore: https://groups.google.com/a/kubernetes.io/g/dev/c/Q3qZnCBAtok/m/vAq8qumKCgAJ?utm_medium=email&utm_source=footer

cc @adrianreber @rst0git

andreyvelich avatar Jul 14 '25 18:07 andreyvelich

@andreyvelich Thank you so much for the heads-up! We are also working on enabling support for transparent checkpointing of distributed training workloads and would be happy to collaborate on this use-case. The following paper describes our previous work on this topic: CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

rst0git avatar Jul 15 '25 07:07 rst0git

@rst0git @andreyvelich We should definitely collaborate, specifically on multi-node support for CRIUgpu / cuda-checkpoint and the overall API integration (kubelet vs. API server).

Another major use case is support for topology changes between checkpoint and restore, similar to PyTorch distributed checkpointing load-time resharding or DeepSpeed universal checkpointing, to accommodate parallelism changes. A minimal sketch of the resharding flow is below.
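
As a rough sketch of load-time resharding with torch.distributed.checkpoint (the path is a placeholder, the model is a stand-in for a real FSDP/TP-wrapped one, and depending on the PyTorch version a process group may need to be initialized first):

import torch
import torch.distributed.checkpoint as dcp

CKPT_DIR = "/workspace/output/checkpoint-step-1000"  # hypothetical shared-volume path

# Saving job: every rank calls save() with its (possibly sharded) state dict.
model = torch.nn.Linear(16, 16)  # stand-in for the real distributed model
dcp.save({"model": model.state_dict()}, checkpoint_id=CKPT_DIR)

# Restoring job, possibly with a different number of ranks or parallelism layout:
# DCP reshards the saved tensors onto the new topology at load time.
restored = torch.nn.Linear(16, 16)
state = {"model": restored.state_dict()}
dcp.load(state, checkpoint_id=CKPT_DIR)
restored.load_state_dict(state["model"])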

astefanutti avatar Jul 15 '25 10:07 astefanutti

This is great to hear @rst0git @astefanutti, super interesting! Let's talk about it during one of the upcoming Training WG calls.

cc @shravan-achar @bigsur0 @akshaychitneni

andreyvelich avatar Jul 15 '25 14:07 andreyvelich

Hi folks, as a reminder, I added this topic to today's Training WG call, which starts in 40 minutes: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit?tab=t.0#heading=h.us3k0u9oc2q4

cc @adrianreber @rst0git

andreyvelich avatar Aug 06 '25 16:08 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 04 '25 20:11 github-actions[bot]

/remove-lifecycle stale

rst0git avatar Nov 05 '25 06:11 rst0git