
K8s Job to sync models from one model storage location to another

Open Crazyglue opened this issue 7 months ago • 13 comments

Is your feature request related to a problem? Please describe.

I would like to propose a feature to copy a model from one storage location to another, registering the model in the process and generating a ModelCar as part of that process.

A common case is when a user wants all of their models to live in a centralized OCI-compliant storage layer (Quay, self-hosted, etc.), but they may not yet have every model they are interested in stored in that layer.

For example, if you were interested in a model at a raw S3 URL or on HuggingFace, you might want to keep the base model stored in your own OCI-compliant storage layer (or some other storage location). This must also be something that can be triggered from the UI, or independently of it.

Describe the solution you'd like

A vanilla K8s Job which takes a number of parameters and env vars, reads those variables, and performs the synchronization independently of the MR Service and MR UI.

This Job will initially be triggered by a UI action (to be done later), or a raw YAML/JSON k8s Job resource can be created out of band.

Later, if desired, we can enable the MR itself to expose an endpoint which can kick off these jobs in a structured way.

During this synchronization Job, various sanity checks will be performed to ensure the job can be completed, such as validating source and destination credentials, URIs, modelName (and other model-related params), etc.

In addition, it will generate a ModelCar for that model to support model utilization in KServe, etc.

Crazyglue avatar May 13 '25 19:05 Crazyglue

For a first crack at what this k8s Job manifest might look like, I have a sample below. Note: the parameters shown are not an exhaustive list (notably, the MR URL and port are absent).

```yaml
---
apiVersion: batch/v1
kind: Job
metadata:
  name: my-async-upload-job
  namespace: model-registry # Or somewhere else
spec:
  template:
    spec:
      containers:
        - name: async-upload
          image: kubeflow/model-registry-async-job:latest
          command: ['python', '/app/async_upload.py']
          args:
            # Upload params passed in here
            - --source-uri=s3://my-bucket/my-model
            - --source-type=s3
            - --destination-uri=oci://my-registry/my-model
            - --destination-type=oci
            - --model-name=my-model
            - --model-version=1.0.0
            - --model-format=onnx
            - --model-format-version=1.0
          volumeMounts:
            - name: source-creds
              mountPath: "/mnt/source-creds"
              readOnly: true
            - name: destination-creds
              mountPath: "/mnt/destination-creds"
              readOnly: true
          env:
            # ---- Connection Secrets ----
            - name: SOURCE__SECRET_PATH
              value: "/mnt/source-creds" # <-- there will be a default, TBD where
            # or...
            - name: SOURCE__AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: minio-secret
                  key: ACCESS_KEY_ID
            # ...
            - name: DESTINATION__OCI_REGISTRY_USERNAME
              valueFrom:
                secretKeyRef:
                  name: oci-secret
                  key: OCI_REGISTRY_USERNAME
            # ...
            # ---- Model Params ----
            - name: MR__MODEL_NAME
              value: "my-model"
            - name: MR__MODEL_VERSION
              value: "1.0.0"
            - name: MR__MODEL_FORMAT
              value: "onnx"
            # ...
      
      restartPolicy: Never
      volumes:
        - name: source-creds
          secret:
            secretName: my-source-secret
        - name: destination-creds
          secret:
            secretName: my-destination-secret
```

Note: other Job params like suspend, backoffLimit, etc. are omitted for brevity.


The env vars supported will match 1-to-1 with the args that can be provided to the command. A short example of this equivalence is below:

Env to Param mapping

| Env Var Name | arg Param |
| --- | --- |
| SOURCE__SECRET_PATH | source-secret-path |
| DESTINATION__OCI_REGISTRY_USERNAME | destination-oci-registry-username |
| MR__MODEL_NAME | model-name |
| MR__MODEL_VERSION | model-version |
| MR__MODEL_FORMAT | model-format |

Examples

| Env Var | Param |
| --- | --- |
| MR__MODEL_NAME="my-model" | --model-name "my-model" |
| MR__MODEL_VERSION="1.0.0" | --model-version "1.0.0" |
| MR__MODEL_FORMAT="onnx" | --model-format "onnx" |
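
To make the intended 1-to-1 mapping concrete, here is a minimal sketch (not the actual job implementation) of how the parameter resolution could work; the `env_name`/`resolve` helpers are hypothetical, and only the names from the tables above come from the proposal:

```python
import argparse
import os


def env_name(param: str) -> str:
    """Map a CLI param to its env var per the table above, e.g.
    'source-secret-path' -> SOURCE__SECRET_PATH, 'model-name' -> MR__MODEL_NAME."""
    key = param.upper().replace("-", "_")
    for prefix in ("SOURCE_", "DESTINATION_"):
        if key.startswith(prefix):
            return prefix + "_" + key[len(prefix):]
    return "MR__" + key


def resolve(param: str, cli_value=None):
    """Prefer an explicit CLI arg; fall back to the equivalent env var."""
    return cli_value if cli_value is not None else os.environ.get(env_name(param))


parser = argparse.ArgumentParser()
parser.add_argument("--model-name")
parser.add_argument("--model-version")
parser.add_argument("--model-format")
args, _ = parser.parse_known_args()

# MR__MODEL_NAME="my-model" and --model-name "my-model" end up equivalent here
model_name = resolve("model-name", args.model_name)
model_version = resolve("model-version", args.model_version)
model_format = resolve("model-format", args.model_format)
```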

Crazyglue avatar May 13 '25 19:05 Crazyglue

+1 @Crazyglue thanks for raising this; we have these capabilities in the MR py client per the last iteration cycles, and this will make them accessible from a K8s (async) Job as well.

tarilabs avatar May 13 '25 19:05 tarilabs

If we read secrets as files instead of exploding them into env params, we can keep the job spec small and independent of the secret contents. The Python job implementation can read the contents and load what it needs based on source/destination connection types. wdyt?

dhirajsb avatar May 13 '25 21:05 dhirajsb

> If we read secrets as files instead of exploding them into env params, we can keep the job spec small and independent of the secret contents. The Python job implementation can read the contents and load what it needs based on source/destination connection types. wdyt?

Indeed, this is the primary method of injecting secrets/credentials into the job imo. However, I think it also makes sense to provide those as env vars for the cases where that is more convenient.

The main distinction to make is between source and destination credentials, as those of course may be different

If it isn't too much of a hassle, perhaps it's worth reading from all 3 sources: file > params > env (in the cases where more than one is present).
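
For what it's worth, a minimal sketch of that precedence (mounted file > CLI param > env var) for a single credential key; it assumes the source secret is mounted at `/mnt/source-creds` as in the manifest above, and the `load_credential` helper is purely illustrative:

```python
import os
from pathlib import Path


def load_credential(key: str, secret_dir: str, cli_value=None, env_var=None):
    """Resolve a credential with precedence: mounted file > CLI param > env var."""
    # 1. Mounted Secret: each key of the Secret shows up as a file in the volume.
    file_path = Path(secret_dir) / key
    if file_path.is_file():
        return file_path.read_text().strip()
    # 2. Explicit CLI parameter, if the caller passed one.
    if cli_value is not None:
        return cli_value
    # 3. Environment variable fallback.
    if env_var is not None:
        return os.environ.get(env_var)
    return None


# Example: AWS access key for the source connection, assuming the Secret key
# is named ACCESS_KEY_ID and the env var is SOURCE__AWS_ACCESS_KEY_ID.
access_key = load_credential(
    "ACCESS_KEY_ID",
    secret_dir="/mnt/source-creds",
    env_var="SOURCE__AWS_ACCESS_KEY_ID",
)
```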

Crazyglue avatar May 13 '25 22:05 Crazyglue

One thing I want to note down: we should consider providing as input "coordinates" for the Job: the RegisteredModel.ID and the ModelVersion.ID. Since we're evaluating options to edit the "name", I believe capturing the logical model entities' IDs makes the intent more explicit, since the Job runs asynchronously (i.e. with a delay).

tarilabs avatar May 14 '25 14:05 tarilabs

@Crazyglue are we thinking of creating bespoke copy jobs for different source and destination combinations, or something generic? The former would specify env variables, etc., whereas the latter could be agnostic by mounting secrets as volumes and discovering source and destination store types at runtime.

Any thoughts about the pros/cons?

dhirajsb avatar May 19 '25 23:05 dhirajsb

@dhirajsb My thinking is that this would be agnostic, so the generic case you have outlined. It will look for mounted env vars/command-args/etc and perform a generic replication task using those parameters. If those parameters are malformed/missing/etc, then the Job will exit(1).

I suppose if we were to structure it by specifying a source and destination, we would need a data-model for those, and that data would have to live somewhere (likely defined as a CRD). This is quite a bit of overhead for a simple copy/paste job so I wanted to avoid it if possible.

If we did go that route, the actual mechanism by which we trigger these jobs would be extremely simple and we could do a lot of heavy lifting on the backend to pass credentials/params into the job. It would also likely require another CRD like a kubeflow.com/v1alpha1/AsyncJob which would have a spec that could take references to these source/destination CRs.

I think ^ is actually a good approach generally, as this would allow the MR server itself to expose an endpoint that the UI and/or python-client could trigger (many validations can be moved to the API, and we can guarantee jobs can be executed before we attempt to execute them. It also scales much better in terms of usability).

However, given the time needed to build out such a system, I'm not sure the overhead is worth it at this moment, which is why I'm leaning towards just defining a Job template and building the Dockerfile to expect it to be triggered following that template. Basically I think it's worth proving out the concept and getting it out there sooner, and later we can iterate to scale it and improve usability.

WDYT?

Crazyglue avatar May 20 '25 14:05 Crazyglue

> If those parameters are malformed/missing/etc, then the Job will exit(1).

yes, and please let's not forget the requirement of:

in case of failures, leverage terminationMessagePath for a concise description of the failure
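
For reference, a minimal sketch of how the job could write a concise failure description to the default terminationMessagePath before exiting non-zero; the `fail` helper and the message wording are illustrative only:

```python
import sys

TERMINATION_LOG = "/dev/termination-log"  # Kubernetes default terminationMessagePath


def fail(reason: str) -> None:
    """Record a concise failure description for kubectl/UI consumers, then exit(1)."""
    try:
        with open(TERMINATION_LOG, "w") as f:
            f.write(reason[:4096])  # keep it short; the field is size-limited
    except OSError:
        pass  # best effort: still fail loudly on stderr below
    print(reason, file=sys.stderr)
    sys.exit(1)


# Example usage during the up-front sanity checks (hypothetical param name):
# if not source_uri:
#     fail("missing required parameter: --source-uri / SOURCE__URI")
```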

tarilabs avatar May 20 '25 15:05 tarilabs

> My thinking is that this would be agnostic, so the generic case you have outlined

+1

and to

> Any thoughts about the pros/cons?

A pro is that we don't need any logic for this in the GUI. Naturally, as you expressed, some checks can be performed early on (e.g. check the Push credentials are valid, without having to wait for the actual Push as part of the job).

> Basically I think it's worth proving out the concept and getting it out there sooner, and later we can iterate to scale it and improve usability.

+1

tarilabs avatar May 20 '25 16:05 tarilabs

> It will look for mounted env vars/command-args/etc and perform a generic replication task using those parameters. If those parameters are malformed/missing/etc, then the Job will exit(1).

I would encourage checking whether there is any way we can leverage https://pypi.org/project/gila/ or similar, as previously noted by @dhirajsb, to do it the 12-factor way.

tarilabs avatar May 20 '25 17:05 tarilabs

Right, so @tarilabs mostly summarized what we discussed in our 1-on-1 on this.

  • Let's build the python executable to use gila and support 12-factor injection of secrets
  • Let's build the Job to be a generic/agnostic job, and always inject secrets through paths and have the python job read the properties dynamically

That way the Job doesn't have to carry properties specific to source/destination types. If it did, the client/dashboard would need complete knowledge of what those properties should be. Instead, we can simply point the Job at all the properties by passing the source and destination secrets on different paths.
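
To illustrate the agnostic approach, a rough sketch of how the Python job might infer each connection type purely from the keys present in the mounted secret directories; the key names checked and the `detect_store_type` helper are assumptions, not an agreed contract:

```python
from pathlib import Path


def read_secret_dir(path: str) -> dict:
    """Load every key of a mounted Kubernetes Secret volume into a dict."""
    return {f.name: f.read_text().strip() for f in Path(path).iterdir() if f.is_file()}


def detect_store_type(props: dict) -> str:
    """Guess the connection type from which credential keys are present."""
    if "OCI_REGISTRY_USERNAME" in props:  # hypothetical OCI key name
        return "oci"
    if "AWS_ACCESS_KEY_ID" in props or "ACCESS_KEY_ID" in props:  # hypothetical S3 key names
        return "s3"
    raise ValueError(f"unrecognized connection secret keys: {sorted(props)}")


# Paths follow the volumeMounts in the sample manifest above.
source = read_secret_dir("/mnt/source-creds")
destination = read_secret_dir("/mnt/destination-creds")
print(detect_store_type(source), "->", detect_store_type(destination))
```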

Does that work?

dhirajsb avatar May 22 '25 15:05 dhirajsb

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 21 '25 04:08 github-actions[bot]