K8s Job to sync models from one model storage location to another
**Is your feature request related to a problem? Please describe.**
I would like to propose a feature to copy a model from one storage location to another, registering the model in the process and generating a ModelCar as part of that process.
A common case is a user who wants all of their models to live in a centralized OCI-compliant storage layer (Quay, self-hosted, etc.), but who does not yet have every model of interest in that layer.
For example, if you were interested in a model at a raw S3 URL or on HuggingFace, you might want to keep the base model stored in your own OCI-compliant storage layer (or some other storage location). This must also be something that can be triggered from the UI, or independently of the UI.
**Describe the solution you'd like**
A vanilla K8s Job which will take a number of parameters and env vars, read them, and perform the synchronization independently of the MR Service and MR UI.
This Job will initially be triggered by a UI action (to be done later), or a raw YAML/JSON k8s Job resource can be created out of band.
Later, if desired, we can enable the MR itself to expose an endpoint which can kick off these jobs in a structured way.
During this synchronization Job, various sanity checks will be performed to ensure the job can complete, such as validating source and destination credentials, URIs, the modelName (and other model-related params), etc.
In addition, it will generate a ModelCar for that model to support serving it in KServe, etc.
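For illustration only: a ModelCar is essentially an OCI image with the model files baked in at a well-known path. A rough sketch of how the job might assemble one is below; the `/models` path, the podman invocation, and the function name are assumptions for the sketch, not a settled design (and running a container build inside the Job would need appropriate privileges).

```python
import subprocess
import tempfile

def build_modelcar(model_dir: str, image_ref: str) -> None:
    """Package already-downloaded model files into a ModelCar OCI image.

    Assumes the model has been fetched to model_dir and that podman is
    available in the job image.
    """
    # KServe ModelCars conventionally ship the model files under /models
    containerfile = "FROM scratch\nCOPY . /models/\n"
    with tempfile.NamedTemporaryFile("w", suffix=".Containerfile", delete=False) as f:
        f.write(containerfile)
        containerfile_path = f.name

    # Build the image from the model directory, then push it to the destination
    subprocess.run(
        ["podman", "build", "-t", image_ref, "-f", containerfile_path, model_dir],
        check=True,
    )
    subprocess.run(["podman", "push", image_ref], check=True)
```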
For a first crack at what this K8s Job manifest might look like, I have a sample below. Note: the parameters shown are not exhaustive (notably, the MR URL and port are absent).
```yaml
---
apiVersion: batch/v1
kind: Job
metadata:
  name: my-async-upload-job
  namespace: model-registry # Or somewhere else
spec:
  template:
    spec:
      containers:
        - name: async-upload
          image: kubeflow/model-registry-async-job:latest
          command: ["python", "/app/async_upload.py"]
          # Upload params passed in here
          args:
            - --source-uri=s3://my-bucket/my-model
            - --source-type=s3
            - --destination-uri=oci://my-registry/my-model
            - --destination-type=oci
            - --model-name=my-model
            - --model-version=1.0.0
            - --model-format=onnx
            - --model-format-version=1.0
          volumeMounts:
            - name: source-creds
              mountPath: "/mnt/source-creds"
              readOnly: true
            - name: destination-creds
              mountPath: "/mnt/destination-creds"
              readOnly: true
          env:
            # ---- Connection Secrets ----
            - name: SOURCE__SECRET_PATH
              value: "/mnt/source-creds" # <-- there will be a default, TBD where
            # or...
            - name: SOURCE__AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: minio-secret
                  key: ACCESS_KEY_ID
            # ...
            - name: DESTINATION__OCI_REGISTRY_USERNAME
              valueFrom:
                secretKeyRef:
                  name: oci-secret
                  key: OCI_REGISTRY_USERNAME
            # ...
            # ---- Model Params ----
            - name: MR__MODEL_NAME
              value: "my-model"
            - name: MR__MODEL_VERSION
              value: "1.0.0"
            - name: MR__MODEL_FORMAT
              value: "onnx"
            # ...
      restartPolicy: Never
      volumes:
        - name: source-creds
          secret:
            secretName: my-source-secret
        - name: destination-creds
          secret:
            secretName: my-destination-secret
```
Note: other Job params like suspend, backoffLimit, etc. are omitted for brevity.
The env vars supported will map 1-to-1 to the args that can be provided to the command. A short example of this equivalence between the two input styles is below:
**Env to Param mapping**
| Env Var Name | arg Param |
|---|---|
| SOURCE__SECRET_PATH | source-secret-path |
| DESTINATION__OCI_REGISTRY_USERNAME | destination-oci-registry-username |
| MR__MODEL_NAME | model-name |
| MR__MODEL_VERSION | model-version |
| MR__MODEL_FORMAT | model-format |
**Examples**
| Env Var | Param |
|---|---|
| MR__MODEL_NAME="my-model" | --model-name "my-model" |
| MR__MODEL_VERSION="1.0.0" | --model-version "1.0.0" |
| MR__MODEL_FORMAT="onnx" | --model-format "onnx" |
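To keep the two input styles from drifting apart, the mapping could be derived mechanically rather than maintained by hand; a small sketch (the rule that the MR__ prefix is dropped while SOURCE__/DESTINATION__ are kept is inferred from the tables above):

```python
def env_to_arg(env_var: str) -> str:
    """Derive the CLI arg name from an env var name, per the tables above.

    The MR__ prefix is dropped entirely; other prefixes (SOURCE__,
    DESTINATION__) stay part of the arg name.
    """
    prefix, _, rest = env_var.partition("__")
    name = rest if prefix == "MR" else f"{prefix}_{rest}"
    return "--" + name.lower().replace("_", "-")

assert env_to_arg("MR__MODEL_NAME") == "--model-name"
assert env_to_arg("SOURCE__SECRET_PATH") == "--source-secret-path"
assert env_to_arg("DESTINATION__OCI_REGISTRY_USERNAME") == "--destination-oci-registry-username"
```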
+1 @Crazyglue, thanks for raising this; we built these capabilities into the MR py client over the last iteration cycles, and this will also make them accessible from a K8s (async) Job.
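For reference, the registration half of the job with the MR py client might look roughly like the sketch below; the server address, port, and exact kwargs are illustrative, so check the client docs for the current signature.

```python
from model_registry import ModelRegistry

# Address/port are placeholders; in the Job these would come from params/env
registry = ModelRegistry(
    "https://model-registry.example.com", 8080, author="async-upload-job"
)

registry.register_model(
    "my-model",
    "oci://my-registry/my-model",  # destination URI once the copy has completed
    version="1.0.0",
    model_format_name="onnx",
    model_format_version="1.0",
)
```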
If we read secrets as files instead of exploding them into env params, we can keep the job spec small and independent of the secret contents. The Python job implementation can read the contents and load what it needs based on source/destination connection types. wdyt?
> If we read secrets as files instead of exploding them into env params, we can keep the job spec small and independent of the secret contents. The Python job implementation can read the contents and load what it needs based on source/destination connection types. wdyt?
Indeed, this is the primary method of injecting secrets/credentials into the job imo. However, I think it also makes sense to provide those as env vars for cases where that is more convenient.
The main distinction to make is between source and destination credentials, as those may of course differ.
If it isn't too much of a hassle, perhaps it's worth reading from all 3 sources: file > params > env (in the cases where more than one is present). A sketch of that precedence is below.
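A minimal sketch of that precedence, assuming the mounted-secret layout and env var naming from the manifest above (the helper name is illustrative; source side shown):

```python
import os
from pathlib import Path

def resolve(key: str, cli_args: dict[str, str], secret_dir: Path) -> str | None:
    """Resolve one parameter with precedence file > CLI arg > env var.

    K8s mounts each Secret key as a file named after the key, e.g.
    /mnt/source-creds/AWS_ACCESS_KEY_ID.
    """
    file_path = secret_dir / key
    if file_path.is_file():                    # 1. mounted secret file wins
        return file_path.read_text().strip()
    arg_name = key.lower().replace("_", "-")   # 2. then e.g. --aws-access-key-id
    if cli_args.get(arg_name):
        return cli_args[arg_name]
    return os.environ.get(f"SOURCE__{key}")    # 3. finally SOURCE__AWS_ACCESS_KEY_ID

# e.g. resolve("AWS_ACCESS_KEY_ID", parsed_args, Path("/mnt/source-creds"))
```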
One thing I want to note down: we should consider providing the RegisteredModel.ID and the ModelVersion.ID as input "coordinates" for the Job; since we're evaluating options to edit the "name", I believe capturing the logical model entities' IDs makes the intent more explicit, given that the Job runs asynchronously (i.e. with a delay).
@Crazyglue are we thinking of creating bespoke copy jobs for different source and destination combinations, or something generic? The former would specify env variables, etc., whereas the latter could be agnostic by mounting secrets as volumes and discovering source and destination store types at runtime.
Any thoughts about the pros/cons?
@dhirajsb My thinking is that this would be agnostic, so the generic case you have outlined. It will look for mounted env vars, command args, etc. and perform a generic replication task using those parameters. If those parameters are malformed, missing, etc., the Job will exit(1).
I suppose if we were to structure it by specifying a source and a destination, we would need a data model for those, and that data would have to live somewhere (likely defined as a CRD). That is quite a bit of overhead for a simple copy/paste job, so I wanted to avoid it if possible.
If we did go that route, the actual mechanism by which we trigger these jobs would be extremely simple, and we could do a lot of heavy lifting on the backend to pass credentials/params into the job. It would also likely require another CRD, like a kubeflow.com/v1alpha1/AsyncJob, whose spec could take references to these source/destination CRs.
I think ^ is actually a good approach in general, as it would allow the MR server itself to expose an endpoint that the UI and/or python-client could trigger (many validations could move to the API, and we could guarantee jobs are executable before we attempt to execute them; it also scales much better in terms of usability).
However, given the time needed to build out such a system, I'm not sure the overhead is worth it at this moment, which is why I'm leaning towards just defining a Job template and building the Dockerfile to expect to be triggered following that template. Basically, I think it's worth proving out the concept and getting it out there sooner; later we can iterate to scale it and improve usability.
WDYT?
> If those parameters are malformed, missing, etc., the Job will exit(1).
Yes, and please let's not forget the requirement:
> in case of failures, leverage terminationMessagePath for a concise description of the failure
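In the Python job that could be as simple as writing a short message to the default termination message path before exiting; a sketch (the path below is the K8s default and is configurable per container via terminationMessagePath):

```python
import sys

TERMINATION_LOG = "/dev/termination-log"  # K8s default terminationMessagePath

def fail(reason: str) -> None:
    """Record a concise failure description for `kubectl describe`, then exit(1)."""
    try:
        with open(TERMINATION_LOG, "w") as f:
            f.write(reason[:4096])  # K8s truncates termination messages at 4096 bytes
    except OSError:
        pass  # still exit non-zero even if the path isn't writable (e.g. local runs)
    print(reason, file=sys.stderr)
    sys.exit(1)

# e.g. fail("destination OCI credentials rejected: 401 from my-registry")
```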
> My thinking is that this would be agnostic, so the generic case you have outlined
+1
and to
> Any thoughts about the pros/cons?
A pro is that we don't need any logic for this in the GUI. Naturally, as you expressed, some checks can be performed early on (e.g. checking that the push credentials are valid, without having to wait for the actual push as part of the job).
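For example, a push-credential check could be as cheap as one authenticated request against the destination registry's /v2/ endpoint before any bytes are copied; a sketch (real auth flows vary by registry, e.g. token-based ones):

```python
import requests

def check_oci_creds(registry: str, username: str, password: str) -> bool:
    """Fail fast if the destination registry rejects the credentials.

    GET /v2/ is the OCI distribution 'ping' endpoint: 200 means the
    credentials were accepted, 401 means they were not.
    """
    resp = requests.get(
        f"https://{registry}/v2/", auth=(username, password), timeout=10
    )
    return resp.status_code == 200
```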
> Basically, I think it's worth proving out the concept and getting it out there sooner; later we can iterate to scale it and improve usability.
+1
> It will look for mounted env vars, command args, etc. and perform a generic replication task using those parameters. If those parameters are malformed, missing, etc., the Job will exit(1).
I would encourage checking whether there is any way we can leverage https://pypi.org/project/gila/ or similar, as previously noted by @dhirajsb, to do it the 12-factor way.
Right, so @tarilabs mostly summarized what we discussed in our 1-on-1 on this.
- Let's build the python executable to use gila and support 12-factor injection of secrets
- Let's build the Job to be a generic/agnostic job, and always inject secrets through paths and have the python job read the properties dynamically
That way the Job doesn't have to carry properties specific to source/destination types. If it did, the client/dashboard would need complete knowledge of what the properties should be. Instead, we can simply point the Job at all the properties by passing the source and destination secrets on different paths, as sketched below.
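Concretely, the dynamic read could just walk each mounted secret directory, since K8s surfaces every Secret key as a file; a sketch under that assumption (the 'type' key is illustrative, not a fixed schema):

```python
from pathlib import Path

def load_connection(secret_dir: str) -> dict[str, str]:
    """Load every key of a mounted Secret into a dict, e.g.
    /mnt/source-creds/AWS_ACCESS_KEY_ID -> {"AWS_ACCESS_KEY_ID": "..."}.

    The '..data'/'..<timestamp>' entries are kubelet bookkeeping and skipped.
    """
    return {
        p.name: p.read_text().strip()
        for p in Path(secret_dir).iterdir()
        if p.is_file() and not p.name.startswith("..")
    }

source = load_connection("/mnt/source-creds")
destination = load_connection("/mnt/destination-creds")
if source.get("type") == "s3":
    ...  # build an S3 client from source["AWS_ACCESS_KEY_ID"], etc.
```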
Does that work?