can verify the MR manifest with KF 1.9.0-rc.0
I've tested with 1.9.0-rc.0 and beyond #101 lgtm
Hi @tarilabs ,
Which notebook image did you use to test? kubeflownotebookswg/jupyter-tensorflow-full:v1.8.0-rc.6 includes Python 3.11.6 which is not supported by model-registry.
Thanks
Tian
Which notebook image did you use to test?
kubeflownotebookswg/jupyter-tensorflow-full:v1.8.0-rc.6 includes Python 3.11.6 which is not supported by model-registry.
we align with Google MLMD here: https://pypi.org/project/ml-metadata/1.14.0/#files
so it mainly derives from the Python versions supported by that dependency.
But thank you for mentioning this, we should probably ~~avoid specifying a constrained python version in the MR client itself, since we are not specifically tied to a python version ourselves~~ (edit: didn't have my coffee) annotate this explicitly in the project configuration.
I tried 2 ways to have a Python 3.10 environment in order to pip install model-registry, which requires Python >=3.8 and <3.11
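That constraint can be checked quickly in the target environment before attempting the install. A small illustrative helper (not part of the model-registry package; the range is the one stated above):

```python
import sys

def in_supported_range(version=None):
    """True if the given (major, minor) pair falls inside model-registry 0.1.2's
    declared Python range of >=3.8,<3.11 (per the discussion above)."""
    version = version or sys.version_info[:2]
    return (3, 8) <= tuple(version) < (3, 11)

print("current interpreter supported:", in_supported_range())
```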
I have:
- kubeflow 1.9.0-rc.0
- ml-metadata==1.15.0
- model-registry==0.1.2
- create one with conda after logging in to the notebook pod created with image kubeflownotebookswg/jupyter-tensorflow-full:v1.8.0-rc.6
- docker build an image with Python 3.10 and use it to create a notebook
And then I used them to try to reproduce the results of the sample code on https://www.kubeflow.org/docs/components/model-registry/getting-started/ (the steps on https://www.kubeflow.org/docs/components/model-registry/installation/ are OK.)
Both have the same problem:
registeredmodel_name = "mnist"
version_name = "v0.1"
rm = registry.register_model(registeredmodel_name,
    "https://github.com/tarilabs/demo20231212/raw/main/v1.nb20231206162408/mnist.onnx",
    model_format_name="onnx",
    model_format_version="1",
    version=version_name,
    description="lorem ipsum mnist",
    metadata={
        "accuracy": 3.14,
        "license": "apache-2.0",
    },
)
---------------------------------------------------------------------------
_InactiveRpcError Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py:237, in MetadataStore._call_method(self, method_name, request, response)
236 try:
--> 237 response.CopyFrom(grpc_method(request, timeout=self._grpc_timeout_sec))
238 except grpc.RpcError as e:
239 # RpcError code uses a tuple to specify error code and short
240 # description.
241 # https://grpc.github.io/grpc/python/_modules/grpc.html#StatusCode
File /opt/conda/lib/python3.10/site-packages/grpc/_channel.py:1181, in _UnaryUnaryMultiCallable.__call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
1175 (
1176 state,
1177 call,
1178 ) = self._blocking(
1179 request, timeout, metadata, credentials, wait_for_ready, compression
1180 )
-> 1181 return _end_unary_response_blocking(state, call, False, None)
File /opt/conda/lib/python3.10/site-packages/grpc/_channel.py:1006, in _end_unary_response_blocking(state, call, with_call, deadline)
1005 else:
-> 1006 raise _InactiveRpcError(state)
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-06-01T12:16:03.20874579+00:00", grpc_status:14, grpc_message:"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure"}"
>
The above exception was the direct cause of the following exception:
UnavailableError Traceback (most recent call last)
Cell In[4], line 3
1 registeredmodel_name = "mnist"
2 version_name = "v0.1"
----> 3 rm = registry.register_model(registeredmodel_name,
4 "https://github.com/tarilabs/demo20231212/raw/main/v1.nb20231206162408/mnist.onnx",
5 model_format_name="onnx",
6 model_format_version="1",
7 version=version_name,
8 description="lorem ipsum mnist",
9 metadata={
10 "accuracy": 3.14,
11 "license": "apache-2.0",
12 }
13 )
File /opt/conda/lib/python3.10/site-packages/model_registry/_client.py:107, in ModelRegistry.register_model(self, name, uri, model_format_name, model_format_version, version, author, description, storage_key, storage_path, service_account_name, metadata)
70 def register_model(
71 self,
72 name: str,
(...)
83 metadata: dict[str, ScalarType] | None = None,
84 ) -> RegisteredModel:
85 """Register a model.
86
87 Either `storage_key` and `storage_path`, or `service_account_name` must be provided.
(...)
105 Registered model.
106 """
--> 107 rm = self._register_model(name)
108 mv = self._register_new_version(
109 rm,
110 version,
(...)
113 metadata=metadata or self.default_metadata(),
114 )
115 self._register_model_artifact(
116 mv,
117 uri,
(...)
122 service_account_name=service_account_name,
123 )
File /opt/conda/lib/python3.10/site-packages/model_registry/_client.py:43, in ModelRegistry._register_model(self, name)
42 def _register_model(self, name: str) -> RegisteredModel:
---> 43 if rm := self._api.get_registered_model_by_params(name):
44 return rm
46 rm = RegisteredModel(name)
File /opt/conda/lib/python3.10/site-packages/model_registry/core.py:121, in ModelRegistryAPIClient.get_registered_model_by_params(self, name, external_id)
119 msg = "Either name or external_id must be provided"
120 raise StoreException(msg)
--> 121 proto_rm = self._store.get_context(
122 RegisteredModel.get_proto_type_name(),
123 name=name,
124 external_id=external_id,
125 )
126 if proto_rm is not None:
127 return RegisteredModel.unmap(proto_rm)
File /opt/conda/lib/python3.10/site-packages/model_registry/store/wrapper.py:155, in MLMDStore.get_context(self, ctx_type_name, id, name, external_id)
137 """Get a context from the store.
138
139 This gets a context either by ID, name or external ID.
(...)
152 StoreException: Invalid arguments.
153 """
154 if name is not None:
--> 155 return self._mlmd_store.get_context_by_type_and_name(ctx_type_name, name)
157 if id is not None:
158 contexts = self._mlmd_store.get_contexts_by_id([id])
File /opt/conda/lib/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py:1631, in MetadataStore.get_context_by_type_and_name(***failed resolving arguments***)
1628 request.type_version = type_version
1629 response = metadata_store_service_pb2.GetContextByTypeAndNameResponse()
-> 1631 self._call('GetContextByTypeAndName', request, response)
1632 if not response.HasField('context'):
1633 return None
File /opt/conda/lib/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py:212, in MetadataStore._call(***failed resolving arguments***)
210 while True:
211 try:
--> 212 return self._call_method(method_name, request, response)
213 except errors.AbortedError:
214 num_retries -= 1
File /opt/conda/lib/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py:242, in MetadataStore._call_method(self, method_name, request, response)
237 response.CopyFrom(grpc_method(request, timeout=self._grpc_timeout_sec))
238 except grpc.RpcError as e:
239 # RpcError code uses a tuple to specify error code and short
240 # description.
241 # https://grpc.github.io/grpc/python/_modules/grpc.html#StatusCode
--> 242 raise errors.make_exception(e.details(), e.code().value[0]) from e
UnavailableError: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure
@tiansiyuan the reproducer is missing connection details, and the error shows the MLMD library being unable to connect to the gRPC service. Are you sure you are using the correct connection details, as mentioned in the tutorial?
The tutorial shows a way to progressively ensure the MR is up, so is that part working?
For the Python version problem, and I would like to underline the requirement is coming from the MLMD library, you could also try with this workaround: https://github.com/kubeflow/model-registry/pull/116/files#diff-6b074bce6a463d7cd6b69e5b1901d4d48c6ff2cd150a40ce849f7a99cb68bce4R105 if you are in the scenario described in the disclaimer notice. Hope that helps!
The tutorial shows a way to progressively ensure the MR is up, so is that part working?
This part works, as illustrated on https://www.kubeflow.org/docs/components/model-registry/installation/
ecs-user@vm-a:~$ kubectl wait --for=condition=available -n kubeflow deployment/model-registry-deployment --timeout=1m
deployment.apps/model-registry-deployment condition met
ecs-user@vm-a:~$ kubectl logs -n kubeflow deployment/model-registry-deployment
I0601 12:13:19.694246 1 proxy.go:32] proxy server started at 0.0.0.0:8080
I0601 12:13:19.694306 1 proxy.go:38] connecting to MLMD server localhost:9090..
I0601 12:13:35.241023 1 proxy.go:50] connected to MLMD server
ecs-user@vm-a:~$ curl -X 'GET' \
> 'http://localhost:8081/api/model_registry/v1alpha3/registered_models?pageSize=100&orderBy=ID&sortOrder=DESC' \
> -H 'accept: application/json' | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 56 100 56 0 0 6222 0 --:--:-- --:--:-- --:--:-- 6222
{
"items": [],
"nextPageToken": "",
"pageSize": 100,
"size": 0
}
Are you sure you are using the correct connection details, as mentioned in the tutorial?
I followed the steps on https://www.kubeflow.org/docs/components/model-registry/getting-started/
@tiansiyuan awesome, so it seems to be a connection issue between your notebook and the service (your notebook seems to be inside a VM, so I'm not sure about that setup...).
This makes me believe I should add a:
- ~~add example instructions for dry-run REST API inside of notebook~~ tracked in https://github.com/kubeflow/model-registry/issues/109#issue-2309990930
My suggestion atm would be to try the workaround for the Python version constraint of the library, as per my previous comment, in a standard Notebook. Hope this helps!
To clarify, the notebook I use is a Kubeflow notebook, looks like:
By the way, there are some things that are not accurate on https://www.kubeflow.org/docs/components/model-registry/installation/, like:
- "You can skip this step if you have already installed Kubeflow >=1.9. Your Kubeflow deployment includes Model Registry": Model Registry is not installed together with Kubeflow 1.9, say 1.9.0-rc.0.
- "modify ref=main to ref=v0.1.2": ref=v0.1.2 does not work; ref=main needs to be kept.
we align with Google MLMD here: https://pypi.org/project/ml-metadata/1.14.0/#files
The latest version of ml-metadata is 1.15.0, it supports python 3.9/3.10/3.11.
@tiansiyuan
By the way, there are something not accurate on https://www.kubeflow.org/docs/components/model-registry/installation/, like:
You can skip this step if you have already installed Kubeflow >=1.9. Your Kubeflow deployment includes Model Registry, Model Registry is not installed together with Kubeflow 1.9, say 1.9.0-rc.0.
this is being corrected with https://github.com/kubeflow/website/pull/3740/files#diff-c3b16833ded8b5282aa1d0c8f6caf09c31b1e44b47f27086757b195a9031f9e8R26 as the decision for Alpha components (such as model registry) not to be included by default was determined in later KF community meetings.
You are welcome to suggest further corrections on that open PR / other PRs if you believe something is still missing.
modify ref=main to ref=v0.1.2, ref=v0.1.2 does not work. ref=main needs to be kept.
per above.
The latest version of ml-metadata is 1.15.0, it supports python 3.9/3.10/3.11
interesting assertion, because here: https://pypi.org/project/ml-metadata/1.15.0/#files I see wheels only for CPython 3.9 and 3.10, and compared to previous MLMD library releases the number of supported combinations appears to be smaller. I will however check it out practically! Thanks for pointing it out.
Thanks for the feedback @tiansiyuan I'm curious to hear if you tried out the workaround I mentioned in https://github.com/kubeflow/model-registry/issues/90#issuecomment-2143460793
The latest version of ml-metadata is 1.15.0, it supports python 3.9/3.10/3.11
as mentioned in my previous comments it does not, as this project/reproducer demonstrates: https://github.com/tarilabs/demo20240601-mlmdversions attached screenshot of my local linux box, and contains github actions demonstrating the same remotely.
I believe for MLMD 1.15.0 they advertise on pypi for >=3.9, but they don't distribute for anything >=3.11, so pragmatically that is available ONLY for 3.9/3.10 for as far as I can see.
Let me know if you believe I missed anything; hope this clarifies.
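For the record, "advertised vs. actually distributed" can be read off mechanically: PEP 427 wheel filenames encode the interpreter tag, so listing the tags of the available wheels shows which interpreters have an installable build. A small sketch with illustrative filenames (abridged and hypothetical; the authoritative list is the PyPI files page linked above):

```python
# Illustrative wheel filenames following the pattern on the PyPI files page
# (not an exact copy; see https://pypi.org/project/ml-metadata/1.15.0/#files):
wheels = [
    "ml_metadata-1.15.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "ml_metadata-1.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
]

def python_tag(wheel_filename):
    """Extract the interpreter tag from a PEP 427 wheel filename.

    PEP 427 layout: {name}-{version}-{python tag}-{abi tag}-{platform tag}.whl
    """
    return wheel_filename.split("-")[2]

# No cp311 tag in the list -> pip on Python 3.11 cannot install 1.15.0
print(sorted({python_tag(w) for w in wheels}))
```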
Yes, I tried to pip install ml-metadata in a Python 3.11 env; it installs ml-metadata==0.13.1.dev0.
If I pip install ml-metadata==1.15.0, it gives the following error message:
ERROR: Could not find a version that satisfies the requirement ml-metadata==1.15.0 (from versions: 0.12.0.dev0, 0.13.0.dev0, 0.13.1.dev0)
ERROR: No matching distribution found for ml-metadata==1.15.0
I will try your workaround and let you know.
Thank you @tarilabs!
Hi @tarilabs
I tried the workaround: https://github.com/kubeflow/model-registry/pull/116/files#diff-6b074bce6a463d7cd6b69e5b1901d4d48c6ff2cd150a40ce849f7a99cb68bce4R105
And got the following result:
Successfully installed absl-py-1.4.0 attrs-21.4.0 ml-metadata-1.14.0+remote.1
Successfully installed model-registry-0.2.1a1
Then,
@tiansiyuan I'm glad to hear you concur about the MLMD requirements, and that the workaround is applicable to your scenario and works.
For the "user token" ~~appears to be a bug, will investigate further, in the meantime this works for me:~~
Edit: pardon me the correct method for connection is:
registry = ModelRegistry(server_address="model-registry-service.kubeflow.svc.cluster.local", port=9090, author="mmortari", is_secure=False)
which works on my end with "0.2.1a1".
On a Python 3.11.6 env, with the following packages installed:
- ml-metadata==1.14.0
- model-registry==0.1.2 (without is_secure=False) or model-registry==0.2.1a1 (with is_secure=False)
I still have the timeout issue.
I still have the timeout issue.
That's very strange @tiansiyuan :thinking: looks like a connection issue between the Notebook and the service, on your setup.
I've double-checked again, and per this screenshot, it's working as expected on my end:
do you want to try the curl command from within your Notebook?
curl model-registry-service.kubeflow.svc.cluster.local:8080/api/model_registry/v1alpha3/registered_models
As you can see in the screenshot, my vanilla Notebook just connects as expected to model registry per the tutorial; I'm showing both the Python client going through, and another tab to demonstrate I can reach the REST API, as expected.
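As an alternative to curl, the same reachability check can be done from a notebook cell with the Python standard library only. This is a hypothetical helper (not part of any model-registry tooling); the hostname and ports are the ones used throughout this thread:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return False

# 8080 = REST proxy, 9090 = MLMD gRPC, per the deployment logs earlier in the thread
host = "model-registry-service.kubeflow.svc.cluster.local"
for port in (8080, 9090):
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

Outside the cluster both checks print False, since the service DNS name only resolves in-cluster.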
Note-to-self, environment details:
- KF manifests v1.9.0-rc.0
- manifests/kustomize/overlays/db/ from MR main
- manifests/kustomize/options/istio/ from MR main
- Python 3.11, just to closely mimic OP's constraints:
curl model-registry-service.kubeflow.svc.cluster.local:8080/api/model_registry/v1alpha3/registered_models
The output of this command is:
upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure
Any other way I can debug it?
Thanks
Any other way I can debug it? Thanks
To me this shows some difference in the KF platform on your setup, or something else in your Kubernetes cluster might be interfering.
Could you try using the same Notebook image (v1.8.0) as per my details expansion, just to be sure?
Otherwise I'm short on ideas atm. On a vanilla KF installation (1.9.0-rc.0), as shown in my previous comment/screenshot, the Notebook can reach the service via both REST and gRPC.
I just created a notebook using image kubeflownotebookswg/jupyter-tensorflow-full:v1.8.0.
And ran the command curl model-registry-service.kubeflow.svc.cluster.local:8080/api/model_registry/v1alpha3/registered_models,
and got the same result: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure
@tiansiyuan thanks, so all else being equal, I can't think of anything beyond KF at play.
Can you share details of which Kubernetes environment you are using? Which version of Minikube (if it's a minikube)?
I was using MicroK8s 1.30 on Ubuntu 20.04.
ecs-user@vm-a:~$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1
ecs-user@vm-a:~$ kustomize version
v5.2.1
ecs-user@vm-a:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
I also tried on a VMware TKGS cluster, with both a Python 3.10 env and the workaround for Python 3.11, and reproduced the timeout issue there too, with both the gRPC and REST APIs:
$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.7+vmware.3-fips.1
WARNING: version difference between client (1.29) and server (1.25) exceeds the supported minor version skew of +/-1
@tiansiyuan I believe on your testing cluster since network policies are enforced by default, you need this: https://github.com/kubeflow/manifests/pull/2724
@tarilabs Yes! It works for me.
However, when I follow the example code on https://www.kubeflow.org/docs/components/model-registry/getting-started/
from model_registry import ModelRegistry
registry = ModelRegistry(server_address="model-registry-service.kubeflow.svc.cluster.local", port=9090, author="mmortari")
lookup_name = "mnist"
lookup_version="v20231206163028"
print("RegisteredModel:")
registered_model = registry.get_registered_model(lookup_name)
print(registered_model)
print("ModelVersion:")
model_version = registry.get_model_version(lookup_name, lookup_version)
print(model_version)
print("ModelArtifact:")
model_artifact = registry.get_model_artifact(lookup_name, lookup_version)
print(model_artifact)
storage_uri = model_artifact.uri
model_format_name = model_artifact.model_format_name
model_format_version = model_artifact.model_format_version
I've got:
---------------------------------------------------------------------------
StoreException Traceback (most recent call last)
Cell In[14], line 15
13 print(model_version)
14 print("ModelArtifact:")
---> 15 model_artifact = registry.get_model_artifact(lookup_name, lookup_version)
16 print(model_artifact)
18 storage_uri = model_artifact.uri
File /opt/conda/lib/python3.10/site-packages/model_registry/_client.py:277, in ModelRegistry.get_model_artifact(self, name, version)
275 if not (mv := self.get_model_version(name, version)):
276 msg = f"Version {version} does not exist"
--> 277 raise StoreException(msg)
278 return self._api.get_model_artifact_by_params(mv.id)
StoreException: Version v20231206163028 does not exist
This seems easy to solve.
lookup_version="v20231206163028" needs to be changed to lookup_version="v0.1", or something consistent with the earlier step (version_name = "v0.1").
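The mismatch can be illustrated with a toy stand-in for the lookup logic (hypothetical names and a plain dict store; the real implementation lives in model_registry/_client.py):

```python
class StoreException(Exception):
    """Stand-in for the client's StoreException."""

# Toy in-memory store: only the version registered earlier in the thread exists
registered_versions = {("mnist", "v0.1"): {"uri": "…/mnist.onnx"}}

def get_model_artifact(name, version):
    """Mimic the client's failure mode when a version was never registered."""
    if (name, version) not in registered_versions:
        raise StoreException(f"Version {version} does not exist")
    return registered_versions[(name, version)]

try:
    get_model_artifact("mnist", "v20231206163028")  # the tutorial's lookup_version
except StoreException as e:
    print(e)  # Version v20231206163028 does not exist

print(get_model_artifact("mnist", "v0.1"))  # succeeds with the registered version
```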
the 2 tutorials in "getting started" are not necessarily strongly tied together, but fair point; I've added a follow-up task in: https://github.com/kubeflow/model-registry/issues/109#issue-2309990930
I'm glad to hear that solved your connection issue @tiansiyuan, and thanks for all this feedback!
lookup_version="v20231206163028" needs to be changed to: lookup_version="v0.1" or something consistent with the above step, version_name = "v0.1".
They are on the same page: https://www.kubeflow.org/docs/components/model-registry/getting-started/
Thank you for your help and patience. @tarilabs
I tried the workaround with Python 3.11.9 on Kubeflow 1.9.0-rc.1 on MicroK8s v1.28.9.
It works. It just gives a warning:
/tmp/ipykernel_106/1782746478.py:3: UserWarning: User access token is missing
registry = ModelRegistry(server_address="model-registry-service.kubeflow.svc.cluster.local", port=9090, author="mmortari", is_secure=False)
as discussed at the KF bi-weekly on 2024-06-10, closing this issue, as we have at least 2 different k8s environments tested with the KF 1.9 rc(s)
(we will open new ones if a specific issue or bug is detected)