can verify the MR manifest with KF 1.9.0-rc.0
I've tested with 1.9.0-rc.0 and beyond #101 lgtm
Hi @tarilabs ,
Which notebook image did you use to test? kubeflownotebookswg/jupyter-tensorflow-full:v1.8.0-rc.6 includes Python 3.11.6 which is not supported by model-registry.
Thanks
Tian
Which notebook image did you use to test?
kubeflownotebookswg/jupyter-tensorflow-full:v1.8.0-rc.6 includes Python 3.11.6 which is not supported by model-registry.
we align with Google MLMD here: https://pypi.org/project/ml-metadata/1.14.0/#files
so it mainly derives from the Python versions supported by that dependency.
But thank you for mentioning this, we should probably ~~avoid specifying a constrained python version in the MR client itself, since we are not specifically tied to a python version ourselves~~ (edit: didn't have my coffee) annotate this explicitly in the project configuration.
I tried 2 ways to have a Python 3.10 environment in order to pip install model-registry, which requires Python >=3.8 and <3.11
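That constraint can be checked quickly in the target environment before attempting the install. A small illustrative helper (not part of the model-registry package; the range is the one stated above):

```python
import sys

def in_supported_range(version=None):
    """True if the given (major, minor) pair falls inside model-registry 0.1.2's
    declared Python range of >=3.8,<3.11 (per the discussion above)."""
    version = version or sys.version_info[:2]
    return (3, 8) <= tuple(version) < (3, 11)

print("current interpreter supported:", in_supported_range())
```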
I have:
- kubeflow 1.9.0-rc.0
- ml-metadata==1.15.0
- model-registry==0.1.2
- create one with conda after logging in to the notebook pod created with image kubeflownotebookswg/jupyter-tensorflow-full:v1.8.0-rc.6
- docker build an image with Python 3.10 and use it to create a notebook
And then I used them to try to reproduce the results of the sample code on https://www.kubeflow.org/docs/components/model-registry/getting-started/ (the steps on https://www.kubeflow.org/docs/components/model-registry/installation/ are OK.)
Both have the same problem:
registeredmodel_name = "mnist"
version_name = "v0.1"
rm = registry.register_model(registeredmodel_name,
    "https://github.com/tarilabs/demo20231212/raw/main/v1.nb20231206162408/mnist.onnx",
    model_format_name="onnx",
    model_format_version="1",
    version=version_name,
    description="lorem ipsum mnist",
    metadata={
        "accuracy": 3.14,
        "license": "apache-2.0",
    },
)
---------------------------------------------------------------------------
_InactiveRpcError Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py:237, in MetadataStore._call_method(self, method_name, request, response)
236 try:
--> 237 response.CopyFrom(grpc_method(request, timeout=self._grpc_timeout_sec))
238 except grpc.RpcError as e:
239 # RpcError code uses a tuple to specify error code and short
240 # description.
241 # https://grpc.github.io/grpc/python/_modules/grpc.html#StatusCode
File /opt/conda/lib/python3.10/site-packages/grpc/_channel.py:1181, in _UnaryUnaryMultiCallable.__call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
1175 (
1176 state,
1177 call,
1178 ) = self._blocking(
1179 request, timeout, metadata, credentials, wait_for_ready, compression
1180 )
-> 1181 return _end_unary_response_blocking(state, call, False, None)
File /opt/conda/lib/python3.10/site-packages/grpc/_channel.py:1006, in _end_unary_response_blocking(state, call, with_call, deadline)
1005 else:
-> 1006 raise _InactiveRpcError(state)
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-06-01T12:16:03.20874579+00:00", grpc_status:14, grpc_message:"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure"}"
>
The above exception was the direct cause of the following exception:
UnavailableError Traceback (most recent call last)
Cell In[4], line 3
1 registeredmodel_name = "mnist"
2 version_name = "v0.1"
----> 3 rm = registry.register_model(registeredmodel_name,
4 "https://github.com/tarilabs/demo20231212/raw/main/v1.nb20231206162408/mnist.onnx",
5 model_format_name="onnx",
6 model_format_version="1",
7 version=version_name,
8 description="lorem ipsum mnist",
9 metadata={
10 "accuracy": 3.14,
11 "license": "apache-2.0",
12 }
13 )
File /opt/conda/lib/python3.10/site-packages/model_registry/_client.py:107, in ModelRegistry.register_model(self, name, uri, model_format_name, model_format_version, version, author, description, storage_key, storage_path, service_account_name, metadata)
70 def register_model(
71 self,
72 name: str,
(...)
83 metadata: dict[str, ScalarType] | None = None,
84 ) -> RegisteredModel:
85 """Register a model.
86
87 Either `storage_key` and `storage_path`, or `service_account_name` must be provided.
(...)
105 Registered model.
106 """
--> 107 rm = self._register_model(name)
108 mv = self._register_new_version(
109 rm,
110 version,
(...)
113 metadata=metadata or self.default_metadata(),
114 )
115 self._register_model_artifact(
116 mv,
117 uri,
(...)
122 service_account_name=service_account_name,
123 )
File /opt/conda/lib/python3.10/site-packages/model_registry/_client.py:43, in ModelRegistry._register_model(self, name)
42 def _register_model(self, name: str) -> RegisteredModel:
---> 43 if rm := self._api.get_registered_model_by_params(name):
44 return rm
46 rm = RegisteredModel(name)
File /opt/conda/lib/python3.10/site-packages/model_registry/core.py:121, in ModelRegistryAPIClient.get_registered_model_by_params(self, name, external_id)
119 msg = "Either name or external_id must be provided"
120 raise StoreException(msg)
--> 121 proto_rm = self._store.get_context(
122 RegisteredModel.get_proto_type_name(),
123 name=name,
124 external_id=external_id,
125 )
126 if proto_rm is not None:
127 return RegisteredModel.unmap(proto_rm)
File /opt/conda/lib/python3.10/site-packages/model_registry/store/wrapper.py:155, in MLMDStore.get_context(self, ctx_type_name, id, name, external_id)
137 """Get a context from the store.
138
139 This gets a context either by ID, name or external ID.
(...)
152 StoreException: Invalid arguments.
153 """
154 if name is not None:
--> 155 return self._mlmd_store.get_context_by_type_and_name(ctx_type_name, name)
157 if id is not None:
158 contexts = self._mlmd_store.get_contexts_by_id([id])
File /opt/conda/lib/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py:1631, in MetadataStore.get_context_by_type_and_name(***failed resolving arguments***)
1628 request.type_version = type_version
1629 response = metadata_store_service_pb2.GetContextByTypeAndNameResponse()
-> 1631 self._call('GetContextByTypeAndName', request, response)
1632 if not response.HasField('context'):
1633 return None
File /opt/conda/lib/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py:212, in MetadataStore._call(***failed resolving arguments***)
210 while True:
211 try:
--> 212 return self._call_method(method_name, request, response)
213 except errors.AbortedError:
214 num_retries -= 1
File /opt/conda/lib/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py:242, in MetadataStore._call_method(self, method_name, request, response)
237 response.CopyFrom(grpc_method(request, timeout=self._grpc_timeout_sec))
238 except grpc.RpcError as e:
239 # RpcError code uses a tuple to specify error code and short
240 # description.
241 # https://grpc.github.io/grpc/python/_modules/grpc.html#StatusCode
--> 242 raise errors.make_exception(e.details(), e.code().value[0]) from e
UnavailableError: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure
@tiansiyuan the reproducer is missing connection details, and the error shows the MLMD library being unable to connect to the gRPC service. Are you sure you are using the correct connection details, as mentioned in the tutorial?
The tutorial shows a way to progressively ensure the MR is up, so is that part working?
For the Python version problem, and I would like to underline the requirement is coming from the MLMD library, you could also try with this workaround: https://github.com/kubeflow/model-registry/pull/116/files#diff-6b074bce6a463d7cd6b69e5b1901d4d48c6ff2cd150a40ce849f7a99cb68bce4R105 if you are in the scenario described in the disclaimer notice. Hope that helps!
The tutorial shows a way to progressively ensure the MR is up, so is that part working?
This part works, as illustrated on https://www.kubeflow.org/docs/components/model-registry/installation/
ecs-user@vm-a:~$ kubectl wait --for=condition=available -n kubeflow deployment/model-registry-deployment --timeout=1m
deployment.apps/model-registry-deployment condition met
ecs-user@vm-a:~$ kubectl logs -n kubeflow deployment/model-registry-deployment
I0601 12:13:19.694246 1 proxy.go:32] proxy server started at 0.0.0.0:8080
I0601 12:13:19.694306 1 proxy.go:38] connecting to MLMD server localhost:9090..
I0601 12:13:35.241023 1 proxy.go:50] connected to MLMD server
ecs-user@vm-a:~$ curl -X 'GET' \
> 'http://localhost:8081/api/model_registry/v1alpha3/registered_models?pageSize=100&orderBy=ID&sortOrder=DESC' \
> -H 'accept: application/json' | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 56 100 56 0 0 6222 0 --:--:-- --:--:-- --:--:-- 6222
{
"items": [],
"nextPageToken": "",
"pageSize": 100,
"size": 0
}
Are you sure you are using the correct connection details, as mentioned in the tutorial?
I followed the steps on https://www.kubeflow.org/docs/components/model-registry/getting-started/
@tiansiyuan awesome, so it seems to be a connection issue between your notebook and the service (your notebook seems to be inside a VM, so I'm not sure about that setup...).
This makes me believe I should add a:
- ~~add example instructions for dry-run REST API inside of notebook~~ tracked in https://github.com/kubeflow/model-registry/issues/109#issue-2309990930
My suggestion atm would be to try the workaround for the Python version constraint of the library, as per my previous comment, in a standard Notebook. Hope this helps!
To clarify, the notebook I use is a Kubeflow notebook, looks like:
By the way, there are some things that are not accurate on https://www.kubeflow.org/docs/components/model-registry/installation/, like:
- "You can skip this step if you have already installed Kubeflow >=1.9. Your Kubeflow deployment includes Model Registry": Model Registry is not installed together with Kubeflow 1.9, say 1.9.0-rc.0.
- "modify ref=main to ref=v0.1.2": ref=v0.1.2 does not work; ref=main needs to be kept.
we align with Google MLMD here: https://pypi.org/project/ml-metadata/1.14.0/#files
The latest version of ml-metadata is 1.15.0, it supports python 3.9/3.10/3.11.
@tiansiyuan
By the way, there are something not accurate on https://www.kubeflow.org/docs/components/model-registry/installation/, like:
You can skip this step if you have already installed Kubeflow >=1.9. Your Kubeflow deployment includes Model Registry, Model Registry is not installed together with Kubeflow 1.9, say 1.9.0-rc.0.
this is being corrected with https://github.com/kubeflow/website/pull/3740/files#diff-c3b16833ded8b5282aa1d0c8f6caf09c31b1e44b47f27086757b195a9031f9e8R26 as the decision for Alpha components (such as model registry) not to be included by default was determined in later KF community meetings.
You are welcome to suggest further corrections on that open PR / other PRs if you believe something is still missing.
modify ref=main to ref=v0.1.2, ref=v0.1.2 does not work. ref=main needs to be kept.
per above.
The latest version of ml-metadata is 1.15.0, it supports python 3.9/3.10/3.11
interesting assertion, because here: https://pypi.org/project/ml-metadata/1.15.0/#files I see wheels only for CPython 3.9 and 3.10, and compared to previous MLMD library releases the number of supported combinations appears to be smaller. I will however check it out practically! Thanks for pointing it out.
Thanks for the feedback @tiansiyuan I'm curious to hear if you tried out the workaround I mentioned in https://github.com/kubeflow/model-registry/issues/90#issuecomment-2143460793
The latest version of ml-metadata is 1.15.0, it supports python 3.9/3.10/3.11
as mentioned in my previous comments it does not, as this project/reproducer demonstrates: https://github.com/tarilabs/demo20240601-mlmdversions attached screenshot of my local linux box, and contains github actions demonstrating the same remotely.
I believe for MLMD 1.15.0 they advertise on pypi for >=3.9, but they don't distribute for anything >=3.11, so pragmatically that is available ONLY for 3.9/3.10 for as far as I can see.
Let me know if you believe I missed anything; hope this clarifies.
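For the record, "advertised vs. actually distributed" can be read off mechanically: PEP 427 wheel filenames encode the interpreter tag, so listing the tags of the available wheels shows which interpreters have an installable build. A small sketch with illustrative filenames (abridged and hypothetical; the authoritative list is the PyPI files page linked above):

```python
# Illustrative wheel filenames following the pattern on the PyPI files page
# (not an exact copy; see https://pypi.org/project/ml-metadata/1.15.0/#files):
wheels = [
    "ml_metadata-1.15.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "ml_metadata-1.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
]

def python_tag(wheel_filename):
    """Extract the interpreter tag from a PEP 427 wheel filename.

    PEP 427 layout: {name}-{version}-{python tag}-{abi tag}-{platform tag}.whl
    """
    return wheel_filename.split("-")[2]

# No cp311 tag in the list -> pip on Python 3.11 cannot install 1.15.0
print(sorted({python_tag(w) for w in wheels}))
```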
Yes, I tried to pip install ml-metadata in a Python 3.11 env; it installs ml-metadata==0.13.1.dev0.
If I pip install ml-metadata==1.15.0, it gives the following error message:
ERROR: Could not find a version that satisfies the requirement ml-metadata==1.15.0 (from versions: 0.12.0.dev0, 0.13.0.dev0, 0.13.1.dev0)
ERROR: No matching distribution found for ml-metadata==1.15.0
I will try your workaround and let you know.
Thank you @tarilabs!
Hi @tarilabs
I tried the workaround: https://github.com/kubeflow/model-registry/pull/116/files#diff-6b074bce6a463d7cd6b69e5b1901d4d48c6ff2cd150a40ce849f7a99cb68bce4R105
And got the following result:
Successfully installed absl-py-1.4.0 attrs-21.4.0 ml-metadata-1.14.0+remote.1
Successfully installed model-registry-0.2.1a1
Then,
@tiansiyuan I'm glad to hear you concur about the MLMD requirements, and that the workaround is applicable to your scenario and works.
For the "user token" ~~appears to be a bug, will investigate further, in the meantime this works for me:~~
Edit: pardon me the correct method for connection is:
registry = ModelRegistry(server_address="model-registry-service.kubeflow.svc.cluster.local", port=9090, author="mmortari", is_secure=False)
which works on my end with "0.2.1a1".
On a Python 3.11.6 env, with the following packages installed:
- ml-metadata==1.14.0
- model-registry==0.1.2 (without is_secure=False) or model-registry==0.2.1a1 (with is_secure=False)
I still have the timeout issue.
I still have the timeout issue.
That's very strange @tiansiyuan :thinking: looks like a connection issue between the Notebook and the service, on your setup.
I've double-checked again, and per this screenshot, it's working as expected on my end:
do you want to try the curl command from within your Notebook?
curl model-registry-service.kubeflow.svc.cluster.local:8080/api/model_registry/v1alpha3/registered_models
As you can see in the screenshot, my vanilla Notebook just connects as expected to model registry per the tutorial; I'm showing both the Python client going through, and another tab to demonstrate I can reach the REST API, as expected.
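As an alternative to curl, the same reachability check can be done from a notebook cell with the Python standard library only. This is a hypothetical helper (not part of any model-registry tooling); the hostname and ports are the ones used throughout this thread:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return False

# 8080 = REST proxy, 9090 = MLMD gRPC, per the deployment logs earlier in the thread
host = "model-registry-service.kubeflow.svc.cluster.local"
for port in (8080, 9090):
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

Outside the cluster both checks print False, since the service DNS name only resolves in-cluster.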
Note-to-self, environment details:
- KF manifests v1.9.0-rc.0
- manifests/kustomize/overlays/db/ from MR main
- manifests/kustomize/options/istio/ from MR main
- Python 3.11, just to closely mimic OP's constraints:
curl model-registry-service.kubeflow.svc.cluster.local:8080/api/model_registry/v1alpha3/registered_models
The output of this command is:
upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure
Any other way I can debug it?
Thanks
Any other way I can debug it? Thanks
To me this shows some difference in the KF platform on your setup, or something else in your Kubernetes cluster might be interfering.
Could you try using the same Notebook image (v1.8.0) as per my details expansion, just to be sure?
Otherwise I'm short on ideas atm. On a vanilla KF installation (1.9.0-rc.0), as shown in my previous comment/screenshot, the Notebook can reach the service via both REST and gRPC.
I just created a notebook using image kubeflownotebookswg/jupyter-tensorflow-full:v1.8.0.
And ran the command curl model-registry-service.kubeflow.svc.cluster.local:8080/api/model_registry/v1alpha3/registered_models,
and got the same result: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure
@tiansiyuan thanks, so all else being equal, I can't think of anything beyond KF at play.
Can you share details of which Kubernetes environment you are using? Which version of Minikube (if it's a minikube)?
I was using MicroK8s 1.30 on Ubuntu 20.04.
ecs-user@vm-a:~$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1
ecs-user@vm-a:~$ kustomize version
v5.2.1
ecs-user@vm-a:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
I also tried on a VMware TKGS cluster, with both a Python 3.10 env and the workaround for Python 3.11, and reproduced the timeout issue there too, with both the gRPC and REST APIs:
$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.7+vmware.3-fips.1
WARNING: version difference between client (1.29) and server (1.25) exceeds the supported minor version skew of +/-1
@tiansiyuan I believe on your testing cluster since network policies are enforced by default, you need this: https://github.com/kubeflow/manifests/pull/2724
@tarilabs Yes! It works for me.
However, when I follow the example code on https://www.kubeflow.org/docs/components/model-registry/getting-started/
from model_registry import ModelRegistry
registry = ModelRegistry(server_address="model-registry-service.kubeflow.svc.cluster.local", port=9090, author="mmortari")
lookup_name = "mnist"
lookup_version="v20231206163028"
print("RegisteredModel:")
registered_model = registry.get_registered_model(lookup_name)
print(registered_model)
print("ModelVersion:")
model_version = registry.get_model_version(lookup_name, lookup_version)
print(model_version)
print("ModelArtifact:")
model_artifact = registry.get_model_artifact(lookup_name, lookup_version)
print(model_artifact)
storage_uri = model_artifact.uri
model_format_name = model_artifact.model_format_name
model_format_version = model_artifact.model_format_version
I've got:
---------------------------------------------------------------------------
StoreException Traceback (most recent call last)
Cell In[14], line 15
13 print(model_version)
14 print("ModelArtifact:")
---> 15 model_artifact = registry.get_model_artifact(lookup_name, lookup_version)
16 print(model_artifact)
18 storage_uri = model_artifact.uri
File /opt/conda/lib/python3.10/site-packages/model_registry/_client.py:277, in ModelRegistry.get_model_artifact(self, name, version)
275 if not (mv := self.get_model_version(name, version)):
276 msg = f"Version {version} does not exist"
--> 277 raise StoreException(msg)
278 return self._api.get_model_artifact_by_params(mv.id)
StoreException: Version v20231206163028 does not exist
This seems easy to solve.
lookup_version="v20231206163028" needs to be changed to lookup_version="v0.1", or something consistent with the earlier step (version_name = "v0.1").
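The mismatch can be illustrated with a toy stand-in for the lookup logic (hypothetical names and a plain dict store; the real implementation lives in model_registry/_client.py):

```python
class StoreException(Exception):
    """Stand-in for the client's StoreException."""

# Toy in-memory store: only the version registered earlier in the thread exists
registered_versions = {("mnist", "v0.1"): {"uri": "…/mnist.onnx"}}

def get_model_artifact(name, version):
    """Mimic the client's failure mode when a version was never registered."""
    if (name, version) not in registered_versions:
        raise StoreException(f"Version {version} does not exist")
    return registered_versions[(name, version)]

try:
    get_model_artifact("mnist", "v20231206163028")  # the tutorial's lookup_version
except StoreException as e:
    print(e)  # Version v20231206163028 does not exist

print(get_model_artifact("mnist", "v0.1"))  # succeeds with the registered version
```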
the 2 tutorials in "getting started" are not necessarily strongly tied together, but fair point; I've added a follow-up task in: https://github.com/kubeflow/model-registry/issues/109#issue-2309990930
I'm glad to hear that solved your connection issue @tiansiyuan, and thanks for all this feedback!
lookup_version="v20231206163028" needs to be changed to: lookup_version="v0.1" or something consistent with the above step, version_name = "v0.1".
They are on the same page: https://www.kubeflow.org/docs/components/model-registry/getting-started/
Thank you for your help and patience. @tarilabs
I tried the workaround with Python 3.11.9 on Kubeflow 1.9.0-rc.1 on MicroK8s v1.28.9.
It works. It just gives a warning:
/tmp/ipykernel_106/1782746478.py:3: UserWarning: User access token is missing
registry = ModelRegistry(server_address="model-registry-service.kubeflow.svc.cluster.local", port=9090, author="mmortari", is_secure=False)
as discussed at the KF bi-weekly on 2024-06-10, closing this issue, as we have at least 2 different k8s environments tested with the KF 1.9 rc(s)
(we will open new ones if a specific issue or bug is detected)