kubeflow-manifests
Logs not showing for Inference Service
Describe the bug
Model web app logs are not showing:
The browser console shows that the request to fetch the logs returns a 500 in a loop:
Steps To Reproduce
- Install Kubeflow from main (v1.4.1) with the cognito-rds-s3 and kfserving setup
Expected behavior
- The logs are available and shown.
Environment
- Kubernetes version 1.21
- Using EKS (yes/no), if so version? Yes
- Kubeflow version 1.4.1
- AWS build number main
- AWS service targeted (S3, RDS, etc.) NA
Will repro this and see what is going on
Hi @AlexandreBrown, surprisingly I am unable to reproduce the issue so far.
My setup:
- main branch in this repo, v1.4.1 in kubeflow/manifests
- Cognito setup via https://github.com/awslabs/kubeflow-manifests/blob/main/docs/deployment/cognito/README-automated.md
- RDS/S3 setup via https://github.com/awslabs/kubeflow-manifests/tree/main/docs/deployment/rds-s3#21-option-1-automated-setup
- Created the following InferenceService using this YAML:

apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    sklearn:
      storageUri: "gs://kfserving-samples/models/sklearn/iris"
I was able to view logs:
However, I have a few questions:
- With regards to the kfserving setup you mentioned in Steps to Reproduce, which setup steps are you referring to?
- If you have run the setup here, did you also apply the authorization policy configurations you mentioned?
- Are you able to view the inference pod in the profile's namespace?
- If the pod is active, is its kfserving-container present and running without issues? It seems that the logs are pulled from this container. Here are the other containers in the pod in case this turns out to be relevant:
$ kubectl logs -n user1 sklearn-iris-predictor-default-00001-deployment-58f94dc4946sb55 kfserving-container

Other containers in the pod: istio-init, istio-proxy, queue-proxy, storage-initializer
Additional debugging:
There are more logs in the kfserving-models-web-app pod. For me, I was able to view them via:
kubectl logs -n kubeflow kfserving-models-web-app-7884f597cf-cllp5
In the same browser window where I am logged into the central dashboard as user user1, I opened a new tab and pasted the following URL: https://kubeflow.kfissue145.rkharse.people.aws.dev/models/api/namespaces/user1/inferenceservices/sklearn-iris?logs=true&component=predictor
Broken down, this URL is: https://<public dns endpoint>/models/api/namespaces/<user profile namespace>/inferenceservices/<model name>?logs=true&component=predictor
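As an aside, the same endpoint can be exercised outside the browser to rule out a frontend problem. A rough sketch, not from this thread: the host, namespace, model name, and the ALB session cookie name are assumptions you would replace with your own values (with Cognito behind an ALB the browser session is usually carried in AWSELBAuthSessionCookie-0/-1 cookies, which you can copy from the browser dev tools after logging in):

import requests

HOST = "https://<public dns endpoint>"   # your Kubeflow endpoint
NAMESPACE = "<user profile namespace>"   # e.g. "dev"
MODEL = "<model name>"                   # e.g. "sklearn-iris"

# Assumption: session cookie copied from an authenticated browser session.
cookies = {"AWSELBAuthSessionCookie-0": "<value from your browser>"}

url = f"{HOST}/models/api/namespaces/{NAMESPACE}/inferenceservices/{MODEL}"
resp = requests.get(url, params={"logs": "true", "component": "predictor"}, cookies=cookies)

print(resp.status_code)  # a 500 here reproduces the backend error seen in the UI
print(resp.json())       # on success the payload should include the predictor logs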
Meanwhile I tailed the logs
kubectl logs -n kubeflow kfserving-models-web-app-7884f597cf-cllp5 -f
and got the following response:
2022-03-24 21:33:37,422 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: [email protected]
2022-03-24 21:33:37,422 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
2022-03-24 21:33:37,445 | apps.common.routes.get | INFO | ['predictor']
2022-03-24 21:33:37,511 | root | INFO | Component pods: {'predictor': ['sklearn-iris-predictor-default-00001-deployment-58f94dc4946sb55']}
127.0.0.1 - - [24/Mar/2022:21:33:37 +0000] "GET /api/namespaces/user1/inferenceservices/sklearn-iris?logs=true&component=predictor HTTP/1.1" 200 6137 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Firefox/91.0"
In your case I believe there will be more details here and a 500 error.
I will continue trying to repro but feel free to let me know if this helped uncover any additional details.
@rrrkharse
With regards to the kfserving setup you mentioned in Steps to Reproduce, which setup steps are you referring to?
I am referring to this guide
If you have run the setup (https://github.com/awslabs/kubeflow-manifests/issues/82), did you also apply the authorization policy configurations you mentioned (https://github.com/awslabs/kubeflow-manifests/issues/82#issuecomment-1068641378)?
Yes I have. I also tried re-installing Kubeflow multiple times; it didn't fix it.
Are you able to view the inference pod in the profile's namespace?
Will check this as well as your other questions and get back to you.
I tried visiting my https://... URL and the output was:
{
"log": "An error occured in the backend.",
"status": 500,
"success": false,
"user": "my-email"
}
Here are the logs from the network tab in Chrome.
Looks like only the logs request is failing; the model is recognized as healthy.
Also, I am using Kubeflow without port-forwarding and I've set up the Grafana metrics, which work; it's just the logs that do not work for some odd reason.
@rrrkharse @goswamig @surajkota @ryansteakley
Hello, could we get some help on this issue?
Command :
kubectl get inferenceservice -n dev
Output :
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
sklearn-iris http://sklearn-iris.dev.ai-rd.mdinfinity.com/ True 100 sklearn-iris-predictor-default-00001 3m35s
Command :
kubectl get pods --all-namespaces
Output :
NAMESPACE NAME READY STATUS RESTARTS AGE
dev sklearn-iris-predictor-default-00001-deployment-5558697cb-t8q7b 3/3 Running 0 3m32s
(All other pods running)
Command :
kubectl logs -n dev sklearn-iris-predictor-default-00001-deployment-5558697cb-t8q7b -c kfserving-container
Output :
[I 220404 23:46:49 storage:50] Copying contents of /mnt/models to local
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:35: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
eps=np.finfo(np.float).eps,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:597: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
eps=np.finfo(np.float).eps, copy_X=True, fit_path=True,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:836: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
eps=np.finfo(np.float).eps, copy_X=True, fit_path=True,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:862: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
eps=np.finfo(np.float).eps, positive=False):
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:1097: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
max_n_alphas=1000, n_jobs=None, eps=np.finfo(np.float).eps,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:1344: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
max_n_alphas=1000, n_jobs=None, eps=np.finfo(np.float).eps,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:1480: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
eps=np.finfo(np.float).eps, copy_X=True, positive=False):
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/randomized_l1.py:152: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
precompute=False, eps=np.finfo(np.float).eps,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/randomized_l1.py:320: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
eps=np.finfo(np.float).eps, random_state=None,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/randomized_l1.py:580: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
eps=4 * np.finfo(np.float).eps, n_jobs=None,
[I 220404 23:46:49 kfserver:151] Registering model: sklearn-iris
[I 220404 23:46:49 kfserver:121] Setting asyncio max_workers as 5
[I 220404 23:46:49 kfserver:128] Listening on port 8080
[I 220404 23:46:49 kfserver:130] Will fork 1 workers
Command :
kubectl logs -n kubeflow kfserving-models-web-app-7884f597cf-bgpk4 --tail=30
Output :
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/kfserving-models-web-app-7884f597cf-bgpk4. Please use `kubectl.kubernetes.io/default-container` instead
2022-04-05 00:20:26,660 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: [email protected]
2022-04-05 00:20:26,660 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
127.0.0.1 - - [05/Apr/2022:00:20:26 +0000] "GET /api/namespaces/dev/configurations/sklearn-iris-predictor-default HTTP/1.1" 200 5691 "https://our_kubeflow_url.com/models/details/dev/sklearn-iris" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
2022-04-05 00:20:26,731 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: [email protected]
2022-04-05 00:20:26,731 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
127.0.0.1 - - [05/Apr/2022:00:20:26 +0000] "GET /api/namespaces/dev/knativeServices/sklearn-iris-predictor-default HTTP/1.1" 200 6253 "https://our_kubeflow_url.com/models/details/dev/sklearn-iris" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
2022-04-05 00:20:26,816 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: [email protected]
2022-04-05 00:20:26,816 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
127.0.0.1 - - [05/Apr/2022:00:20:26 +0000] "GET /api/namespaces/dev/routes/sklearn-iris-predictor-default HTTP/1.1" 200 4226 "https://our_kubeflow_url.com/models/details/dev/sklearn-iris" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
2022-04-05 00:20:27,530 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: [email protected]
2022-04-05 00:20:27,530 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
2022-04-05 00:20:27,542 | apps.common.routes.get | INFO | ['predictor']
2022-04-05 00:20:27,557 | kubeflow.kubeflow.crud_backend.errors.handlers | ERROR | Caught and unhandled Exception!
2022-04-05 00:20:27,557 | kubeflow.kubeflow.crud_backend.errors.handlers | ERROR | argument of type 'NoneType' is not iterable
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1513, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1499, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/src/apps/common/routes/get.py", line 27, in get_inference_service
"serviceLogs", get_inference_service_logs(inference_service)
File "/src/apps/common/routes/get.py", line 40, in get_inference_service_logs
component_pods_dict = utils.get_inference_service_pods(svc, components)
File "/src/apps/common/utils.py", line 42, in get_inference_service_pods
if KNATIVE_REVISION_LABEL not in pod.metadata.labels:
TypeError: argument of type 'NoneType' is not iterable
127.0.0.1 - - [05/Apr/2022:00:20:27 +0000] "GET /api/namespaces/dev/inferenceservices/sklearn-iris?logs=true&component=predictor HTTP/1.1" 500 133 "https://our_kubeflow_url.com/models/details/dev/sklearn-iris" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
2022-04-05 00:20:30,615 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: [email protected]
2022-04-05 00:20:30,615 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
127.0.0.1 - - [05/Apr/2022:00:20:30 +0000] "GET /api/namespaces/dev/inferenceservices HTTP/1.1" 200 4740 "https://our_kubeflow_url.com/models/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
This last output seems to be where things went wrong.
if KNATIVE_REVISION_LABEL not in pod.metadata.labels:
TypeError: argument of type 'NoneType' is not iterable
Perhaps this might be an issue with the RoleBinding?
Here is how we created namespaces and added users to them:
- Create a profile/namespace
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: dev
- Create default-editor service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: default-editor-access-binding-dev
subjects:
- kind: ServiceAccount
  name: default-editor
  namespace: dev
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
- Apply the RoleBinding for user (taken from kubeflow doc)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    role: edit
    user: [email protected] # replace with the email of the user from your Active Directory case sensitive
  name: user-userid-email-com-clusterrole-edit
  # Ex: if the user email is [email protected] the name should be user-lalith-vaka-kp-org-clusterrole-edit
  # Note: if the user email is [email protected] from your Active Directory, the name should be user-lalith-vaka-kp-org-clusterrole-edit
  namespace: profileName # replace with the namespace/profile name that you are adding contributors to
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-edit
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: [email protected] # replace with the email of the user from your Active Directory case sensitive
- Apply Auth policy (again, taken from kubeflow doc)
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  annotations:
    role: edit
    user: [email protected] # replace with the email of the user from your Active Directory case sensitive
  name: user-userid-email-com-clusterrole-edit
  namespace: profileName # replace with the namespace/profile name that you are adding contributors to
spec:
  action: ALLOW
  rules:
  - when:
    - key: request.headers[kubeflow-userid] # for GCP, use x-goog-authenticated-user-email instead of kubeflow-userid for authentication purpose
      values:
      - accounts.google.com:[email protected] # replace with the email of the user from your Active Directory case sensitive
The values for steps 3 and 4 are set according to the user's email, profile, etc. Everything besides the model server logs works (notebook creation, pipelines, PVCs, etc.).
Hey, the error in kfserving-models-web-app-7884f597cf-bgpk4 is not related to RBAC but to something else.
Here is the code which does that check https://github.com/kserve/models-web-app/blob/b842420d45f0bcf596a219efc116e8c66a727683/backend/apps/common/utils.py#L47
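For readers hitting the same traceback: the backend lists the pods in the profile namespace and checks each pod's labels, so any pod whose metadata.labels is None (i.e. a pod with no labels at all, which you can spot with kubectl get pods -n dev --show-labels) makes the `in` test raise exactly the TypeError above. Below is a simplified sketch of that check with a hypothetical guard added; this is an illustration, not the upstream implementation linked above.

# Simplified illustration of the label check in backend/apps/common/utils.py.
# The empty-labels guard is a hypothetical workaround, not the upstream code.
KNATIVE_REVISION_LABEL = "serving.knative.dev/revision"

def pods_per_component(pods, components):
    """Group pods by component, skipping pods that carry no labels at all."""
    result = {c: [] for c in components}
    for pod in pods:
        labels = pod.metadata.labels
        if not labels:
            # labels is None -> `KNATIVE_REVISION_LABEL not in labels` raises
            # TypeError: argument of type 'NoneType' is not iterable
            continue
        if KNATIVE_REVISION_LABEL not in labels:
            continue
        component = labels.get("component")
        if component in result:
            result[component].append(pod.metadata.name)
    return result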
Can you describe the pod and paste the output?
@goswamig Command :
kubectl describe pod kfserving-models-web-app-7884f597cf-bgpk4 -n kubeflow
Output :
Name: kfserving-models-web-app-7884f597cf-bgpk4
Namespace: kubeflow
Priority: 0
Node: ip-xxx-xxx-xx-xxx.us-east-2.compute.internal/xxx.xxx.xx.xxx
Start Time: Tue, 22 Mar 2022 19:25:52 +0000
Labels: app=kfserving
app.kubernetes.io/component=kfserving-models-web-app
app.kubernetes.io/name=kfserving
istio.io/rev=default
kustomize.component=kfserving-models-web-app
pod-template-hash=7884f597cf
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=kfserving
service.istio.io/canonical-revision=latest
Annotations: kubectl.kubernetes.io/default-logs-container: kfserving-models-web-app
kubernetes.io/psp: eks.privileged
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/inject: true
sidecar.istio.io/status:
{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status: Running
IP: xxx.xxx.xx.xxx
IPs:
IP: xxx.xxx.xx.xxx
Controlled By: ReplicaSet/kfserving-models-web-app-7884f597cf
Init Containers:
istio-init:
Container ID: docker://ac728ab75e1594e0f1a21eea389aa2a5b66b76d985f54c789fc8795672e95680
Image: docker.io/istio/proxyv2:1.9.6
Image ID: docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
Port: <none>
Host Port: <none>
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
-b
*
-d
15090,15021,15020
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 22 Mar 2022 19:25:55 +0000
Finished: Tue, 22 Mar 2022 19:25:55 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xl7cg (ro)
Containers:
kfserving-models-web-app:
Container ID: docker://e6b80d44ba0342e862c21a4aed4aa2903505cc534b716b9e200e72432e835283
Image: kfserving/models-web-app:v0.6.1
Image ID: docker-pullable://kfserving/models-web-app@sha256:f322d6ffdfa3b5a6fbbe3910c4cf4b5ad85871f20dbd2ed854d65f90e1fd8aba
Port: 5000/TCP
Host Port: 0/TCP
State: Running
Started: Tue, 22 Mar 2022 19:25:56 +0000
Ready: True
Restart Count: 0
Environment Variables from:
kfserving-models-web-app-config-mtgm8bbd98 ConfigMap Optional: false
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xl7cg (ro)
istio-proxy:
Container ID: docker://839a6c4dec08563abb79edcd8c7a5da89d3404292c8c7fdff2c0922b79540948
Image: docker.io/istio/proxyv2:1.9.6
Image ID: docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--serviceCluster
kfserving.$(POD_NAMESPACE)
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
--concurrency
2
State: Running
Started: Tue, 22 Mar 2022 19:25:57 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
Environment:
JWT_POLICY: third-party-jwt
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: kfserving-models-web-app-7884f597cf-bgpk4 (v1:metadata.name)
POD_NAMESPACE: kubeflow (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
CANONICAL_SERVICE: (v1:metadata.labels['service.istio.io/canonical-name'])
CANONICAL_REVISION: (v1:metadata.labels['service.istio.io/canonical-revision'])
PROXY_CONFIG: {"tracing":{}}
ISTIO_META_POD_PORTS: [
{"containerPort":5000,"protocol":"TCP"}
]
ISTIO_META_APP_CONTAINERS: kfserving-models-web-app
ISTIO_META_CLUSTER_ID: Kubernetes
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_METAJSON_ANNOTATIONS: {"kubernetes.io/psp":"eks.privileged","sidecar.istio.io/inject":"true"}
ISTIO_META_WORKLOAD_NAME: kfserving-models-web-app
ISTIO_META_OWNER: kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/kfserving-models-web-app
ISTIO_META_MESH_ID: cluster.local
TRUST_DOMAIN: cluster.local
Mounts:
/etc/istio/pod from istio-podinfo (rw)
/etc/istio/proxy from istio-envoy (rw)
/var/lib/istio/data from istio-data (rw)
/var/run/secrets/istio from istiod-ca-cert (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xl7cg (ro)
/var/run/secrets/tokens from istio-token (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
limits.cpu -> cpu-limit
requests.cpu -> cpu-request
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
kube-api-access-xl7cg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
I also tried deleting the pod so it gets recreated, but still the same result: the models web app pod is running but no logs are shown.
Here is an InferenceService and the description of its pod:
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
name: "torchscript-cifar"
spec:
predictor:
triton:
storageUri: "gs://kfserving-examples/models/torchscript"
runtimeVersion: 21.08-py3
env:
- name: OMP_NUM_THREADS
value: "1"
describe :
Name: torchscript-cifar-predictor-default-00001-deployment-579fdpzzbj
Namespace: dev
Priority: 0
Node: ip-xxx-xxx-xx-xxx.us-xxxx-x.compute.internal/xxx.xxx.xx.xxx
Start Time: Wed, 06 Apr 2022 12:47:26 +0000
Labels: app=torchscript-cifar-predictor-default-00001
component=predictor
istio.io/rev=default
pod-template-hash=579fdc9658
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=torchscript-cifar-predictor-default
service.istio.io/canonical-revision=torchscript-cifar-predictor-default-00001
serving.knative.dev/configuration=torchscript-cifar-predictor-default
serving.knative.dev/configurationGeneration=1
serving.knative.dev/configurationUID=bc0ebece-d60b-4e50-a830-1bc893088a4a
serving.knative.dev/revision=torchscript-cifar-predictor-default-00001
serving.knative.dev/revisionUID=e3bb9a41-5ab1-42fe-9955-5cbbf6d9b609
serving.knative.dev/service=torchscript-cifar-predictor-default
serving.knative.dev/serviceUID=25b5d802-3a0e-4a6f-8b64-aa70bf0d89a3
serving.kubeflow.org/inferenceservice=torchscript-cifar
Annotations: autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/minScale: 1
internal.serving.kubeflow.org/storage-initializer-sourceuri: gs://kfserving-examples/models/torchscript
kubernetes.io/psp: eks.privileged
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
serving.knative.dev/creator: system:serviceaccount:kubeflow:default
sidecar.istio.io/status:
{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status: Running
IP: xxx.xxx.xx.xxx
IPs:
IP: xxx.xxx.xx.xxx
Controlled By: ReplicaSet/torchscript-cifar-predictor-default-00001-deployment-579fdc9658
Init Containers:
storage-initializer:
Container ID: docker://d55da9dde917af69abcaadaa28464351257bb51b0b7c3f133e7913c12b294eb4
Image: kfserving/storage-initializer:v0.6.1
Image ID: docker-pullable://kfserving/storage-initializer@sha256:81b654574aec1dbce37d6e5b8d5d58b5d820fdf17e35b04cd52b1c805595fc37
Port: <none>
Host Port: <none>
Args:
gs://kfserving-examples/models/torchscript
/mnt/models
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 06 Apr 2022 12:47:27 +0000
Finished: Wed, 06 Apr 2022 12:47:28 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 100m
memory: 100Mi
Environment: <none>
Mounts:
/mnt/models from kfserving-provision-location (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
istio-init:
Container ID: docker://5ec712c6062b13d3517d600395dbce6e178fa9fe6775b0a4014159be96596f7d
Image: docker.io/istio/proxyv2:1.9.6
Image ID: docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
Port: <none>
Host Port: <none>
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
-b
*
-d
15090,15021,15020
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 06 Apr 2022 12:47:29 +0000
Finished: Wed, 06 Apr 2022 12:47:29 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
Containers:
kfserving-container:
Container ID: docker://94d0a87961741aae936b52f45c709a86f7c99dd78359b761009b9bf8d5b1eaed
Image: nvcr.io/nvidia/tritonserver@sha256:6a1ef84e93327fef53ac11379fa4e53082c9e228de5db24e0abdfc5aff5676cf
Image ID: docker-pullable://nvcr.io/nvidia/tritonserver@sha256:6a1ef84e93327fef53ac11379fa4e53082c9e228de5db24e0abdfc5aff5676cf
Port: 8080/TCP
Host Port: 0/TCP
Args:
tritonserver
--model-store=/mnt/models
--grpc-port=9000
--http-port=8080
--allow-grpc=true
--allow-http=true
State: Running
Started: Wed, 06 Apr 2022 12:50:37 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 2Gi
Requests:
cpu: 1
memory: 2Gi
Environment:
OMP_NUM_THREADS: 1
PORT: 8080
K_REVISION: torchscript-cifar-predictor-default-00001
K_CONFIGURATION: torchscript-cifar-predictor-default
K_SERVICE: torchscript-cifar-predictor-default
Mounts:
/mnt/models from kfserving-provision-location (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
queue-proxy:
Container ID: docker://088fcd1e9e419ea6878f4140bca29e094a038f02b2e77a96636453f022476ea2
Image: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:0b8e031170354950f3395876961452af1c62f7ab5161c9e71867392c11881962
Image ID: docker-pullable://gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:0b8e031170354950f3395876961452af1c62f7ab5161c9e71867392c11881962
Ports: 8022/TCP, 9090/TCP, 9091/TCP, 8012/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP
State: Running
Started: Wed, 06 Apr 2022 12:50:37 +0000
Ready: True
Restart Count: 0
Requests:
cpu: 25m
Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=1s #success=1 #failure=3
Startup: exec [/ko-app/queue -probe-timeout 10m0s] delay=0s timeout=600s period=1s #success=1 #failure=1
Environment:
SERVING_NAMESPACE: dev
SERVING_SERVICE: torchscript-cifar-predictor-default
SERVING_CONFIGURATION: torchscript-cifar-predictor-default
SERVING_REVISION: torchscript-cifar-predictor-default-00001
QUEUE_SERVING_PORT: 8012
CONTAINER_CONCURRENCY: 0
REVISION_TIMEOUT_SECONDS: 300
SERVING_POD: torchscript-cifar-predictor-default-00001-deployment-579fdpzzbj (v1:metadata.name)
SERVING_POD_IP: (v1:status.podIP)
SERVING_LOGGING_CONFIG:
SERVING_LOGGING_LEVEL:
SERVING_REQUEST_LOG_TEMPLATE: {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}
SERVING_ENABLE_REQUEST_LOG: false
SERVING_REQUEST_METRICS_BACKEND: prometheus
TRACING_CONFIG_BACKEND: none
TRACING_CONFIG_ZIPKIN_ENDPOINT:
TRACING_CONFIG_STACKDRIVER_PROJECT_ID:
TRACING_CONFIG_DEBUG: false
TRACING_CONFIG_SAMPLE_RATE: 0.1
USER_PORT: 8080
SYSTEM_NAMESPACE: knative-serving
METRICS_DOMAIN: knative.dev/internal/serving
SERVING_READINESS_PROBE: {"tcpSocket":{"port":8080,"host":"127.0.0.1"},"successThreshold":1}
ENABLE_PROFILING: false
SERVING_ENABLE_PROBE_REQUEST_LOG: false
METRICS_COLLECTOR_ADDRESS:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
istio-proxy:
Container ID: docker://671ba8badeeb542c45bebb70d2f6904a8c3c568279bb5d36cddd25fa1803b6b0
Image: docker.io/istio/proxyv2:1.9.6
Image ID: docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--serviceCluster
torchscript-cifar-predictor-default-00001.$(POD_NAMESPACE)
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
--concurrency
2
State: Running
Started: Wed, 06 Apr 2022 12:50:38 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
Environment:
JWT_POLICY: third-party-jwt
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: torchscript-cifar-predictor-default-00001-deployment-579fdpzzbj (v1:metadata.name)
POD_NAMESPACE: dev (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
CANONICAL_SERVICE: (v1:metadata.labels['service.istio.io/canonical-name'])
CANONICAL_REVISION: (v1:metadata.labels['service.istio.io/canonical-revision'])
PROXY_CONFIG: {"tracing":{}}
ISTIO_META_POD_PORTS: [
{"name":"user-port","containerPort":8080,"protocol":"TCP"}
,{"name":"http-queueadm","containerPort":8022,"protocol":"TCP"}
,{"name":"http-autometric","containerPort":9090,"protocol":"TCP"}
,{"name":"http-usermetric","containerPort":9091,"protocol":"TCP"}
,{"name":"queue-port","containerPort":8012,"protocol":"TCP"}
]
ISTIO_META_APP_CONTAINERS: kfserving-container,queue-proxy
ISTIO_META_CLUSTER_ID: Kubernetes
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_METAJSON_ANNOTATIONS: {"autoscaling.knative.dev/class":"kpa.autoscaling.knative.dev","autoscaling.knative.dev/minScale":"1","internal.serving.kubeflow.org/storage-initializer-sourceuri":"gs://kfserving-examples/models/torchscript","kubernetes.io/psp":"eks.privileged","serving.knative.dev/creator":"system:serviceaccount:kubeflow:default"}
ISTIO_META_WORKLOAD_NAME: torchscript-cifar-predictor-default-00001-deployment
ISTIO_META_OWNER: kubernetes://apis/apps/v1/namespaces/dev/deployments/torchscript-cifar-predictor-default-00001-deployment
ISTIO_META_MESH_ID: cluster.local
TRUST_DOMAIN: cluster.local
ISTIO_KUBE_APP_PROBERS: {"/app-health/queue-proxy/readyz":{"httpGet":{"path":"/","port":8012,"scheme":"HTTP","httpHeaders":[{"name":"K-Network-Probe","value":"queue"}]},"timeoutSeconds":1}}
Mounts:
/etc/istio/pod from istio-podinfo (rw)
/etc/istio/proxy from istio-envoy (rw)
/var/lib/istio/data from istio-data (rw)
/var/run/secrets/istio from istiod-ca-cert (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
/var/run/secrets/tokens from istio-token (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
limits.cpu -> cpu-limit
requests.cpu -> cpu-request
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
kube-api-access-j5h5h:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
kfserving-provision-location:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m20s default-scheduler Successfully assigned dev/torchscript-cifar-predictor-default-00001-deployment-579fdpzzbj to ip-xxx.xxx.xx.xxx.us-xxxx-x.compute.internal
Normal Pulled 4m19s kubelet Container image "kfserving/storage-initializer:v0.6.1" already present on machine
Normal Created 4m19s kubelet Created container storage-initializer
Normal Started 4m19s kubelet Started container storage-initializer
Normal Pulling 4m17s kubelet Pulling image "docker.io/istio/proxyv2:1.9.6"
Normal Pulled 4m17s kubelet Successfully pulled image "docker.io/istio/proxyv2:1.9.6" in 223.710995ms
Normal Created 4m17s kubelet Created container istio-init
Normal Started 4m17s kubelet Started container istio-init
Normal Pulling 4m16s kubelet Pulling image "nvcr.io/nvidia/tritonserver@sha256:6a1ef84e93327fef53ac11379fa4e53082c9e228de5db24e0abdfc5aff5676cf"
Normal Pulled 85s kubelet Successfully pulled image "nvcr.io/nvidia/tritonserver@sha256:6a1ef84e93327fef53ac11379fa4e53082c9e228de5db24e0abdfc5aff5676cf" in 2m51.7544345s
Normal Created 69s kubelet Created container kfserving-container
Normal Started 69s kubelet Started container kfserving-container
Normal Pulled 69s kubelet Container image "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:0b8e031170354950f339562" already present on machine
Normal Created 69s kubelet Created container queue-proxy
Normal Started 69s kubelet Started container queue-proxy
Normal Pulling 69s kubelet Pulling image "docker.io/istio/proxyv2:1.9.6"
Normal Pulled 68s kubelet Successfully pulled image "docker.io/istio/proxyv2:1.9.6" in 232.847416ms
Normal Created 68s kubelet Created container istio-proxy
Normal Started 68s kubelet Started container istio-proxy
@rrrkharse Can you share your output from describing the web app pod, and also the KFServing version that is deployed in 1.4.1?
Pod description
$ kubectl describe pod -n kubeflow kfserving-models-web-app-7884f597cf-cllp5
Name: kfserving-models-web-app-7884f597cf-cllp5
Namespace: kubeflow
Priority: 0
Node: ip-192-168-29-17.us-west-2.compute.internal/192.168.29.17
Start Time: Wed, 23 Mar 2022 14:47:12 -0700
Labels: app=kfserving
app.kubernetes.io/component=kfserving-models-web-app
app.kubernetes.io/name=kfserving
istio.io/rev=default
kustomize.component=kfserving-models-web-app
pod-template-hash=7884f597cf
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=kfserving
service.istio.io/canonical-revision=latest
Annotations: kubectl.kubernetes.io/default-logs-container: kfserving-models-web-app
kubernetes.io/psp: eks.privileged
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/inject: true
sidecar.istio.io/status:
{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status: Running
IP: 192.168.27.173
IPs:
IP: 192.168.27.173
Controlled By: ReplicaSet/kfserving-models-web-app-7884f597cf
Init Containers:
istio-init:
Container ID: docker://487be96f9cc0d464ca525119b8d086dfa9f372c6228330431ce0477aaaebd522
Image: docker.io/istio/proxyv2:1.9.6
Image ID: docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
Port: <none>
Host Port: <none>
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
-b
*
-d
15090,15021,15020
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 26 Mar 2022 19:03:54 -0700
Finished: Sat, 26 Mar 2022 19:03:54 -0700
Ready: True
Restart Count: 1
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lsqmk (ro)
Containers:
kfserving-models-web-app:
Container ID: docker://f797d131d45fd0a0ac48337695c2ccefbb1458bd2c24aaaa8fe0bb20383e8e8c
Image: kfserving/models-web-app:v0.6.1
Image ID: docker-pullable://kfserving/models-web-app@sha256:f322d6ffdfa3b5a6fbbe3910c4cf4b5ad85871f20dbd2ed854d65f90e1fd8aba
Port: 5000/TCP
Host Port: 0/TCP
State: Running
Started: Sat, 26 Mar 2022 19:03:56 -0700
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Wed, 23 Mar 2022 14:47:26 -0700
Finished: Sat, 26 Mar 2022 19:01:54 -0700
Ready: True
Restart Count: 1
Environment Variables from:
kfserving-models-web-app-config-mtgm8bbd98 ConfigMap Optional: false
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lsqmk (ro)
istio-proxy:
Container ID: docker://bd1285c601c575b262a649855d605aa38faac634e4e1dd5bbe9f66bb8d700f2f
Image: docker.io/istio/proxyv2:1.9.6
Image ID: docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--serviceCluster
kfserving.$(POD_NAMESPACE)
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
--concurrency
2
State: Running
Started: Sat, 26 Mar 2022 19:03:57 -0700
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Wed, 23 Mar 2022 14:47:27 -0700
Finished: Sat, 26 Mar 2022 19:01:54 -0700
Ready: True
Restart Count: 1
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
Environment:
JWT_POLICY: third-party-jwt
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: kfserving-models-web-app-7884f597cf-cllp5 (v1:metadata.name)
POD_NAMESPACE: kubeflow (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
CANONICAL_SERVICE: (v1:metadata.labels['service.istio.io/canonical-name'])
CANONICAL_REVISION: (v1:metadata.labels['service.istio.io/canonical-revision'])
PROXY_CONFIG: {"tracing":{}}
ISTIO_META_POD_PORTS: [
{"containerPort":5000,"protocol":"TCP"}
]
ISTIO_META_APP_CONTAINERS: kfserving-models-web-app
ISTIO_META_CLUSTER_ID: Kubernetes
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_METAJSON_ANNOTATIONS: {"kubernetes.io/psp":"eks.privileged","sidecar.istio.io/inject":"true"}
ISTIO_META_WORKLOAD_NAME: kfserving-models-web-app
ISTIO_META_OWNER: kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/kfserving-models-web-app
ISTIO_META_MESH_ID: cluster.local
TRUST_DOMAIN: cluster.local
Mounts:
/etc/istio/pod from istio-podinfo (rw)
/etc/istio/proxy from istio-envoy (rw)
/var/lib/istio/data from istio-data (rw)
/var/run/secrets/istio from istiod-ca-cert (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lsqmk (ro)
/var/run/secrets/tokens from istio-token (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
limits.cpu -> cpu-limit
requests.cpu -> cpu-request
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
kube-api-access-lsqmk:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Kserve version from this commit: https://github.com/kubeflow/manifests/tree/ca48b5eb6b77e74c957bf4522e47ae82ed78cc4a/apps/kfserving/upstream/overlays/kubeflow
Any update @goswamig @rrrkharse ?
Hey @AlexandreBrown, what @rrrkharse has running has a different UI than what you're running.
If you noticed, @rrrkharse does not have a Metrics tab. Are you running the same version as @rrrkharse with the same instructions?
@goswamig I am running the same setup. The Metrics tab is available once you also set up Grafana. In our previous installation without Cognito (1.4.1 manifests), the Grafana setup didn't interfere with the Logs tab, therefore I doubt this is the issue. https://www.kubeflow.org/docs/external-add-ons/kserve/webapp/#metrics
- Download the monitoring-core.yaml and monitoring-metrics-prometheus.yaml files from the Knative 0.18 release (https://github.com/knative/serving/releases/tag/v0.18.0)
- kubectl apply -f monitoring-core.yaml
- kubectl apply -f monitoring-metrics-prometheus.yaml
- kubectl apply -f grafana-virtual-service.yaml
- kubectl apply -f grafana-config-map.yaml
- kubectl apply -f models-web-app-aut-policy.yaml
- kubectl rollout restart deployment/grafana -n knative-monitoring
@goswamig @rrrkharse I am using this commit from the aws main branch.
AWS_MANIFESTS_BUILD=038015b47104dcff5b0c596893512e4e75f5a7bd
That's maybe the only difference I could see.
Could you try to reproduce on this aws main branch commit? This would help a lot in troubleshooting the source of the issue.
@AlexandreBrown, thanks for providing more details, we will try to reproduce it. Sorry for closing the issue inadvertently.
Is there a way you can delete and re-deploy the web-app deployment? I tried today and couldn't reproduce it.
I deleted the deployment and re-launched the Kubeflow install so it recreates it; still the same issue :(
Do you still have the same error in the web-app deployment pod? Can you send the full log from the web-app deployment again?
I believe you can see the logs in the console using kubectl logs ..., but they are failing to load in the web app, right? So I am suspecting that the web-app pod is bailing out in your case on a missing pod label; at least that was the error in the previous log which you posted.
@goswamig Yes still the same error from the kfserving model web app.
Here are the logs (last 30 lines):
logs.txt
I believe you can see the log of in console using kubectl logs .... but its failing to load in web-app, right?
Exactly, I can see the InferenceService logs when using Kubectl but not via the model web app UI.
...
EDIT: I'm not getting any log updates when making a request. What I mean is, the logs I see are from when the model server was first spawned, but when sending requests to the model server, kubectl logs ... -c kfserving-container doesn't show anything new.
Basically what I'm trying to say is that the kubectl logs of the kfserving-container are not updated, so maybe it's not working via the CLI after all (meaning the issue is perhaps not just with the models web app UI).
kubectl logs ... -c istio-proxy shows logs about requests (timestamps, id, status code), but that's not the type of logs reflected by the models web app anyway.
@AlexandreBrown but previously you mentioned here that you can see the logs in the terminal by running the command
kubectl logs -n dev sklearn-iris-predictor-default-00001-deployment-5558697cb-t8q7b -c kfserving-container
Basically this is the log you'll see in the web app. Did anything change since then?
@goswamig Nothing has changed. I previously mentioned that I was seeing the logs, but the result was like the photos from my last message.
So the model server logs are indeed accessible, but they only show what is printed when the model server successfully becomes alive. I had not tried to look at the logs after sending requests; I tried today and then realized that the logs I thought were normal from kubectl were actually not showing any activity past initialization.
So yes, kubectl logs shows something, but this something does not change past the initialization logs.
So I tried again today and the logs we see are as follows; this is the same thing you should see in the pod.
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/randomized_l1.py:580: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
eps=4 * np.finfo(np.float).eps, n_jobs=None,
[I 220424 02:03:46 kfserver:151] Registering model: sklearn-iris
[I 220424 02:03:46 kfserver:121] Setting asyncio max_workers as 5
[I 220424 02:03:46 kfserver:128] Listening on port 8080
[I 220424 02:03:46 kfserver:130] Will fork 1 workers
[I 220429 21:15:35 web:2243] 200 POST /v1/models/sklearn-iris:predict (127.0.0.1) 1.38ms
[I 220429 21:15:50 web:2243] 200 POST /v1/models/sklearn-iris:predict (127.0.0.1) 0.84ms
[I 220429 21:16:00 web:2243] 200 POST /v1/models/sklearn-iris:predict (127.0.0.1) 0.76ms
[I 220429 21:17:50 web:2243] 200 POST /v1/models/sklearn-iris:predict (127.0.0.1) 0.74ms
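For reference, those 200 POST lines come from test prediction requests. A rough sketch of sending one yourself is below; the pod name, the port-forward, and the payload are assumptions based on the standard sklearn-iris sample (e.g. after running kubectl port-forward -n dev <predictor-pod> 8080:8080), not steps from this thread.

# Rough sketch: send a prediction to the sklearn-iris predictor, then re-run
# `kubectl logs ... -c kfserving-container` to see whether a new request line appears.
import requests

url = "http://localhost:8080/v1/models/sklearn-iris:predict"   # via the port-forward above
payload = {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}

resp = requests.post(url, json=payload)
print(resp.status_code, resp.json())  # expect 200 and a "predictions" field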
@goswamig My logs are different from yours.
Mine stop at the following:
I0428 02:43:54.056516 1 grpc_server.cc:3983] Started GRPCInferenceService at 0.0.0.0:9000
I0428 02:43:54.056751 1 http_server.cc:2717] Started HTTPService at 0.0.0.0:8080
I0428 02:43:54.097922 1 http_server.cc:2736] Started Metrics Service at 0.0.0.0:8002
Then nothing new is logged even after new requests.
So perhaps this suggests an issue that is deeper than just the Kubeflow UI.
Yeah, so are you able to get the response back?
@goswamig Yes the inference works, we can query the model server and get a response back (HTTP 200) but the logs are not updated. When we run kubectl logs, nothing changes from the logs in my last message, even after thousands of requests.
Hi Alex, does this issue still exist?
Do you face this issue with 1.5 as well? I have been using 1.5 for the past few months and the logs show up on the dashboard.
Hi @surajkota , the issue remains for us with our current KF 1.4.1 install.
I have not tried inference with KF 1.5, will report back after trying it.
Just tested Kubeflow 1.6.1 using the AWS repo on a test cluster (rds-s3-cognito) and I can see the logs.
Closing this issue as a "cannot reproduce".
Thanks