Knative service status stuck at "Unknown" and "Uninitialized"
What version of Knative?
1.17.0
net-istio: 1.17.0
istio: 1.24.2
Expected Behavior
The knative service status should be “Ready” instead of hanging in “Unknown” state.
Actual Behavior
When I deploy a Knative service in an EKS cluster, it remains in "Unknown" status until the Istio ingress controllers are restarted, even though the application can be reached. It then switches to "Ready", the next application deployed is again stuck in "Unknown", and so on.
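For reference, "restarted" here means something like the rollout below (the deployment names are assumed to match the Helm release names from the setup further down):
kubectl -n istio-system rollout restart deployment \
  istio-ingressgateway istio-internal-ingressgateway istio-external-ingressgateway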
kubectl get ksvc gsvc-serving-db07943b -n eb7d5189
NAME URL LATESTCREATED LATESTREADY READY REASON
gsvc-serving-db07943b http://test-eb7d5189.serverless-dev.xyz.crashcourse.com gsvc-serving-db07943b-00001 gsvc-serving-db07943b-00001 Unknown Uninitialized
The application is exposed through a load balancer and is reachable:
curl https://test-eb7d5189.serverless-dev.xyz.crashcourse.com
Hello World!
Here are the details of the knative service status:
kubectl get ksvc -n eb7d5189 gsvc-serving-db07943b -ojsonpath='{.status}' | jq
{
"address": {
"url": "http://gsvc-serving-db07943b.eb7d5189.svc.cluster.local"
},
"conditions": [
{
"lastTransitionTime": "2025-02-06T13:43:16Z",
"status": "True",
"type": "ConfigurationsReady"
},
{
"lastTransitionTime": "2025-02-06T13:43:16Z",
"message": "Waiting for load balancer to be ready",
"reason": "Uninitialized",
"status": "Unknown",
"type": "Ready"
},
{
"lastTransitionTime": "2025-02-06T13:43:16Z",
"message": "Waiting for load balancer to be ready",
"reason": "Uninitialized",
"status": "Unknown",
"type": "RoutesReady"
}
],
"latestCreatedRevisionName": "gsvc-serving-db07943b-00001",
"latestReadyRevisionName": "gsvc-serving-db07943b-00001",
"observedGeneration": 1,
"traffic": [
{
"latestRevision": true,
"percent": 100,
"revisionName": "gsvc-serving-db07943b-00001"
}
],
"url": "http://test-eb7d5189.serverless-dev.xyz.crashcourse.com"
}
For load balancing I use an AWS NLB, and everything seems fine on that side: all the targets (15021, 443, 80) are healthy.
I also noticed a couple of log entries that are probably related to the issue.
{
"severity": "ERROR",
"timestamp": "2025-02-06T13:25:22.20239663Z",
"logger": "net-istio-controller.istio-ingress-controller",
"caller": "status/status.go:421",
"message": "Probing of https://test-eb7d5189.serverless-dev.xyz.crashcourse.com:443 failed, IP: 100.64.174.122:443, ready: false, error: error roundtripping https://test-eb7d5189.serverless-dev.xyz.crashcourse.com:443/healthz: read tcp 100.64.162.193:36768->100.64.174.122:443: read: connection reset by peer (depth: 0)",
"commit": "4dff29e-dirty",
"knative.dev/controller": "knative.dev.net-istio.pkg.reconciler.ingress.Reconciler",
"knative.dev/kind": "networking.internal.knative.dev.Ingress",
"knative.dev/traceid": "de3b3adb-b689-4a4c-b4d4-39c22a0911ba",
"knative.dev/key": "eb7d5189/gsvc-serving-db07943b",
"stacktrace": "knative.dev/networking/pkg/status.(*Prober).processWorkItem\n\tknative.dev/[email protected]/pkg/status/status.go:421\nknative.dev/networking/pkg/status.(*Prober).Start.func1\n\tknative.dev/[email protected]/pkg/status/status.go:306"
}
and
{
"severity": "WARNING",
"timestamp": "2025-02-06T13:25:22.997317236Z",
"logger": "controller",
"caller": "route/reconcile_resources.go:227",
"message": "Failed to update k8s service",
"commit": "6265a8e",
"knative.dev/pod": "controller-85c449cd99-97hgw",
"knative.dev/controller": "knative.dev.serving.pkg.reconciler.route.Reconciler",
"knative.dev/kind": "serving.knative.dev.Route",
"knative.dev/traceid": "e6e9c300-658b-4feb-8b59-e2bf6fa95bd1",
"knative.dev/key": "eb7d5189/gsvc-serving-db07943b",
"error": "failed to fetch loadbalancer domain/IP from ingress status"
}
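For what it's worth, the failing probe can be approximated by hand against the gateway pod IP from the error above. The headers below are what net-istio sends as far as I know, so treat this as an illustration rather than the exact probe:
curl -kv https://100.64.174.122:443/healthz \
  -H 'Host: test-eb7d5189.serverless-dev.xyz.crashcourse.com' \
  -H 'K-Network-Probe: probe' \
  -H 'User-Agent: Knative-Ingress-Probe'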
I'd also like to point out that I looked at the route and the ingress; their statuses are as follows.
route
{
"address": {
"url": "http://gsvc-serving-db07943b.eb7d5189.svc.cluster.local"
},
"conditions": [
{
"lastTransitionTime": "2025-02-06T13:43:16Z",
"status": "True",
"type": "AllTrafficAssigned"
},
{
"lastTransitionTime": "2025-02-06T15:03:04Z",
"message": "Certificate route-55cecb6b-26fc-4fda-8e42-1d7d9c8fdd2b is not ready downgrade HTTP.",
"reason": "HTTPDowngrade",
"status": "True",
"type": "CertificateProvisioned"
},
{
"lastTransitionTime": "2025-02-06T13:43:16Z",
"message": "Waiting for load balancer to be ready",
"reason": "Uninitialized",
"status": "Unknown",
"type": "IngressReady"
},
{
"lastTransitionTime": "2025-02-06T13:43:16Z",
"message": "Waiting for load balancer to be ready",
"reason": "Uninitialized",
"status": "Unknown",
"type": "Ready"
}
],
"observedGeneration": 1,
"traffic": [
{
"latestRevision": true,
"percent": 100,
"revisionName": "gsvc-serving-db07943b-00001"
}
],
"url": "http://test-eb7d5189.serverless-dev.xyz.crashcourse.com"
}
ingress
{
"conditions": [
{
"lastTransitionTime": "2025-02-06T13:43:16Z",
"message": "Waiting for load balancer to be ready",
"reason": "Uninitialized",
"status": "Unknown",
"type": "LoadBalancerReady"
},
{
"lastTransitionTime": "2025-02-06T13:43:16Z",
"status": "True",
"type": "NetworkConfigured"
},
{
"lastTransitionTime": "2025-02-06T13:43:16Z",
"message": "Waiting for load balancer to be ready",
"reason": "Uninitialized",
"status": "Unknown",
"type": "Ready"
}
],
"observedGeneration": 1
}
Strange findings
Logs from istiod
2025-02-03T10:57:39.622954Z info ads Push debounce stable[112] 1 for config Secret/eb7d5189/gsvc-pull-2f42fda6-serving-f61c3f70: 100.240948ms since last change, 100.240879ms since last push, full=false
2025-02-03T10:57:39.732479Z info model Incremental push, service gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
2025-02-03T10:57:39.756497Z info model Full push, new service eb7d5189/gsvc-serving-db07943b-00001.eb7d5189.svc.cluster.local
2025-02-03T10:57:39.924255Z info ads Push debounce stable[113] 5 for config ServiceEntry/eb7d5189/gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local and 1 more configs: 100.548746ms since last change, 200.632268ms since last push, full=true
"outbound|443||gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local": {},
"outbound|8012||gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local": {},
"outbound|8022||gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local": {},
"outbound|80||gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local": {},
"outbound|9090||gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local": {},
"outbound|9091||gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local": {}
2025-02-03T10:58:02.542487Z info model Full push, new service eb7d5189/gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local
2025-02-03T10:58:02.722779Z info ads Push debounce stable[114] 3 for config ServiceEntry/eb7d5189/gsvc-serving-db07943b-00001-private.eb7d5189.svc.cluster.local and 1 more configs: 100.661715ms since last change, 180.224615ms since last push, full=true
2025-02-03T10:58:02.961683Z info ads Push debounce stable[115] 3 for config ServiceEntry/eb7d5189/gsvc-serving-db07943b.eb7d5189.svc.cluster.local and 2 more configs: 100.299834ms since last change, 160.890523ms since last push, full=true
Services before ingress-controller restart
kubectl get svc -n eb7d5189
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
gsvc-serving-db07943b ExternalName <none> test-eb7d5189.serverless-dev.xyz.crashcourse.com 80/TCP 3h37m
gsvc-serving-db07943b-00001 ClusterIP 172.20.247.29 <none> 80/TCP,443/TCP 3h37m
gsvc-serving-db07943b-00001-private ClusterIP 172.20.132.196 <none> 80/TCP,443/TCP,9090/TCP,9091/TCP,8022/TCP,8012/TCP 3h37m
Services after ingress-controller restart
kubectl get svc -n eb7d5189
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
gsvc-serving-db07943b ExternalName <none> knative-local-gateway.istio-system.svc.cluster.local 80/TCP 3h37m
gsvc-serving-db07943b-00001 ClusterIP 172.20.247.29 <none> 80/TCP,443/TCP 3h37m
gsvc-serving-db07943b-00001-private ClusterIP 172.20.132.196 <none> 80/TCP,443/TCP,9090/TCP,9091/TCP,8022/TCP,8012/TCP 3h37m
The EXTERNAL-IP of the ExternalName service changed from test-eb7d5189.serverless-dev.xyz.crashcourse.com to knative-local-gateway.istio-system.svc.cluster.local.
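In other words, only after the restart does the placeholder Service look the way I would expect. A sketch with just the relevant fields:
apiVersion: v1
kind: Service
metadata:
  name: gsvc-serving-db07943b
  namespace: eb7d5189
spec:
  type: ExternalName
  externalName: knative-local-gateway.istio-system.svc.cluster.local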
Some tests
From a "Ready" service
curl -o /dev/null -s -w "%{http_code}\n" http://gsvc-serving-5fb72450-00001.eb7d5189.svc.cluster.local
200
curl -o /dev/null -s -w "%{http_code}\n" http://gsvc-serving-5fb72450.eb7d5189.svc.cluster.local
200
From the "Unknown" service
curl -o /dev/null -s -w "%{http_code}\n" http://gsvc-serving-db07943b-00001.eb7d5189.svc.cluster.local
200
curl -o /dev/null -s -w "%{http_code}\n" http://gsvc-serving-db07943b.eb7d5189.svc.cluster.local
404
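While the service is stuck, the failing probes can be watched directly in the net-istio controller (the deployment name is assumed to be the default net-istio-controller):
kubectl -n knative-serving logs deploy/net-istio-controller -f | grep 'Probing of'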
Steps to Reproduce the Problem
- Set up the Istio ingress controllers (this creates AWS NLBs, so you need the AWS Load Balancer Controller too)
- Install Knative
- Deploy a service
Ingress controllers
My setup has some particularities: I use 3 different ingress controllers, configured with the Helm values below.
helmCharts:
- includeCRDs: true
name: gateway
namespace: istio-system
releaseName: istio-ingressgateway
repo: https://istio-release.storage.googleapis.com/charts
version: 1.24.2
valuesInline:
service:
type: ClusterIP
- includeCRDs: true
name: gateway
namespace: istio-system
releaseName: istio-internal-ingressgateway
repo: https://istio-release.storage.googleapis.com/charts
version: 1.24.2
valuesInline:
service:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internal
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
podAnnotations:
proxy.istio.io/config: |
{
"gatewayTopology": {
"proxyProtocol": {}
}
}
labels:
app: istio-internal-ingressgateway
istio: ingressgateway
- includeCRDs: true
name: gateway
namespace: istio-system
releaseName: istio-external-ingressgateway
repo: https://istio-release.storage.googleapis.com/charts
version: 1.24.2
valuesInline:
service:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
podAnnotations:
proxy.istio.io/config: |
{
"gatewayTopology": {
"proxyProtocol": {}
}
}
labels:
app: istio-external-ingressgateway
istio: ingressgateway
In case you're wondering, I use the proxy config to preserve client source IPs, which I then match in AuthorizationPolicy resources (see the sketch below).
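Hypothetical example of that kind of policy; the policy name and CIDR are made up, only the remoteIpBlocks matching matters:
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-from-office   # hypothetical name
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istio-external-ingressgateway
  action: ALLOW
  rules:
  - from:
    - source:
        # only meaningful when the client IP is preserved via proxy protocol
        remoteIpBlocks:
        - 203.0.113.0/24    # made-up CIDR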
Knative
I deploy Knative using the knative-operator as follows:
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
name: knative-serving
namespace: knative-serving
annotations:
gladiator.app/name: knative-operator
spec:
version: "1.17"
high-availability:
replicas: 2
config:
istio:
gateway.knative-serving.knative-external-ingress-gateway: istio-external-ingressgateway.istio-system.svc.cluster.local
gateway.knative-serving.knative-internal-ingress-gateway: istio-internal-ingressgateway.istio-system.svc.cluster.local
defaults:
max-revision-timeout-seconds: "3600"
revision-timeout-seconds: "1800"
revision-response-start-timeout-seconds: '600'
autoscaler:
allow-zero-initial-scale: "true"
enable-scale-to-zero: "true"
initial-scale: "0"
deployment:
progress-deadline: "3600s"
features:
autodetect-http2: enabled
kubernetes.containerspec-addcapabilities: disabled
kubernetes.podspec-affinity: enabled
kubernetes.podspec-dnsconfig: disabled
kubernetes.podspec-dnspolicy: disabled
kubernetes.podspec-dryrun: allowed
kubernetes.podspec-fieldref: disabled
kubernetes.podspec-hostaliases: disabled
kubernetes.podspec-init-containers: enabled
kubernetes.podspec-nodeselector: enabled
kubernetes.podspec-persistent-volume-claim: enabled
kubernetes.podspec-persistent-volume-write: enabled
kubernetes.podspec-priorityclassname: disabled
kubernetes.podspec-runtimeclassname: enabled
kubernetes.podspec-schedulername: disabled
kubernetes.podspec-securitycontext: enabled
kubernetes.podspec-tolerations: enabled
kubernetes.podspec-topologyspreadconstraints: disabled
kubernetes.podspec-volumes-emptydir: enabled
kubernetes.podspec-volumes-hostpath: enabled
multi-container: enabled
queueproxy.mount-podinfo: disabled
tag-header-based-routing: disabled
multi-container-probing: enabled
gc:
min-non-active-revisions: "0"
max-non-active-revisions: "0"
retain-since-create-time: "disabled"
retain-since-last-active-time: "disabled"
leader-election:
lease-duration: 60s
logging:
loglevel.activator: info
loglevel.autoscaler: info
loglevel.controller: info
loglevel.hpaautoscaler: info
loglevel.net-certmanager-controller: info
loglevel.net-contour-controller: info
loglevel.net-istio-controller: info
loglevel.queueproxy: info
loglevel.webhook: info
network:
auto-tls: Disabled
domain-template: '{{index .Annotations "service.serverless.xyz.crashcourse.com/hostname"}}.{{.Domain}}'
ingress-class: "istio.ingress.networking.knative.dev"
observability:
logging.enable-probe-request-log: "true"
logging.enable-request-log: "true"
logging.request-log-template: >-
{"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js
.Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}",
"status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}",
"userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js
.Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js
.Request.Referer}}", "latency": {{.Response.Latency}}, "latencyNew":
{{.Response.Latency}}, "protocol": "{{.Request.Proto}}"}, "traceId":
"{{index .Request.Header "X-B3-Traceid"}}"}
metrics.backend-destination: prometheus
metrics.request-metrics-backend-destination: prometheus
tracing:
backend: none
The domain-template is tied to an operator we run internally, so you can ignore it.
Knative Service
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
name: gsvc-serving-db07943b
namespace: eb7d5189
annotations:
gladiator/url-prefix: test-
service.serverless.xyz.crashcourse.com/endpoint: test-eb7d5189
service.serverless.xyz.crashcourse.com/hostname: test-eb7d5189
labels:
app.kubernetes.io/component: serving
app.kubernetes.io/part-of: service
serverless.xyz.crashcourse.com/service-name: test
service.serverless.xyz.crashcourse.com/endpoint: test-eb7d5189
spec:
template:
metadata:
annotations:
autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/max-scale: "1"
autoscaling.knative.dev/metric: concurrency
autoscaling.knative.dev/min-scale: "1"
spec:
containers:
- image: ghcr.io/knative/helloworld-go:latest
name: hello
ports:
- containerPort: 8080
env:
- name: TARGET
value: "World"
The same goes for the annotations/labels: they are linked to that operator.
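For completeness, this is how I apply it and observe the symptom (the manifest filename is just a placeholder):
kubectl apply -f ksvc.yaml
kubectl wait ksvc/gsvc-serving-db07943b -n eb7d5189 --for=condition=Ready --timeout=300s
# times out while the service is stuck in Unknown/Uninitialized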
Hi @hyde404,
The EXTERNAL-IP of the ExternalName service changed from test-eb7d5189.serverless-dev.xyz.crashcourse.com to knative-local-gateway.istio-system.svc.cluster.local.
The ExternalName should point to Istio; it is used for different purposes, e.g. traffic splitting.
I haven't checked all the details yet, but is that ExternalName somehow being exposed on the AWS LB directly (due to your ingresses), or is Istio not picking up changes? Could you try a more standard approach, as in the Knative docs, as a smoke test?
Probing of https://test-eb7d5189.serverless-dev.xyz.crashcourse.com:443 failed, IP: 100.64.174.122:443, ready: false, error: error roundtripping https://test-eb7d5189.serverless-dev.xyz.crashcourse.com:443/healthz: read tcp 100.64.162.193:36768->100.64.174.122:443: read: connection reset by peer (depth: 0)
This is the reason you see the load balancer as not ready. I am wondering why HTTPS is used; which Istio mode do you use, mTLS?
Note: unfortunately I don't have an AWS cluster to test on, so I am guessing.
Hi @skonto,
Thanks for your reply!
The Istio gateway TLS mode I use is "simple", and the one set in the ingress controller is apparently mTLS (controlPlaneAuthPolicy: MUTUAL_TLS).
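To be concrete, the Gateway I point Knative at looks roughly like the sketch below; the TLS secret name is a placeholder:
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: knative-external-ingress-gateway
  namespace: knative-serving
spec:
  selector:
    app: istio-external-ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE                  # "simple" TLS termination at the gateway
      credentialName: wildcard-cert # placeholder secret name
    hosts:
    - '*.serverless-dev.xyz.crashcourse.com'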
I tried a more standard approach by installing Knative/Istio/net-istio following this piece of documentation and got the exact same result.
By removing
proxy.istio.io/config: |
{
"gatewayTopology": {
"proxyProtocol": {}
}
}
from podAnnotations, it works, but we lose the ability to keep the client source IP, which is not desirable.
A couple of combinations based on this proxy config have been tested, but with no luck.
It seems like a probe, maybe from net-istio, has issues. Moreover, we came across this feature request, which looks a lot like what we're facing right now.
Having the exact same issue when testing service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*" and "proxyProtocol": {}.
I finally managed to make it work by using these annotations
apiVersion: v1
kind: Service
metadata:
annotations:
...
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
service.beta.kubernetes.io/aws-load-balancer-scheme: internal
service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
...
(you can also add service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true, even though it works without it)
and this EnvoyFilter as well
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
annotations:
gladiator.app/name: istio-gateway
name: internal-proxy-protocol
namespace: istio-system
spec:
configPatches:
- applyTo: LISTENER
patch:
operation: MERGE
value:
listener_filters:
- name: envoy.filters.listener.proxy_protocol
typed_config:
'@type': type.googleapis.com/envoy.extensions.filters.listener.proxy_protocol.v3.ProxyProtocol
allow_requests_without_proxy_protocol: true
- name: envoy.filters.listener.tls_inspector
typed_config:
'@type': type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
workloadSelector:
labels:
app: istio-internal-ingressgateway
So I totally got rid of "proxyProtocol": {}
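With that in place the service flips to Ready on its own, which can be watched with:
kubectl get ksvc gsvc-serving-db07943b -n eb7d5189 -w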
Observed the same behavior with a Knative service in a Kubeflow 1.9.1 deployment (on-prem). Restarting the istio-system/istio-ingressgateway deployment makes the Knative service accessible, and the Knative service's ExternalName also changes to "knative-local-gateway.istio-system.svc.cluster.local".
Also observed in the istio-ingressgateway log: '"GET /healthz HTTP/1.1" 404 NR route_not_found - "-" 0 0 0 - "192.168.0.238" "Knative-Ingress-Probe"' until the pod is restarted.
Easily reproducible on my cluster by running the KServe sklearn-iris InferenceService example.
Update:
I think I figured out the problem. Since I set up a domain entry in the config-domain ConfigMap, I need to add 'auto-tls: Disabled' in config-network if the Knative service does not have HTTPS access set up. This can also be done per service by adding the annotation 'networking.knative.dev/disableAutoTLS: "true"' to the Knative Service manifest, as in the sketch below.
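A minimal sketch of the per-service variant (the service name and image are placeholders):
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service                                  # placeholder name
  annotations:
    networking.knative.dev/disableAutoTLS: "true"
spec:
  template:
    spec:
      containers:
      - image: ghcr.io/knative/helloworld-go:latest # any image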
Hi @hyde404, thanks for your update. I tested your approach and it is working fine.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.