panic in 2.3.0 related to bad ingress in GKE

Open • bizrad opened this issue 3 years ago • 11 comments

What happened: kube-state-metrics panics and fails to start when deployed to a custom namespace with a configuration based on the examples/standard directory.

I0103 12:34:28.368172       1 main.go:108] Using default resources
I0103 12:34:28.368270       1 types.go:136] Using all namespace
I0103 12:34:28.368278       1 main.go:133] metric allow-denylisting: Excluding the following lists that were on denylist:
W0103 12:34:28.368317       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0103 12:34:28.369031       1 main.go:247] Testing communication with server
I0103 12:34:28.375030       1 main.go:252] Running with Kubernetes cluster version: v1.21. git version: v1.21.5-gke.1302. git tree state: clean. commit: 639f3a74abf258418493e9b75f2f98a08da29733. platform: linux/amd64
I0103 12:34:28.375060       1 main.go:254] Communication with server successful
I0103 12:34:28.375263       1 main.go:210] Starting metrics server: [::]:8080
I0103 12:34:28.375364       1 metrics_handler.go:96] Autosharding disabled
I0103 12:34:28.375393       1 main.go:199] Starting kube-state-metrics self metrics server: [::]:8081
I0103 12:34:28.375535       1 main.go:66] levelinfomsgTLS is disabled.http2false
I0103 12:34:28.375611       1 main.go:66] levelinfomsgTLS is disabled.http2false
I0103 12:34:28.376853       1 builder.go:192] Active resources: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
W0103 12:34:28.379101       1 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
W0103 12:34:28.395480       1 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
W0103 12:34:28.410311       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0103 12:34:28.414499       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
E0103 12:34:28.421128       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 76 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1631020, 0x2685d50})
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x2540be400})
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1631020, 0x2685d50})
	/usr/local/go/src/runtime/panic.go:1038 +0x215
k8s.io/kube-state-metrics/v2/internal/store.ingressMetricFamilies.func6(0x70)
	/go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:136 +0x189
k8s.io/kube-state-metrics/v2/internal/store.wrapIngressFunc.func1({0x17f9880, 0xc000422da0})
	/go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:175 +0x49
k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x17f9880, 0xc000422da0})
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc00014ce40, {0x17f9880, 0xc000422da0})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:72 +0xd4
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Replace(0xc00014ce40, {0xc00056bc00, 0x3f, 0x1443e01}, {0xc000607be0, 0xc06cd935190aabc6})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:134 +0xa5
k8s.io/client-go/tools/cache.(*Reflector).syncWith(0xc0004be8c0, {0xc00056b800, 0x3f, 0x0}, {0xc0007de000, 0x9})
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:456 +0x98
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func1(0xc0004be8c0, 0xc0000fca20, 0xc0004a0480, 0xc000607d60)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:354 +0x7ab
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0004be8c0, 0xc0004a0480)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:361 +0x265
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:221 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f6244e784b0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000166280, {0x1a0dd20, 0xc0004d26e0}, 0x1, 0xc0004a0480)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0004be8c0, 0xc0004a0480)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:220 +0x1f8
created by k8s.io/kube-state-metrics/v2/internal/store.(*Builder).startReflector
	/go/src/k8s.io/kube-state-metrics/internal/store/builder.go:427 +0x2c8
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x14844a9]

goroutine 76 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x2540be400})
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x1631020, 0x2685d50})
	/usr/local/go/src/runtime/panic.go:1038 +0x215
k8s.io/kube-state-metrics/v2/internal/store.ingressMetricFamilies.func6(0x70)
	/go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:136 +0x189
k8s.io/kube-state-metrics/v2/internal/store.wrapIngressFunc.func1({0x17f9880, 0xc000422da0})
	/go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:175 +0x49
k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x17f9880, 0xc000422da0})
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc00014ce40, {0x17f9880, 0xc000422da0})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:72 +0xd4
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Replace(0xc00014ce40, {0xc00056bc00, 0x3f, 0x1443e01}, {0xc000607be0, 0xc06cd935190aabc6})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:134 +0xa5
k8s.io/client-go/tools/cache.(*Reflector).syncWith(0xc0004be8c0, {0xc00056b800, 0x3f, 0x0}, {0xc0007de000, 0x9})
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:456 +0x98
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func1(0xc0004be8c0, 0xc0000fca20, 0xc0004a0480, 0xc000607d60)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:354 +0x7ab
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0004be8c0, 0xc0004a0480)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:361 +0x265
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:221 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f6244e784b0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000166280, {0x1a0dd20, 0xc0004d26e0}, 0x1, 0xc0004a0480)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0004be8c0, 0xc0004a0480)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:220 +0x1f8
created by k8s.io/kube-state-metrics/v2/internal/store.(*Builder).startReflector
	/go/src/k8s.io/kube-state-metrics/internal/store/builder.go:427 +0x2c8

What you expected to happen: kube-state-metrics to start without error

How to reproduce it (as minimally and precisely as possible): Deploy kube-state-metrics to GKE 1.21.5 in a custom namespace (kubectl create namespace observability).

Full Kubernetes YAML to apply:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: observability
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: observability
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: observability
spec:
  progressDeadlineSeconds: 120
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      containers:
      - image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
        securityContext:
          runAsUser: 65534
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics

Anything else we need to know?:

Environment:

  • kube-state-metrics version: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:16:05Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5-gke.1302", GitCommit:"639f3a74abf258418493e9b75f2f98a08da29733", GitTreeState:"clean", BuildDate:"2021-10-21T21:35:48Z", GoVersion:"go1.16.7b7", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GCP Managed GKE
  • Other info:

bizrad • Jan 03 '22 12:01

I also see a very similar failure if I downgrade to 2.2.4 or 2.2.3:

E0103 12:40:13.394862       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 71 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x18101a0, 0x26c12f0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x86
panic(0x18101a0, 0x26c12f0)
	/usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/kube-state-metrics/v2/internal/store.ingressMetricFamilies.func6(0xc000806110, 0xc00055cc40)
	/go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:136 +0x192
k8s.io/kube-state-metrics/v2/internal/store.wrapIngressFunc.func1(0x19edb60, 0xc000806110, 0xc000ae9d00)
	/go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:175 +0x5c
k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:58

bizrad • Jan 03 '22 12:01

Hey :wave: thanks for reporting this. Do you know which ingress this happens for? Do you have any ingresses without a backend, or without a service backend? Looking at the code, that is the only thing that seems like it could be the root cause: https://github.com/kubernetes/kube-state-metrics/blob/e080c3ce73ad514254e38dccb37c93bec6b257ae/internal/store/ingress.go#L136
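
For illustration only, here is a minimal, hypothetical Go sketch of that failure mode. It is not the actual ingress.go code, and the cloud.google.com/BackendConfig names are made up: when a path backend is a Resource rather than a Service, Backend.Service is nil, and dereferencing it without a guard panics exactly like the trace above.

// Hypothetical sketch: iterate ingress paths the way a metric generator might,
// showing why a non-Service backend needs a nil check before using .Service.
package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    networkingv1 "k8s.io/api/networking/v1"
)

func main() {
    apiGroup := "cloud.google.com" // illustrative value, not from the reporter's cluster
    ing := networkingv1.Ingress{
        Spec: networkingv1.IngressSpec{
            Rules: []networkingv1.IngressRule{{
                Host: "example.com",
                IngressRuleValue: networkingv1.IngressRuleValue{
                    HTTP: &networkingv1.HTTPIngressRuleValue{
                        Paths: []networkingv1.HTTPIngressPath{{
                            Path: "/",
                            Backend: networkingv1.IngressBackend{
                                // Resource backend instead of a Service backend,
                                // so Backend.Service stays nil.
                                Resource: &corev1.TypedLocalObjectReference{
                                    APIGroup: &apiGroup,
                                    Kind:     "BackendConfig",
                                    Name:     "my-backend",
                                },
                            },
                        }},
                    },
                },
            }},
        },
    }

    for _, rule := range ing.Spec.Rules {
        if rule.HTTP == nil {
            continue
        }
        for _, p := range rule.HTTP.Paths {
            if p.Backend.Service == nil {
                // Without this guard, p.Backend.Service.Name below would
                // dereference a nil pointer, the same class of crash as the
                // panic reported in this issue.
                fmt.Printf("path %q has no Service backend; skipping\n", p.Path)
                continue
            }
            fmt.Printf("path %q -> service %q\n", p.Path, p.Backend.Service.Name)
        }
    }
}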

fpetkovski • Jan 03 '22 12:01

@fpetkovski thanks for pointing that out. We had an ingress with a bad backend, and removing it resolved the issue. However, a single broken ingress should not crash all of kube-state-metrics, so I would say there is still a bug here.

Error from the bad ingress: "Translation failed: invalid ingress spec: Ingress Backend is not a service"
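
For context, an Ingress of roughly the following shape, using a resource backend instead of a service backend, has no backend.service set and is the kind of object that can hit the code path above. This is a hypothetical example; the names are made up and it is not the actual object from our cluster.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: broken-ingress
  namespace: default
spec:
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          # Resource backend instead of a service backend,
          # so backend.service is unset.
          resource:
            apiGroup: cloud.google.com
            kind: BackendConfig
            name: my-backend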

bizrad • Jan 03 '22 13:01

Thanks for checking that. I agree that this is still a bug and we should keep the issue open until it is resolved :+1:

fpetkovski • Jan 03 '22 13:01

/assign

slashpai • Jan 04 '22 05:01

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot • Apr 04 '22 05:04

/remove-lifecycle stale

bizrad • Apr 04 '22 17:04

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot • Jul 03 '22 17:07

/unassign since I didn't get enough time to work on this

slashpai • Jul 04 '22 05:07

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot • Aug 03 '22 05:08

/remove-lifecycle rotten

mk46 • Aug 18 '22 12:08