kube-state-metrics
Node selection for fully qualified node-names fails (--node=ip-xx-xx-xx-xx.myzone.com)
What happened:
I'm trying to run the kube-state-metrics pods in DaemonSet mode with --resources=pods and --node=$(NODE_NAME). In my local testing on a Kind cluster this worked fine, but when I run it in a real EKS cluster I get odd behavior: the fieldSelector is built with the node name, but all of the dots have been stripped out of it:
eg:
containers:
- args:
  - -v=7
  - --resources=pods
  - --node="$(NODE_NAME)"
  - --port=8080
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
and then we see this:
I0417 20:56:30.604141 1 server.go:339] "Started kube-state-metrics self metrics server" telemetryAddress=":8081"
I0417 20:56:30.604284 1 builder.go:520] "FieldSelector is used" fieldSelector="spec.nodeName=ip-100-80-189-206us-west-2computeinternal"
I0417 20:56:30.604321 1 builder.go:282] "Active resources" activeStoreNames="pods"
I0417 20:56:30.604332 1 reflector.go:289] Starting reflector *v1.Pod (0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229
I0417 20:56:30.604342 1 reflector.go:325] Listing and watching *v1.Pod from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229
I0417 20:56:30.604381 1 server.go:73] level=info msg="Listening on" address=:8080
I0417 20:56:30.604414 1 server.go:73] level=info msg="TLS is disabled." http2=false address=:8080
I0417 20:56:30.604419 1 server.go:73] level=info msg="Listening on" address=:8081
I0417 20:56:30.604429 1 server.go:73] level=info msg="TLS is disabled." http2=false address=:8081
I0417 20:56:30.604442 1 round_trippers.go:463] GET https://172.20.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-100-80-189-206us-west-2computeinternal&limit=500&resourceVersion=0
I0417 20:56:30.604450 1 round_trippers.go:469] Request Headers:
I0417 20:56:30.604456 1 round_trippers.go:473] Accept: application/vnd.kubernetes.protobuf,application/json
We can verify that we are passing ip-100-80-189-206.us-west-2.compute.internal into the CLI arg properly:
[root@admin]# ps -ef | grep kube-state
65534 1343367 1343293 0 20:56 ? 00:00:00 /kube-state-metrics --port=8080 --telemetry-port=8081 -v=7 --resources=pods --node="ip-100-80-189-206.us-west-2.compute.internal" --port=8080
The reason we looked into this at all is that the pod comes up, but it doesn't report any metrics:
% curl -v localhost:8080/metrics
* Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /metrics HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain; version=0.0.4; charset=utf-8
< Date: Wed, 17 Apr 2024 21:00:28 GMT
< Content-Length: 0
<
* Connection #0 to host localhost left intact
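To see why the metrics endpoint comes back empty, you can run the two field selectors through kubectl yourself. This is just a sanity check built from the node name in the logs above, not output from our cluster:

# With the real FQDN, this should list the pods scheduled on the node:
kubectl get pods -A --field-selector spec.nodeName=ip-100-80-189-206.us-west-2.compute.internal

# With the dot-stripped name that kube-state-metrics built, it should match nothing,
# which is why the /metrics output above is empty:
kubectl get pods -A --field-selector spec.nodeName=ip-100-80-189-206us-west-2computeinternal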
After digging, I found https://github.com/kubernetes/kube-state-metrics/pull/2217, which introduced a regex pattern at https://github.com/kubernetes/kube-state-metrics/blob/d1f04c2479c792d15e420255d5c6829fdd95766c/pkg/options/types.go#L142-L154 that only matches short hostnames, not FQDNs.
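As a quick illustration of how a hostname-only character class mangles an FQDN, the following one-liner reproduces the exact string from the logs. The pattern [a-zA-Z0-9_-] is just my assumption for illustration; the actual regex in kube-state-metrics may differ:

# Keep only hostname-ish characters, then glue the matches back together:
echo "ip-100-80-189-206.us-west-2.compute.internal" | grep -oE '[a-zA-Z0-9_-]+' | tr -d '\n'
# -> ip-100-80-189-206us-west-2computeinternal   (same as the fieldSelector in the logs)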
What you expected to happen:
I expect that the input we pass in will be the input that is used - whether it is correct or not. I was completely thrown to see the code mutating my input, and effectively making the fieldSelector invalid.
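For illustration only, here is the list request I would have expected, i.e. the GET from the verbose log above with the dots left intact (reconstructed by hand, not actual output):

GET https://172.20.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-100-80-189-206.us-west-2.compute.internal&limit=500&resourceVersion=0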
Anything else we need to know?:
Environment:
- kube-state-metrics version: 2.12.20
- Kubernetes version (use kubectl version): 1.28.4
- Cloud provider or hardware configuration: EKS
- Other info:
@CatherineF-dev put up a fix at https://github.com/kubernetes/kube-state-metrics/pull/2373 ... 🚤
/triage accepted
/assign @CatherineF-dev
@diranged even though we have not tested v2.13.0 with a DS for this, I think we can tell from static analysis of the code that it should be fixed now. We are also unlikely to test this, as the need for a DS has gone away with the fixes in v2.13.0.
... so, tl;dr, I think we can close this issue and reopen it later if we see it again.
P.S. We are still keeping https://github.com/kubernetes/kube-state-metrics/issues/2372 open until we confirm that a KSM upgrade no longer causes the stale metrics issue that the DS was going to be a workaround for.