kube-state-metrics
CustomResourceDefinitions status fields cause spam of errors that cannot be fixed
What happened: Spam of errors that look like this:
"kube_customresource_phase" err="[status,phase]: expected value for path to be string, got <nil>"
What you expected to happen:
There should be no errors logged. Status fields are not guaranteed to exist at resource creation, and the behavior is inconsistent with built-in types, where a default value is taken instead.
How to reproduce it (as minimally and precisely as possible):
Create cr-config.yaml:
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: samplecontroller.k8s.io
        kind: "Foo"
        version: v1alpha1
      labelsFromPath:
        name: [metadata, name]
        namespace: [metadata, namespace]
      metricNamePrefix: "cr"
      metrics:
        - name: replicas
          each:
            type: Gauge
            gauge:
              path: [status, availableReplicas]
              nilIsZero: true
        - name: test
          each:
            type: StateSet
            stateSet:
              labelName: phase
              path: [status, phase]
              list:
                - Pending
                - Provisioning
                - Provisioned
                - Running
                - Deleting
                - Deleted
                - Failed
                - Unknown
Create a CRD with status and a valid object, but do not run a controller (this is one possible scenario):
kubectl apply -f https://raw.githubusercontent.com/kubernetes/sample-controller/master/artifacts/examples/crd.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/sample-controller/master/artifacts/examples/example-foo.yaml
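For reference, the example-foo.yaml manifest (reproduced from the link above, as of this writing) defines only a spec; with no controller running, the object never gets a status block, so [status, phase] resolves to nil:

apiVersion: samplecontroller.k8s.io/v1alpha1
kind: Foo
metadata:
  name: example-foo
spec:
  deploymentName: example-foo
  replicas: 1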
Run:
go run main.go --custom-resource-state-only --custom-resource-state-config-file cr-config.yaml --kubeconfig ~/.kube/config
The error repeats for every instance of a resource, and there can be thousands of such resources.
registry_factory.go:685] "cr_test" err="[status,phase]: expected value for path to be string, got <nil>"
Anything else we need to know?:
I believe this is a general problem for all CRDs and all status fields. Since there can be many different objects, the error isn't helpful enough as-is. It might be useful to log it only in verbose mode, and to include the resource name and kind.
Environment:
- kube-state-metrics version: commit f7304dc8d8337ad4028c6e9f02dd47ab2fb0aa52
- Kubernetes version (use kubectl version): Client Version: v1.30.2 (shouldn't matter), Server Version: v1.30.0
- Cloud provider or hardware configuration: kind (or any other Kubernetes cluster)
- Other info: n/a
Actually, nilIsZero might be a better solution, as it gives users some control: it would set all states to zero by default, similarly to gauges.
After some more thinking, nilIsZero makes sense only with a singular gauge where all labels are known upfront. For generated labels, there is no way to create a sensible metric when some labels are missing.
Thus it should be a normal case for a series to not exist, not an error.
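To illustrate with a sketch of the exposition output for the cr_test StateSet above (label sets abbreviated; real output carries additional labels):

# status.phase set to "Running": one series per listed state
cr_test{name="example-foo",phase="Pending"} 0
cr_test{name="example-foo",phase="Running"} 1
# ...and so on for the remaining states.
# status.phase unset: no label value can be chosen, so the series should
# simply be absent from the output rather than producing an error log.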
/assign @rexagod
/triage accepted
Any movement on this? We get a lot of these errors spammed in our logs, and it's hard to know whether kube-state-metrics is functioning.
While it's reasonable to put these errors behind a more verbose filter, the reason they happen is that KSM does not know how long to wait for the fields to be populated: it starts parsing the objects immediately and reports errors for any fields that are absent at that point.
For example, VPA recommendations are generated a little while after the resource is deployed, so KSM reports the same errors there as well, and stops logging them as soon as the fields get populated.
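As an illustration, a config sketch along these lines hits the same window (assuming the autoscaling.k8s.io/v1 VPA CRD; the paths follow the VPA API, and this sketch is untested):

kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: autoscaling.k8s.io
        version: v1
        kind: VerticalPodAutoscaler
      metricNamePrefix: "vpa"
      metrics:
        - name: target_cpu
          each:
            type: Gauge
            gauge:
              # status.recommendation does not exist until the recommender has
              # observed the workload, so this path is nil right after creation
              # and KSM logs the same error until it is populated.
              path: [status, recommendation, containerRecommendations]
              valueFrom: [target, cpu]
              labelsFromPath:
                container: [containerName]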
Any updates on this issue? We have multiple vclusters where kube-state-metrics is installed, and we see a lot of these errors in the logs. The problem is that the container gets terminated (OOMKilled) after encountering this error many times, especially for the deletion_timestamp metric.
ACK, at this point I believe it's best to increase the verbosity level for these errors. I'll send a patch.
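If the patch gates these behind a klog verbosity level, they should still be recoverable for debugging by starting KSM with a higher --v, along these lines (the exact level the patch will use is a guess on my part):

go run main.go --custom-resource-state-only --custom-resource-state-config-file cr-config.yaml --kubeconfig ~/.kube/config --v=5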