webhook: Admission metrics are not reported by status code
This is several bugs in one report, since multiple things appear to be broken. Apologies if it gets complicated.
Problem 1: It appears that controller-runtime webhooks always respond with HTTP status code 200 OK even if the request is completely malformed.
Example (sending a malformed request to a webhook endpoint):
curl -k -v https://localhost:9443/validate-ship-example-org-v1beta1-frigate
< HTTP/2 200
< ... other response headers ...
<
{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}
Problem 2: controller-runtime webhooks do not use metav1.Status.Code from the handler's response and always respond with HTTP status code 200 OK.
Example code I used to register my custom handler at /test:
webhookServer.Register("/test", &webhook.Admission{Handler: &myHandler{}})

...

type myHandler struct{}

func (*myHandler) Handle(context.Context, admission.Request) admission.Response {
	return admission.Response{
		AdmissionResponse: admissionv1.AdmissionResponse{
			Allowed: false,
			Result: &metav1.Status{
				Code:    403,
				Message: "things suck",
				Status:  "Failure",
				Reason:  metav1.StatusReasonForbidden,
			},
		},
	}
}
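For completeness, webhookServer above comes from the manager; roughly (a minimal sketch, assuming an otherwise standard manager setup where cfg is a *rest.Config, e.g. from ctrl.GetConfigOrDie()):

mgr, err := manager.New(cfg, manager.Options{})
if err != nil {
	panic(err)
}
webhookServer := mgr.GetWebhookServer()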
And here's the query I am making:
curl -XPOST -k -v -H "Content-Type: application/json" https://localhost:9443/test
< HTTP/2 200
...
<
{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","response":{"uid":"","allowed":false,"status":{"metadata":{},"status":"Failure","message":"things suck","reason":"Forbidden","code":403}}}
Problem 3: controller-runtime doesn't expose any metrics about rejected requests and their .status.code values. This is likely because
- we always report code=200 no matter the response (Problems 1 & 2)
- it appears we're capping the controller_runtime_webhook_requests_total{code=...} dimension to only 200 and 500 here: https://github.com/kubernetes-sigs/controller-runtime/blob/4cae9dfbf174fa5d49ebca007bd6f3784cf1ffb3/pkg/webhook/internal/metrics/metrics.go#L70-L76
so even if I make a faulty request, only the code=200 metric goes up:
controller_runtime_webhook_requests_total{code="200",webhook="/test"} 4
In an ideal state, I'd like the controller_runtime_webhook_requests_total metric to let me understand things like how many requests I'm denying (non-200 codes such as 403, 400, 429, ...).
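For illustration only (hypothetical output, not something controller-runtime emits today), denials from the /test handler above would then show up as something like:
controller_runtime_webhook_requests_total{code="403",webhook="/test"} 4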
/kind bug
I believe the 200 response just means the webhook was able to create a response. There was an issue around this with failure policy; not sure if it helps with your issue: https://github.com/kubernetes-sigs/controller-runtime/issues/2662.
Pretty much what troy said. I do agree that this is confusing and that it would be helpful to have metrics around the status code in the admission response.
Indeed, after creating the issue I've realized the HTTP response code must be 200 regardless of the admission response (which contains the status code the apiserver would return in its response body).
Maybe we could still add a metric that parses the code out from admission.Response? (I understand we can't help if someone actually decides to implement their own http.Handler).
/remove-kind bug
/kind feature
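To make that suggestion concrete, here is a rough sketch (not existing controller-runtime API; the metric name and the instrumentedHandler type are made up for illustration) of a wrapper that records the code embedded in the admission.Response, since the HTTP status itself is always 200:

package webhookmetrics

import (
	"context"
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// Hypothetical counter, keyed by webhook path and the .response.status.code
// that is returned to the apiserver inside the AdmissionReview body.
var admissionResponseCodes = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "controller_runtime_webhook_admission_response_code_total",
		Help: "Admission responses, partitioned by webhook path and embedded status code.",
	},
	[]string{"webhook", "code"},
)

func init() {
	// Register with controller-runtime's metrics registry so the counter is
	// scraped alongside the existing controller_runtime_* metrics.
	metrics.Registry.MustRegister(admissionResponseCodes)
}

// instrumentedHandler wraps another admission.Handler and records the status
// code carried in the admission.Response (not the HTTP status, which is 200).
type instrumentedHandler struct {
	path  string
	inner admission.Handler
}

func (h *instrumentedHandler) Handle(ctx context.Context, req admission.Request) admission.Response {
	resp := h.inner.Handle(ctx, req)
	code := int32(http.StatusOK)
	if resp.Result != nil && resp.Result.Code != 0 {
		code = resp.Result.Code
	}
	admissionResponseCodes.WithLabelValues(h.path, strconv.Itoa(int(code))).Inc()
	return resp
}

It would then be registered the same way as above, e.g. webhookServer.Register("/test", &webhook.Admission{Handler: &instrumentedHandler{path: "/test", inner: &myHandler{}}}); controller-runtime could do something equivalent inside webhook.Admission itself so every handler gets it for free.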
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
/remove-lifecycle rotten
Adding a metric seems fine
@sbueringer: Reopened this issue.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen