Kubernetes CRD Statuses
Welcome!
- [X] Yes, I've searched similar issues on GitHub and didn't find any.
- [X] Yes, I've searched similar issues on the Traefik community forum and didn't find any.
What did you expect to see?
Hey,
I've been working with traefik lately and we've been missing configuration issues on bulk updates. Therefore I started to look into setting up alerting on the statuses of our Routers, Services and Middlewares. This is where things started to get tricky.
For context, we're using traefik on several k8s clusters and make use of the traefik CRDs IngressRoute and Middleware. In the traefik web UI, it is easy to discover if one of these is badly configured because traefik reports an error.
I'm using k8s CRDs, which means that the resulting resources can have a status. The status of a k8s custom resource allows services inside the k8s cluster to query its current state. Examples of projects that implement this are external-dns and cert-manager. But I noticed that traefik CRDs do not expose statuses. These statuses could have been captured by kube-state-metrics (a service that can collect the state of a lot of k8s resources) and I would have been able to monitor them.
Add statuses to traefik CRDs
As a kubernetes administrator, I want to be able to see the statuses of my custom resources so that I can collect them and act upon them.
Description
Add a status field on the CRDs, to be populated by traefik, so that external services can use the k8s API to fetch the current status of the custom resources.
Expose a status field for the different traefik CRDs:
- IngressRoute
- IngressRouteTCP
- IngressRouteUDP
- Middleware
- MiddlewareTCP
- ServersTransport
- TLSOption
- TLSStore
- TraefikService
For each CRD, I would add a status field (if applicable) which would look like:
metadata:
  ...
status:
  conditions:
    - lastTransitionTime: '<timestamp>'
      message: IngressRoute is up to date.
      observedGeneration: 2
      reason: Ready
      status: 'True'
      type: Ready
  revision: 16
spec:
  ...
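To illustrate how an external service might consume such a status, here is a minimal Python sketch. It assumes the custom resource has already been fetched from the k8s API as a plain dictionary. The `is_ready` helper and the sample object are hypothetical, and the stale-generation check follows the usual Kubernetes condition convention rather than anything traefik implements today.

```python
from typing import Any, Mapping


def is_ready(resource: Mapping[str, Any]) -> bool:
    """Return True if the resource reports an up-to-date Ready condition."""
    generation = resource.get("metadata", {}).get("generation")
    conditions = resource.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") != "Ready":
            continue
        # A condition computed for an older spec is stale: treat it as not ready.
        if generation is not None and cond.get("observedGeneration") != generation:
            return False
        # Condition statuses are the strings "True", "False" or "Unknown".
        return cond.get("status") == "True"
    # No Ready condition at all: the controller has not reported anything yet.
    return False


# Hypothetical IngressRoute, shaped like the status proposed above.
ingress_route = {
    "metadata": {"name": "example", "generation": 2},
    "status": {
        "conditions": [
            {
                "type": "Ready",
                "status": "True",
                "reason": "Ready",
                "message": "IngressRoute is up to date.",
                "observedGeneration": 2,
            }
        ],
        "revision": 16,
    },
}

print(is_ready(ingress_route))
```

A resource without any status, which is what traefik custom resources look like today, would simply be reported as not ready by this helper.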
Technical Implementation
I do not have sufficient background in k8s CRD development to be able to give any advice on this.
I'd like to give special thanks to the traefik team and all of its contributors for this amazing project.
Hey @zopanix! Thanks for your suggestion.
We are interested in this issue but are unsure about the use case and the traction it will receive, so we are going to leave the status as "kind/proposal" to give the community time to let us know that they would like this.
Note that we've linked it to a meta issue about CRD improvements we could bring.
Hi,
At my company we are facing the same problem: there was a templating error in one of our middlewares and it went unnoticed. When we wanted to improve on this, we could not find any usable metric, only the logs.
What we noticed, though, is that there's a metric traefik_config_reloads_total which is updated every time a middleware changes, while the other metric, traefik_config_last_reload_failure, is always set to 0. I suppose it only keeps track of serious misconfigurations. Since the first metric is increased every time a middleware changes, would it be possible to increase the second one when an error occurs?
It would not tell us which middleware has failed, but at least we would know that something is wrong. Does that sound reasonable?
Thanks a lot in advance!
Hey @joe-pll,
Thanks for reaching out.
The issue you are raising is already tracked here.
The metrics you mentioned were removed from v3 because the notion of failure in the API contract doesn't allow us to increase its value correctly.
However, we would like to know about your needs and ideas concerning misconfiguration metrics. Could you open a dedicated issue?
In the meantime, I'm marking your comment as off-topic for the current proposal.
Hi @nmengin,
Thanks a lot for your answer. Before creating a new issue I want to be sure, because our case really looks like the first of the two proposed by zopanix:
As a traefik operator, I want to be able to provide central observability on traefik resources so that I can alert on failures and/or misconfigurations.
Description
Expose additional metrics about the state of the different traefik resources:
entrypoints
routers
middlewares
services
For each resource, I would add two additional metrics at least which are:
status: a value that represents the current state of the resource.
created: a unix timestamp of when the resource was created.
Note: For the status, the usual convention in the prometheus community is to put the state as a label on the metric, setting the value to 1 for the current state and to 0 for the other states. Example: traefik_router_status{"router_name"="foo", "status"="Error", ...}
As I mentioned, we had a misconfiguration that went unnoticed for days until it was actually used. Do you want to keep only the second case in this issue? If so, I will create a new issue.
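The state-as-label convention described in the note can be sketched in a few lines of Python. This is only an illustration: the state names in `STATES`, the `status_samples` helper and the label names are assumptions, not metrics that traefik exposes.

```python
from typing import List

# Hypothetical set of states a traefik resource could be in.
STATES = ("Ok", "Warning", "Error")


def status_samples(metric: str, name_label: str, name: str, current: str) -> List[str]:
    """Build one sample per state: 1 for the current state, 0 for the others."""
    if current not in STATES:
        raise ValueError(f"unknown state: {current}")
    lines = []
    for state in STATES:
        value = 1 if state == current else 0
        # Doubled braces produce the literal braces of the label set.
        lines.append(f'{metric}{{{name_label}="{name}",status="{state}"}} {value}')
    return lines


for line in status_samples("traefik_router_status", "router_name", "foo", "Error"):
    print(line)
```

With this layout, an alert only has to look for samples whose status label is "Error" and whose value is 1.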
Hello @joe-pll,
Thanks for this feedback. You are right, and we should have asked earlier for the two topics to be opened as separate issues.
Could you please open a new issue for your case (additional metrics)? Once opened, we will edit this issue to keep it focused on the CRD status update.
Thanks @joe-pll for opening #10236, this issue now only focuses on providing a status update mechanism for Kubernetes Traefik CRs.
In addition to a status field for conditions, could we also have e.g. a hostname published for IngressRoute objects? Similar to how normal Ingress objects do today.
That would help with https://github.com/kubernetes-sigs/external-dns/issues/3967
Hello there,
Thanks for your suggestion @zopanix, we think it makes a lot of sense.
Unfortunately, this would not make it to our roadmap for a while as we are focused elsewhere. If you or another community member would like to build it, let us know, and we will work with you to ensure you have all the information needed to merge it.
We'd prefer to work with our community members at the beginning of the design process to ensure that we are aligned and can move quickly with the review and merge process.
Let us know here or create a PR before you start, and we will work with you there.
Don’t forget to check out the contributor docs and link the PR to this issue.
+1
+1 This would help us a lot right now, as the Traefik metrics don't seem to show this. We would like to be alerted when IngressRoutes/Middlewares are not working, instead of manually checking the Traefik dashboard.
+1