aws-load-balancer-controller
Prometheus support
It would be great to have metrics per ALB group, showing how many targets are registered and whether there are any rule failures...
Right now an IAM error like this can cause the whole system to fail:
hop-zed"],"leavingMembers":[]}
{"level":"error","ts":1585751311.2310958,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"alb-ingress-controller","request":"/spryker","error":"AccessDeniedException:
With a corresponding metric we could monitor such problems... An external monitor (URL check) would not help because some old targets might still be up and running.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
@runningman84 Just to make sure I'm fully understanding the situation, there is 0 support for prometheus today?
As someone considering using ALB Ingress controller, am I correct to understand that there is no supported way to "monitor" or alert on controller reconciliation failures?
I could potentially alert based on logs in Splunk or something, but that's not very elegant.
That's unfortunately correct; right now there is zero support for Prometheus...
You could run a custom container which parses the logs and publishes them as metrics...
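A minimal sketch of such a sidecar could look like the following, assuming the controller's JSON log stream is piped to its stdin; the metric name and the reason label are made up for illustration:

```go
// Sketch of a log-scraping sidecar (not part of the controller). It reads the
// controller's JSON logs from stdin, counts "error"-level lines, and exposes
// the counter on /metrics. Metric name and "reason" label are hypothetical.
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var logErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "alb_controller_log_errors_total", // hypothetical name
		Help: "Error log lines emitted by the ALB ingress controller.",
	},
	[]string{"reason"},
)

func main() {
	prometheus.MustRegister(logErrors)

	// Serve the counters for Prometheus to scrape.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":9100", nil))
	}()

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		var entry map[string]interface{}
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			continue // not a JSON log line
		}
		if entry["level"] != "error" {
			continue
		}
		// Derive a coarse reason, e.g. "AccessDeniedException", from the error text.
		reason := "unknown"
		if msg, ok := entry["error"].(string); ok {
			if idx := strings.Index(msg, ":"); idx > 0 {
				reason = msg[:idx]
			}
		}
		logErrors.WithLabelValues(reason).Inc()
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```

You would then add a scrape job for the sidecar's port 9100 and alert on increases of the counter.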
/remove-lifecycle rotten
/kind feature
/remove-lifecycle stale
It looks like there are Prometheus metrics, but they do not expose any data at a per-Ingress or per-Service level. Furthermore, there is no information on the Ingress object itself that indicates there may be an issue.
In the previous alb-ingress-controller, there was a metric called aws_alb_ingress_controller_errors{ingress="<namespace>/<name>"}, and we could use this to help notify teams that their ingress was misconfigured.
With the current metrics, we can only alert when something is failing to reconcile, and we are required to parse logs to understand the specific issue.
This is a pretty major usability issue.
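For illustration, an alert built on that old metric could look roughly like the rule below; this is a sketch that assumes counter semantics for the metric, and the threshold, severity, and annotation are made up:

```yaml
# Sketch of a per-ingress alert on the old aws_alb_ingress_controller_errors
# metric. Assumes counter semantics; threshold, severity, and annotation are
# illustrative, not a rule shipped with the controller.
groups:
  - name: alb-ingress-controller
    rules:
      - alert: IngressReconcileErrors
        expr: increase(aws_alb_ingress_controller_errors[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Ingress {{ $labels.ingress }} is failing to reconcile"
```

As far as I can tell, nothing equivalent can be written against controller_runtime_reconcile_errors_total, since that metric is only labelled by controller, not by ingress.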
/assign m00nf1sh
- Check whether the new controller-runtime provides any useful metrics out of the box.
- Add new metrics for ingress group usage, such as the number of ingress groups and provisioned ALBs in the cluster and a count of errors encountered per group (a rough sketch of what that could look like follows).
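Something along these lines, registered with the controller-runtime metrics registry that already backs the /metrics endpoint, would make the proposal concrete; the metric names and labels below are hypothetical, not existing metrics of the project:

```go
// Hypothetical sketch of the proposed ingress-group metrics, registered with
// the controller-runtime registry the controller already serves on /metrics.
// Metric names and labels are illustrative only.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// Number of ingress groups currently managed by the controller.
	ingressGroups = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "awslbc_ingress_groups",
		Help: "Number of ingress groups managed by the controller.",
	})

	// Number of ALBs provisioned, partitioned by ingress group.
	provisionedALBs = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "awslbc_provisioned_albs",
		Help: "Number of ALBs provisioned per ingress group.",
	}, []string{"ingress_group"})

	// Reconcile errors per ingress group, analogous to the old
	// aws_alb_ingress_controller_errors metric.
	groupErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "awslbc_ingress_group_errors_total",
		Help: "Errors encountered while reconciling an ingress group.",
	}, []string{"ingress_group"})
)

func init() {
	// controller-runtime exposes everything in this registry on /metrics.
	ctrlmetrics.Registry.MustRegister(ingressGroups, provisionedALBs, groupErrors)
}

// RecordError would be called from the reconciler's error path.
func RecordError(group string) {
	groupErrors.WithLabelValues(group).Inc()
}
```

Per-group gauges for registered targets could be added the same way.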
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/remove-lifecycle stale
/remove-lifecycle stale
I can see Prometheus metrics like:
aws_api_call_retries_bucket{operation="CreateTargetGroup",service="Elastic Load Balancing v2",le="0"} 11
but not at the target group level.
/remove-lifecycle stale
+1
+1
> It looks like there are Prometheus metrics, but they do not expose any data at a per-Ingress or per-Service level. Furthermore, there is no information on the Ingress object itself that indicates there may be an issue.
> In the previous alb-ingress-controller, there was a metric called aws_alb_ingress_controller_errors{ingress="<namespace>/<name>"}, and we could use this to help notify teams that their ingress was misconfigured. With the current metrics, we can only alert when something is failing to reconcile, and we are required to parse logs to understand the specific issue.
> This is a pretty major usability issue.
@kishorj @M00nF1sh any plans on fixing this regression? The current metric controller_runtime_reconcile_errors_total does not provide the same information that the previous aws_alb_ingress_controller_errors metric did.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/remove-lifecycle stale