
Prometheus support

Open runningman84 opened this issue 5 years ago • 30 comments

It would be great to have metrics per ALB group, so we could see how many targets are registered and whether there are any rule failures...

Right now an IAM error like this can cause the whole system to fail:

hop-zed"],"leavingMembers":[]}
{"level":"error","ts":1585751311.2310958,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"alb-ingress-controller","request":"/spryker","error":"AccessDeniedException: 

With a corresponding metric we could monitor such problems... An external monitor (URL check) would not help because some old targets might still be up and running.

runningman84 avatar Apr 01 '20 14:04 runningman84

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jul 09 '20 16:07 fejta-bot

/remove-lifecycle stale.

runningman84 avatar Jul 09 '20 17:07 runningman84

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Aug 08 '20 18:08 fejta-bot

@runningman84 Just to make sure I'm fully understanding the situation: there is zero support for Prometheus today?

As someone considering using the ALB Ingress controller, am I correct in understanding that there is no supported way to "monitor" or alert on controller reconciliation failures?

I could potentially alert based on logs in Splunk or something, but that's not very elegant.

clayvan avatar Aug 11 '20 13:08 clayvan

That's unfortunately correct; right now there is zero support for Prometheus...

You could run a custom container which parses the logs and publishes them as metrics...
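
For illustration, such a sidecar might look roughly like this. This is only a sketch, assuming the JSON log format shown above and prometheus/client_golang; the metric name, label, and port are made up:

```go
// Hypothetical log-scraping sidecar: reads the controller's JSON logs on
// stdin, counts "Reconciler error" lines per request, and exposes the
// count on a Prometheus /metrics endpoint.
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var reconcileErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "alb_log_reconcile_errors_total", // made-up metric name
		Help: "Reconciler error log lines, by request.",
	},
	[]string{"request"},
)

func main() {
	prometheus.MustRegister(reconcileErrors)

	// Serve /metrics for Prometheus to scrape.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":9090", nil))
	}()

	// Parse each log line and count reconciler errors per request.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		var entry struct {
			Msg     string `json:"msg"`
			Request string `json:"request"`
		}
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			continue // not a JSON log line
		}
		if entry.Msg == "Reconciler error" {
			reconcileErrors.WithLabelValues(entry.Request).Inc()
		}
	}
}
```

The sidecar would need the controller's log stream piped into it (for example via a shared log file or a log tailer) and would then be scraped like any other Prometheus target.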

runningman84 avatar Aug 11 '20 14:08 runningman84

/remove-lifecycle rotten

runningman84 avatar Aug 11 '20 14:08 runningman84

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Nov 09 '20 15:11 fejta-bot

/kind feature

kishorj avatar Nov 18 '20 22:11 kishorj

/remove-lifecycle stale

techdragon avatar Dec 09 '20 02:12 techdragon

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Mar 09 '21 03:03 fejta-bot

It looks like there are Prometheus metrics, but they do not expose any data at a per-ingress or per-service level. Furthermore, there is no data in the ingress itself indicating that there may be an issue.

In the previous alb-ingress-controller, there was a metric called aws_alb_ingress_controller_errors{ingress="<namespace>/<name>"}, and we could use this to help notify teams that their ingress was misconfigured.

With the current metrics, we can only alert when something is failing to reconcile, and we are required to parse logs to understand the specific issue.

This is a pretty major usability issue.
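
For context, a counter shaped like that old metric could be registered roughly like this. This is a sketch assuming prometheus/client_golang, not the actual controller code, and the helper function is hypothetical:

```go
// Sketch of a per-ingress error counter in the shape of the old
// aws_alb_ingress_controller_errors metric.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var ingressErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "aws_alb_ingress_controller_errors",
		Help: "Reconcile errors per ingress (namespace/name).",
	},
	[]string{"ingress"},
)

func init() {
	prometheus.MustRegister(ingressErrors)
}

// RecordReconcileError is a hypothetical helper. Because the counter is
// labeled with the namespaced ingress name, alerts based on it can be
// routed to the team owning that ingress, which a controller-level
// error counter cannot do.
func RecordReconcileError(namespace, name string) {
	ingressErrors.WithLabelValues(namespace + "/" + name).Inc()
}
```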

jutley avatar Jul 15 '21 21:07 jutley

/assign m00nf1sh Check if the new controller runtime provides some useful metrics.

kishorj avatar Jul 21 '21 22:07 kishorj

Add new metrics for ingress group usage, such as the number of ingress groups and provisioned ALBs in the cluster, and a count of errors encountered per group.
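
A minimal sketch of what those group-level metrics could look like, assuming prometheus/client_golang; all metric names here are placeholders, not an agreed-upon naming scheme:

```go
// Sketch of possible group-level metrics: ingress group count,
// provisioned ALBs per group, and errors per group.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Number of ingress groups currently managed by the controller.
	ingressGroups = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "awslbc_ingress_groups", // placeholder name
		Help: "Ingress groups managed by the controller.",
	})

	// ALBs provisioned in the cluster, per ingress group.
	provisionedALBs = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "awslbc_provisioned_albs", // placeholder name
		Help: "ALBs provisioned per ingress group.",
	}, []string{"group"})

	// Errors encountered while reconciling an ingress group.
	groupErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "awslbc_ingress_group_errors_total", // placeholder name
		Help: "Errors encountered per ingress group.",
	}, []string{"group"})
)

func init() {
	prometheus.MustRegister(ingressGroups, provisionedALBs, groupErrors)
}
```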

kishorj avatar Jul 28 '21 22:07 kishorj

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 26 '21 22:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 25 '21 23:11 k8s-triage-robot

/remove-lifecycle rotten

runningman84 avatar Nov 26 '21 22:11 runningman84

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 24 '22 23:02 k8s-triage-robot

/remove-lifecycle stale

pie-r avatar Mar 10 '22 21:03 pie-r

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 08 '22 21:06 k8s-triage-robot

/remove-lifecycle stale

runningman84 avatar Jun 09 '22 05:06 runningman84

I can see Prometheus metrics like:

aws_api_call_retries_bucket{operation="CreateTargetGroup",service="Elastic Load Balancing v2",le="0"} 11

but not at the target group level.

tooptoop4 avatar Jul 08 '22 05:07 tooptoop4

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 06 '22 06:10 k8s-triage-robot

/remove-lifecycle stale

tooptoop4 avatar Oct 06 '22 08:10 tooptoop4

+1

RamazanKara avatar Nov 15 '22 14:11 RamazanKara

+1

sanjeevpandey19 avatar Feb 06 '23 11:02 sanjeevpandey19

It looks like there are Prometheus metrics, but they do not expose any data at a per-ingress or per-service level. Furthermore, there is no data in the ingress itself indicating that there may be an issue.

In the previous alb-ingress-controller, there was a metric called aws_alb_ingress_controller_errors{ingress="<namespace>/<name>"}, and we could use this to help notify teams that their ingress was misconfigured.

With the current metrics, we can only alert when something is failing to reconcile, and we are required to parse logs to understand the specific issue.

This is a pretty major usability issue.

@kishorj @M00nF1sh any plans on fixing this regression? The current metric controller_runtime_reconcile_errors_total does not provide the same information that the previous aws_alb_ingress_controller_errors metric did.

dudicoco avatar May 06 '23 10:05 dudicoco

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 19 '24 19:01 k8s-triage-robot

/remove-lifecycle stale

runningman84 avatar Jan 19 '24 21:01 runningman84

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 18 '24 21:04 k8s-triage-robot

/remove-lifecycle stale

runningman84 avatar Apr 19 '24 04:04 runningman84