cloudwatch_exporter

Dimensions problem with AWS/ApplicationELB:HTTPCode_Target_5XX_Count

Open StochasticPirate opened this issue 2 years ago • 21 comments

Version: prom/cloudwatch-exporter:v0.14.3

Config:

region: eu-west-1
metrics:
  :
  :
- aws_namespace: AWS/ApplicationELB
  aws_metric_name: HTTPCode_Target_4XX_Count
  aws_dimensions: [AvailabilityZone, LoadBalancer]
  aws_statistics: [Sum]
- aws_namespace: AWS/ApplicationELB
  aws_metric_name: HTTPCode_Target_5XX_Count
  aws_dimensions: [AvailabilityZone, LoadBalancer]
  aws_statistics: [Sum]

Problem: Every combination of the dimensions AvailabilityZone, LoadBalancer, and TargetGroup (each singly, every pair, and all three) gives the following log output, and the metric does not appear in Prometheus.

Jul 06, 2022 1:26:57 PM org.eclipse.jetty.server.Server doStart
INFO: jetty-11.0.9; built: 2022-03-30T17:44:47.085Z; git: 243a48a658a183130a8c8de353178d154ca04f04; jvm 17.0.3+7
Jul 06, 2022 1:26:57 PM org.eclipse.jetty.server.handler.ContextHandler doStart
INFO: Started o.e.j.s.ServletContextHandler@14c01636{/,null,AVAILABLE}
Jul 06, 2022 1:26:57 PM org.eclipse.jetty.server.AbstractConnector doStart
INFO: Started ServerConnector@3a3e78f{HTTP/1.1, (http/1.1)}{0.0.0.0:9106}
Jul 06, 2022 1:26:57 PM org.eclipse.jetty.server.Server doStart
INFO: Started Server@7b5a12ae{STARTING}[11.0.9,sto=0] @630ms
Jul 06, 2022 1:27:06 PM io.prometheus.cloudwatch.CloudWatchCollector listDimensions
WARNING: (listDimensions) ignoring metric AWS/ApplicationELB:HTTPCode_Target_4XX_Count due to dimensions mismatch
Jul 06, 2022 1:27:06 PM io.prometheus.cloudwatch.CloudWatchCollector listDimensions
WARNING: (listDimensions) ignoring metric AWS/ApplicationELB:HTTPCode_Target_5XX_Count due to dimensions mismatch
Jul 06, 2022 1:27:25 PM io.prometheus.cloudwatch.CloudWatchCollector listDimensions
WARNING: (listDimensions) ignoring metric AWS/ApplicationELB:HTTPCode_Target_4XX_Count due to dimensions mismatch
Jul 06, 2022 1:27:25 PM io.prometheus.cloudwatch.CloudWatchCollector listDimensions
WARNING: (listDimensions) ignoring metric AWS/ApplicationELB:HTTPCode_Target_5XX_Count due to dimensions mismatch

Any pointers appreciated

StochasticPirate avatar Jul 06 '22 13:07 StochasticPirate

Hi @StochasticPirate ,

Please see this comment as a pointer - https://github.com/prometheus/cloudwatch_exporter/issues/432#issuecomment-1149572350

In general - seems like you want to run the following:

aws cloudwatch list-metrics --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_4XX_Count
# and after...
# aws cloudwatch list-metrics --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count

and see which dimension combinations interest you.
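One way to summarize that output is to group the listed metrics by their set of dimension names. The following is a minimal Python sketch, assuming the JSON shape that `aws cloudwatch list-metrics` prints (a top-level `Metrics` list whose entries carry `Dimensions` with `Name`/`Value` pairs). The sample data, including the load balancer and target group values, is made up for illustration, and `dimension_combinations` is a hypothetical helper name.

```python
import json
from collections import Counter

# Hypothetical sample in the shape printed by `aws cloudwatch list-metrics`.
# All dimension values below are invented for illustration.
SAMPLE_OUTPUT = """
{
  "Metrics": [
    {"Namespace": "AWS/ApplicationELB",
     "MetricName": "HTTPCode_Target_4XX_Count",
     "Dimensions": [
       {"Name": "LoadBalancer", "Value": "app/example/0123456789abcdef"},
       {"Name": "AvailabilityZone", "Value": "eu-west-1a"}]},
    {"Namespace": "AWS/ApplicationELB",
     "MetricName": "HTTPCode_Target_4XX_Count",
     "Dimensions": [
       {"Name": "LoadBalancer", "Value": "app/example/0123456789abcdef"},
       {"Name": "AvailabilityZone", "Value": "eu-west-1b"}]},
    {"Namespace": "AWS/ApplicationELB",
     "MetricName": "HTTPCode_Target_4XX_Count",
     "Dimensions": [
       {"Name": "TargetGroup", "Value": "targetgroup/example/fedcba9876543210"},
       {"Name": "LoadBalancer", "Value": "app/example/0123456789abcdef"},
       {"Name": "AvailabilityZone", "Value": "eu-west-1a"}]},
    {"Namespace": "AWS/ApplicationELB",
     "MetricName": "HTTPCode_Target_4XX_Count",
     "Dimensions": [
       {"Name": "TargetGroup", "Value": "targetgroup/example/fedcba9876543210"},
       {"Name": "LoadBalancer", "Value": "app/example/0123456789abcdef"}]}
  ]
}
"""


def dimension_combinations(list_metrics_json: str) -> Counter:
    """Count how many listed metrics exist for each set of dimension names."""
    metrics = json.loads(list_metrics_json).get("Metrics", [])
    return Counter(
        tuple(sorted(d["Name"] for d in m.get("Dimensions", [])))
        for m in metrics
    )


combos = dimension_combinations(SAMPLE_OUTPUT)
for names, count in sorted(combos.items()):
    print(f"{count} metric(s) with dimensions {list(names)}")
```

Each distinct tuple printed here is a candidate value for `aws_dimensions`. If the real command prints an empty `Metrics` list, there is no combination to match against.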

or-shachar avatar Jul 06 '22 14:07 or-shachar

Yes, thanks, that returns 3 combinations: [LB, AZ], [TG, LB, AZ], [TG, LB].

All of these combinations return the "ignoring metric due to dimensions mismatch" warning.

StochasticPirate avatar Jul 06 '22 14:07 StochasticPirate

Interesting... Sharing with you the code here:

do {
  requestBuilder.nextToken(nextToken);
  ListMetricsResponse response = cloudWatchClient.listMetrics(requestBuilder.build());
  cloudwatchRequests.labels("listMetrics", rule.awsNamespace).inc();
  for (Metric metric : response.metrics()) {
    if (metric.dimensions().size() != dimensionFilters.size()) {
      // AWS returns all the metrics with dimensions beyond the ones we ask for,
      // so filter them out.
      continue;
    }
    if (useMetric(rule, tagBasedResourceIds, metric)) {
      dimensions.add(metric.dimensions());
    }
  }
  nextToken = response.nextToken();
} while (nextToken != null);
if (dimensions.isEmpty()) {
  LOGGER.warning(
      String.format(
          "(listDimensions) ignoring metric %s:%s due to dimensions mismatch",
          rule.awsNamespace, rule.awsMetricName));
}

Several things could cause this problem:

  1. response.metrics() returns an empty list (maybe you got the region wrong? Do you see any other metrics using the tool? Maybe something is off with the credentials?)
  2. metric.dimensions().size() != dimensionFilters.size() - maybe you have some hidden characters in your YAML that make the list size differ?
  3. useMetric(rule, tagBasedResourceIds, metric) always returns false for all returned metrics. Since you didn't specify awsDimensionSelect/awsTagSelect/awsDimensionSelectRegex, I think this option is less likely.
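To see how the first two cases both end in the same warning, here is a small Python paraphrase of the loop above. It is a sketch for reasoning, not the exporter's code: `fake_list_metrics`, `list_dimensions`, and `warns` are hypothetical names, the `useMetric` check is left out, and the stand-in for ListMetrics assumes (as the code comment above says) that AWS also returns metrics carrying extra dimensions beyond the requested ones. The metric values are invented.

```python
from typing import Dict, List

# A metric is represented only by its dimensions, as name -> value.
Metric = Dict[str, str]


def fake_list_metrics(known: List[Metric], wanted: List[str]) -> List[Metric]:
    """Stand-in for ListMetrics: return metrics that carry every wanted dimension.

    Assumption: extra dimensions on a metric do not exclude it from the result.
    """
    return [m for m in known if all(name in m for name in wanted)]


def list_dimensions(known: List[Metric], wanted: List[str]) -> List[Metric]:
    """Mirror the size check from the Java loop (the useMetric check is omitted)."""
    kept = []
    for metric in fake_list_metrics(known, wanted):
        if len(metric) != len(wanted):
            continue  # metric has dimensions beyond the ones asked for
        kept.append(metric)
    return kept


def warns(known: List[Metric], wanted: List[str]) -> bool:
    """True when the 'dimensions mismatch' warning would be logged."""
    return not list_dimensions(known, wanted)


# Hypothetical metrics, one per combination reported earlier in this thread.
KNOWN = [
    {"LoadBalancer": "app/example/0123", "AvailabilityZone": "eu-west-1a"},
    {"TargetGroup": "targetgroup/example/4567",
     "LoadBalancer": "app/example/0123",
     "AvailabilityZone": "eu-west-1a"},
    {"TargetGroup": "targetgroup/example/4567",
     "LoadBalancer": "app/example/0123"},
]

# An exact match exists, so nothing is dropped entirely.
print(warns(KNOWN, ["AvailabilityZone", "LoadBalancer"]))
# Only metrics with additional dimensions exist, so all are filtered out.
print(warns(KNOWN, ["AvailabilityZone"]))
# Nothing is listed at all, so the result is empty as well.
print(warns([], ["AvailabilityZone", "LoadBalancer"]))
```

In this model the warning cannot tell "no metric has exactly these dimensions" apart from "ListMetrics returned nothing", because both leave the collected list empty.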

or-shachar avatar Jul 06 '22 16:07 or-shachar

Tested it locally with this configuration:

region: us-east-2
use_get_metric_data: true
metrics:
- aws_namespace: AWS/ApplicationELB
  aws_metric_name: HTTPCode_Target_4XX_Count
  aws_dimensions: [AvailabilityZone, LoadBalancer]
  aws_statistics: [Sum]

Seems to be working fine.

or-shachar avatar Jul 06 '22 16:07 or-shachar

Thanks for the response. I tried your exact config file (changing only the region to eu-west-1) and still get the following log:

Jul 06, 2022 5:31:47 PM org.eclipse.jetty.server.Server doStart
INFO: jetty-11.0.9; built: 2022-03-30T17:44:47.085Z; git: 243a48a658a183130a8c8de353178d154ca04f04; jvm 17.0.3+7
Jul 06, 2022 5:31:47 PM org.eclipse.jetty.server.handler.ContextHandler doStart
INFO: Started o.e.j.s.ServletContextHandler@56a4479a{/,null,AVAILABLE}
Jul 06, 2022 5:31:47 PM org.eclipse.jetty.server.AbstractConnector doStart
INFO: Started ServerConnector@4dd02341{HTTP/1.1, (http/1.1)}{0.0.0.0:9106}
Jul 06, 2022 5:31:47 PM org.eclipse.jetty.server.Server doStart
INFO: Started Server@60975100{STARTING}[11.0.9,sto=0] @618ms
Jul 06, 2022 5:32:05 PM io.prometheus.cloudwatch.CloudWatchCollector listDimensions
WARNING: (listDimensions) ignoring metric AWS/ApplicationELB:HTTPCode_Target_4XX_Count due to dimensions mismatch
Jul 06, 2022 5:32:24 PM io.prometheus.cloudwatch.CloudWatchCollector listDimensions
WARNING: (listDimensions) ignoring metric AWS/ApplicationELB:HTTPCode_Target_4XX_Count due to dimensions mismatch

When you say "working fine" do you mean that you don't see the warning in your logs?

Thanks.

StochasticPirate avatar Jul 06 '22 17:07 StochasticPirate

Yes, I see no warnings or errors in my logs.

Do you have a way to debug the code locally or to double check that you're running against the same account you used when you ran the aws cloudwatch CLI?

or-shachar avatar Jul 06 '22 17:07 or-shachar

Definitely running against the same account.

StochasticPirate avatar Jul 06 '22 18:07 StochasticPirate

I can try to create a new version with a little more debug information to understand exactly where the problem is, but it may take a few days. Alternatively, do you have a way to run the application in debug mode?

or-shachar avatar Jul 06 '22 20:07 or-shachar

Thanks. I was going to have a look today to see whether there is a way to create more debugging output. I was also going to look at permissions, although the application is able to create some other metrics OK. It seems to be the [2|3|4|5]XX metrics that it struggles with. If you could produce a version with some extra debugging that'd be great, thank you 👍

StochasticPirate avatar Jul 07 '22 07:07 StochasticPirate

I am running into a similar issue, but for the AWS/Transfer service. What I observed is that until I push a file to the SFTP server, I keep getting these warnings for all the metrics under the AWS/Transfer namespace. As soon as I push something to the Transfer service and CloudWatch has data to show, the warnings disappear.

- aws_namespace: AWS/Transfer
  aws_metric_name: BytesIn
  aws_dimensions: [ServerId]
  aws_tag_select:
   tag_selections:
     "tl:prefix": ["dev"]
   resource_type_selection: "transfer:server"
   resource_id_dimension: ServerId
  aws_statistics: [Sum]
  use_get_metric_data: true
  period_seconds: 360

- aws_namespace: AWS/Transfer
  aws_metric_name: BytesOut
  aws_dimensions: [ServerId]
  aws_tag_select:
   tag_selections:
     "tl:prefix": ["dev"]
   resource_type_selection: "transfer:server"
   resource_id_dimension: ServerId
  aws_statistics: [Sum]
  use_get_metric_data: true
  period_seconds: 540

- aws_namespace: AWS/Transfer
  aws_metric_name: FilesIn
  aws_dimensions: [ServerId]
  aws_tag_select:
   tag_selections:
     "tl:prefix": ["dev"]
   resource_type_selection: "transfer:server"
   resource_id_dimension: ServerId
  aws_statistics: [Sum]
  use_get_metric_data: true
  period_seconds: 420

- aws_namespace: AWS/Transfer
  aws_metric_name: FilesOut
  aws_dimensions: [ServerId]
  aws_tag_select:
   tag_selections:
     "tl:prefix": ["dev"]
   resource_type_selection: "transfer:server"
   resource_id_dimension: ServerId
  aws_statistics: [Sum]
  use_get_metric_data: true
  period_seconds: 480

- aws_namespace: AWS/Transfer
  aws_metric_name: OnUploadExecutionsStarted
  aws_dimensions: [ServerId]
  aws_tag_select:
   tag_selections:
     "tl:prefix": ["dev"]
   resource_type_selection: "transfer:server"
   resource_id_dimension: ServerId
  aws_statistics: [Sum]
  use_get_metric_data: true
  period_seconds: 600

- aws_namespace: AWS/Transfer
  aws_metric_name: OnUploadExecutionsFailed
  aws_dimensions: [ServerId]
  aws_tag_select:
   tag_selections:
     "tl:prefix": ["dev"]
   resource_type_selection: "transfer:server"
   resource_id_dimension: ServerId
  aws_statistics: [Sum]
  use_get_metric_data: true
  period_seconds: 660

- aws_namespace: AWS/Transfer
  aws_metric_name: OnUploadExecutionsSuccess
  aws_dimensions: [ServerId]
  aws_tag_select:
    tag_selections:
      "tl:prefix": ["dev"]
    resource_type_selection: "transfer:server"
    resource_id_dimension: ServerId
  aws_statistics: [Sum]
  use_get_metric_data: true
  period_seconds: 720

But even then, if any of these metrics has no data, we still get the warning for that particular metric. For example, in my case, if there are no errors in the Transfer service, there is no data in the OnUploadExecutionsFailed metric, and I keep getting these warnings:

2022-07-13T09:37:32+05:30 Jul 13, 2022 4:07:32 AM io.prometheus.cloudwatch.CloudWatchCollector listDimensions
2022-07-13T09:37:32+05:30 WARNING: (listDimensions) ignoring metric AWS/Transfer:OnUploadExecutionsFailed due to dimensions mismatch

@or-shachar Any suggestions on how to resolve this?

avizvaRumit avatar Jul 13 '22 04:07 avizvaRumit

Interesting... I guess that's exactly the case I was curious about when I added the warning feature.

In my experience missing dimensions == some kind of configuration error that needs to be reported to the logs.

But IIUC there are some cases where the metric is not reported continuously and then the metric would be missing by design.

We can maybe add a flag to suppress those warnings for certain metrics. WDYT? @matthiasr

or-shachar avatar Jul 13 '22 16:07 or-shachar

We are also hitting this "issue". Our YAML has over 3500 lines, and we have hundreds of AWS accounts, each one with a unique combination of services. On each one we get different warnings (probably because different services are in use).

krzysztof-magosa avatar Jul 14 '22 16:07 krzysztof-magosa

@or-shachar Also, is there a way to stop getting these warnings?

WARNING: CloudWatch scrape failed software.amazon.awssdk.services.cloudwatch.model.CloudWatchException: Rate exceeded (Service: CloudWatch, Status Code: 400, Request ID: aaa68484-bf4e-4fca-8d82-93d53b9134e2)

I have set a different period_seconds for each metric but still get this warning. Any suggestions?

avizvaRumit avatar Jul 14 '22 16:07 avizvaRumit

I think we will have to release a new version with the warning feature turned off until we can figure this out...

I'll try to get a PR ready for this during the weekend

or-shachar avatar Jul 14 '22 18:07 or-shachar

@or-shachar Will this resolve the warning below as well?

WARNING: CloudWatch scrape failed software.amazon.awssdk.services.cloudwatch.model.CloudWatchException: Rate exceeded (Service: CloudWatch, Status Code: 400, Request ID: aaa68484-bf4e-4fca-8d82-93d53b9134e2)

Or could you please suggest anything that might help resolve it?

avizvaRumit avatar Jul 15 '22 12:07 avizvaRumit

@avizvaRumit This is something else, related to CloudWatch service quotas, and probably unrelated to this issue.

I suggest looking in the official CloudWatch docs, or starting a different thread for it. If you do, please describe the number of metrics you are scraping and whether you use the use_get_metric_data option 🙏

or-shachar avatar Jul 15 '22 18:07 or-shachar

Sorry to bug you about this. Are there any plans to move forward with this?

szymonpk avatar Aug 24 '22 07:08 szymonpk

Up! I have the same problem with the AWS/ApplicationELB:HTTPCode_Target_5XX_Count metric:

WARNING: (listDimensions) ignoring metric AWS/ApplicationELB:HTTPCode_Target_5XX_Count due to dimensions mismatch

Other metrics like HTTPCode_Target_4XX_Count, HTTPCode_Target_3XX_Count... work fine.

alexei-bykovski avatar Oct 13 '22 09:10 alexei-bykovski

Hi, my sincere apologies... been sucked into other stuff. I hope I'll get to finalize the work on it this weekend 🙏

or-shachar avatar Oct 13 '22 13:10 or-shachar

The same issue also occurs for the AWS/ApplicationELB:HTTPCode_ELB_5XX_Count metric.

alexei-bykovski avatar Oct 13 '22 19:10 alexei-bykovski

Yeah, same issue here. I just assumed it was because there were no 5XX metrics to collect. But, I guess it seems to be more than that.

joebelford avatar Oct 14 '22 20:10 joebelford

So, I dug into this a bit more this weekend, and I think this is just a log message question, as discussed earlier in the thread. Once I was able to generate 5XX errors, so that CloudWatch had data for these metrics, they made their way to Prometheus with no problems. For reference, I'm using OpenJDK 11 and cloudwatch_exporter-0.15.0-jar-with-dependencies.jar.

joebelford avatar Oct 17 '22 12:10 joebelford

I've completed this PR, which opts out of this warning message. Our first assumption was that ListMetrics must always return some metrics, but there are cases where this isn't true, and in those cases the warning is redundant.

or-shachar avatar Oct 18 '22 17:10 or-shachar

As of 0.15.1, this warning is now configurable.

matthiasr avatar Oct 25 '22 08:10 matthiasr