[BUG] large amount of SQS queues causes missing metrics

Open OliverKlette85 opened this issue 4 years ago • 16 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

We have around 9.5k SQS queues in the eu-west-1 region of one of our prod accounts, but the YACE exporter only provides metrics for around 5k of them.

I already tried to run several YACE instances in parallel:

  • split by SQS metrics
  • split by searchTags

Neither improved the situation. I also requested an increase of the AWS quotas for GetMetricData (1000 per second) and ListMetrics (100 per second) requests, and according to AWS monitoring we are far from reaching them.

In the YACE debug log I couldn't find any entries which explain the missing metrics.

Expected Behavior

The exporter should provide metrics for all SQS queues (this worked with the official CloudWatch exporter).

Steps To Reproduce

config:

extraArgs:
  scraping-interval: 120
  debug: true

config: |-
  discovery:
    exportedTagsOnMetrics:
      sqs:
        - dh_app
        - dh_country
        - dh_env
        - dh_platform
        - dh_region
        - dh_squad
        - dh_tribe
    jobs:
    - type: sqs
      regions:
        - eu-west-1
      delay: 600
      period: 120
      length: 120
      awsDimensions:
        - QueueName
      metrics:
        - name: ApproximateAgeOfOldestMessage
          statistics:
          - Average

Anything else?

No response

OliverKlette85 avatar Dec 02 '21 16:12 OliverKlette85

Wow, that's a huge number of SQS queues 🎉

We need more debug logging to find the error here. I will try to add it in the next seven days.

Alternatively, provide me with A SEPARATE AWS account that has 6k SQS queues already created so I can debug this:

Cross account sharing via: arn:aws:iam::838758336246:user/debug-yace-489

Don't forget to add the permissions for this user as well:

"tag:GetResources",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics"

Will debug it in the next 7 days.

thomaspeitz avatar Dec 03 '21 14:12 thomaspeitz

Hi Thomas, thanks for your quick reaction.

I created a role with the desired permissions and a trust policy for your user in our stg account:

arn:aws:iam::487596255802:role/yace_debug

This account has actually over 12 k SQS queues in the eu-west-1 region :)

Please ping if you need anything from my side.

OliverKlette85 avatar Dec 05 '21 20:12 OliverKlette85

{"level":"info","msg":"Couldn't describe resources for region eu-west-1: AccessDeniedException: User: arn:aws:sts::487596255802:assumed-role/yace_debug/1638822531663485000 is not authorized to perform: tag:GetResources because no identity-based policy allows the tag:GetResources action\n\tstatus code: 400, request id: f6e26e95-0a6d-4632-b2a6-052425ceeeff\n","time":"2021-12-06T21:28:52+01:00"}

Could you double-check on your side that everything is configured correctly? It seems I can switch into the role successfully, but the permissions are missing:

Don't forget to add the permissions for this user as well:

"tag:GetResources",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics"

thomaspeitz avatar Dec 06 '21 20:12 thomaspeitz

Yeah, I set these permissions. Maybe the issue was caused by the condition I set at the policy level. I moved it to the trust relationship. Could you please test again?

OliverKlette85 avatar Dec 07 '21 10:12 OliverKlette85

Okay, I see the issue. Now I have something to debug with. Had two small things in mind which were not the issue.

Need to dig deeper.

Greetings, Thomas :)

thomaspeitz avatar Dec 08 '21 17:12 thomaspeitz

Thanks for the efforts! Please let me know if you need something from my side.

Greetings from Berlin :)

OliverKlette85 avatar Dec 08 '21 20:12 OliverKlette85

Greetings back from Aachen (at the moment).

FYI: I was not able to put much time into it (and have not found anything yet), and due to private matters I will only work on this again at the end of this week. It would be nice if you kept the role active so I can debug further. I still don't understand what is happening there. I was expecting pagination bugs, but that does not seem to be the problem.

thomaspeitz avatar Dec 13 '21 08:12 thomaspeitz

Thanks for the update. I will keep the role active.

OliverKlette85 avatar Dec 13 '21 12:12 OliverKlette85

Hi @thomaspeitz did you manage to have another look?

OliverKlette85 avatar Dec 20 '21 09:12 OliverKlette85

Hi! We're facing the same issue, we have around 350 queues and some of them are entirely ignored. Reverting back to old cloudwatch exporter fixes the issue. We're using version: v0.28.0-alpha

endyrocket avatar Dec 22 '21 16:12 endyrocket

Hi, we are having the same issue, happy to see that it was reported already :)

eusokolov avatar Dec 27 '21 12:12 eusokolov

Sorry, I was on vacation. Back again.

@OliverKlette85 if you still have the IAM role configured, I will take another look at this topic this week.

thomaspeitz avatar Jan 16 '22 13:01 thomaspeitz

Yes it is still active.

OliverKlette85 avatar Jan 17 '22 14:01 OliverKlette85

That's worth gold to know: "Reverting back to old cloudwatch exporter fixes the issue. We're using version: v0.28.0-alpha" @endyrocket, thanks! It makes this easier to debug.

Awesome, thank you @OliverKlette85. I will probably work on that Thursday / Friday.

Currently my active work on the project is cut to 2h a week because the project generates no revenue. I will try my best to fix this; it is at the top of the list of things to fix.

thomaspeitz avatar Jan 17 '22 14:01 thomaspeitz

I've been experiencing the same issue and I think I found a data point. All of my missing SQS queues don't have any tags applied to them. As soon as I applied one tag (anything), a few minutes later the metrics would start showing up in YACE.

I think the issue is that this API call to resourcegroupstaggingapi/get-resources used to return all SQS-type resources, but now AWS is only returning those that have been tagged.

{"ResourceTypeFilters":["sqs"],"ResourcesPerPage":100}

Maybe there's another way to get the list of resources to query? Or just tag all of your SQS queues with something arbitrary.
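To check whether an account is affected, one option is to compare the queues that SQS itself lists with the ones the tagging API returns. The following is a minimal sketch, assuming boto3 clients for `sqs` and `resourcegroupstaggingapi` are passed in; the helper names are made up for illustration and are not part of YACE.

```python
from typing import Iterable, List


def queue_name_from_url(queue_url: str) -> str:
    """The queue name is the last path segment of an SQS queue URL."""
    return queue_url.rstrip("/").rsplit("/", 1)[-1]


def queue_name_from_arn(arn: str) -> str:
    """The queue name is the last colon-separated field of an SQS ARN."""
    return arn.rsplit(":", 1)[-1]


def find_untagged_queues(queue_urls: Iterable[str], tagged_arns: Iterable[str]) -> List[str]:
    """Names of queues that SQS lists but the tagging API did not return."""
    tagged = {queue_name_from_arn(arn) for arn in tagged_arns}
    listed = {queue_name_from_url(url) for url in queue_urls}
    return sorted(listed - tagged)


def list_all_queue_urls(sqs_client) -> List[str]:
    """Collect every queue URL in the client's region via ListQueues."""
    paginator = sqs_client.get_paginator("list_queues")
    urls: List[str] = []
    # PageSize is sent as MaxResults; without it ListQueues returns at most
    # 1000 queues and no NextToken, so larger accounts would be truncated.
    for page in paginator.paginate(PaginationConfig={"PageSize": 1000}):
        urls.extend(page.get("QueueUrls", []))
    return urls


def list_tagged_sqs_arns(tagging_client) -> List[str]:
    """Collect the SQS ARNs returned by the same GetResources filter quoted above."""
    paginator = tagging_client.get_paginator("get_resources")
    arns: List[str] = []
    for page in paginator.paginate(ResourceTypeFilters=["sqs"], ResourcesPerPage=100):
        arns.extend(m["ResourceARN"] for m in page.get("ResourceTagMappingList", []))
    return arns


# Example wiring (requires boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   sqs = boto3.client("sqs", region_name="eu-west-1")
#   tagging = boto3.client("resourcegroupstaggingapi", region_name="eu-west-1")
#   print(find_untagged_queues(list_all_queue_urls(sqs), list_tagged_sqs_arns(tagging)))
```

If the resulting list roughly matches the queues missing from the exporter, that would support the untagged-queue explanation; it is only a rough check, not a statement about how YACE discovers resources internally.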

mmanjos avatar Jul 19 '22 16:07 mmanjos

@mmanjos I can confirm that adding tags to the SQS queues solves this issue.

I had the same problem with my queues not being visible in the exported metrics. It turned out those queues had 0 tags on them. After adding an arbitrary tag I was able to query the metric.
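For anyone applying this workaround to many queues at once, here is a minimal sketch of bulk tagging. It assumes a boto3 SQS client is passed in and that the caller has the `sqs:TagQueue` permission; the function name and the example tag are placeholders, not something the exporter requires.

```python
from typing import Dict, Iterable, List


def tag_queues(sqs_client, queue_urls: Iterable[str], tags: Dict[str, str]) -> List[str]:
    """Apply the same tags to every queue URL and return the URLs that were tagged."""
    tagged: List[str] = []
    for url in queue_urls:
        # TagQueue adds the given tags; a tag with an existing key is overwritten.
        sqs_client.tag_queue(QueueUrl=url, Tags=tags)
        tagged.append(url)
    return tagged


# Example wiring (requires boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   sqs = boto3.client("sqs", region_name="eu-west-1")
#   tag_queues(sqs, ["https://sqs.eu-west-1.amazonaws.com/123456789012/orders"],
#              {"monitored": "true"})  # placeholder tag key and value
```

As noted above, it can take a few minutes after tagging before the metrics show up in YACE.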

nickyfoster avatar Nov 16 '23 21:11 nickyfoster