workload-discovery-on-aws
Errors importing Regions into Perspective
Describe the bug
I've set up Perspective in one account and used StackSets to push out the global and regional resources to the other accounts in our Org.
Initially I imported 2 accounts/regions and things looked good.
I then asked Perspective to import the additional 20 or so accounts in our Org, and nothing seemed to happen. Except for the 2 original accounts, the "Last Scanned" column in the Regions table never changed from "N/A". On further digging, I found that the ECS scheduled task appeared to be hanging.
To Reproduce
I've seen other issues mention problems bulk importing regions, so I have:
- Removed the failed Regions.
- Stopped the ECS discovery task.
- Added 1 additional region at a time.
I've had mixed success with this (I now have 4 regions successfully imported). However, I can't get regions to import consistently, and the 4 regions I have are currently not being updated by the scheduled task.
I have updated the ECS task and set LOG_LEVEL to DEBUG and attached the latest log events from the task.
Because of the way the discovery process works, when you add a large number of accounts at once, the calls made to AWS Config can get throttled, and the retries with exponential backoff may end up hanging the whole process. This is mentioned in the documentation here with remediation steps. We have also recently noticed that if a single account has ~3k resources in it, the discovery process can hang too. Unfortunately, the fix for this will require a rewrite of how AWS Perspective uses Config. We are currently in the early stages of this refactor, but I can't give a date for when it will be complete just yet.
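To illustrate why retries with exponential backoff can look like a hang, here is a minimal Python sketch of a retry loop that caps both the delay and the number of attempts, so a throttled call eventually fails instead of waiting indefinitely. This is not Perspective's actual code: `ThrottlingError`, `call_with_backoff`, and all default values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for a ThrottlingException returned by AWS Config."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep):
    """Call fn(), retrying on ThrottlingError with capped, jittered backoff.

    The delay ceiling doubles on each attempt (base_delay * 2**attempt) but
    never exceeds max_delay. After max_attempts failed calls the error is
    re-raised, so the caller sees a failure rather than an endless wait.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # give up: surface the throttling error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

Without a limit on attempts or on the delay, each throttled call waits longer and longer, and with thousands of calls in flight the task as a whole can appear to stop making progress.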
Hi. Many thanks for looking into this.
It's been 36 hours or so since I removed the regions that I added in bulk and reverted to just having the original accounts, which were successfully imported. All of those accounts have fewer than 2000 Config resources. I am still not seeing consistent successful scans though. The last successful scan was 4 hours ago. (I should be seeing a successful scan every 15 minutes, right?)
The scan ECS task from that time is still running. I've attached the logs. I can see ThrottlingExceptions in there, but should I still be getting throttled 36 hours later when now all I am doing is updating successfully imported regions?
1637127990974,"{""level"":""debug"",""message"":""Error code for listAggregateDiscoveredResources: ThrottlingException"",""timestamp"":""2021-11-17T05:46:30.974Z""}"
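For anyone wanting to check how often throttling shows up in an exported log like the one above, here is a rough Python sketch. It assumes each line has the same shape as the sample: an epoch-millisecond timestamp, a comma, then a CSV-quoted JSON object with a `message` field. The function name `count_throttles` is made up for this example.

```python
import csv
import io
import json

# The log line quoted above, used as sample input.
SAMPLE = (
    '1637127990974,"{""level"":""debug"",""message"":""Error code for '
    'listAggregateDiscoveredResources: ThrottlingException"",'
    '""timestamp"":""2021-11-17T05:46:30.974Z""}"\n'
)


def count_throttles(text):
    """Count log events whose message mentions ThrottlingException."""
    count = 0
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue  # not a "timestamp,message" row
        try:
            event = json.loads(row[1])  # csv has already undoubled the quotes
        except json.JSONDecodeError:
            continue  # header row or non-JSON message
        if isinstance(event, dict) and "ThrottlingException" in event.get("message", ""):
            count += 1
    return count


print(count_throttles(SAMPLE))
```

Rows that are not two-column CSV, or whose second column is not JSON, are skipped rather than treated as errors.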
This issue is a scaling limitation of v1.x.x, where the discovery process made an API call to AWS Config for every resource it found. For regions with ~2500 resources, this meant Config would start rate limiting these calls, and eventually the whole discovery process would hang and not complete. We have addressed this issue in v2.0.0, which makes far fewer calls to Config when discovering resources. This new version was released today, and the 2500 resource limit no longer applies.
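As a back-of-the-envelope illustration of why one call per resource scales poorly, the Python sketch below compares the number of requests needed for 2500 resources when fetching them one at a time versus in batches. The batch size of 100 is an assumption picked for the arithmetic only; the thread does not say how v2.0.0 actually groups its calls, and `calls_needed` is a hypothetical helper.

```python
import math


def calls_needed(resource_count, batch_size=1):
    """Number of API calls needed to fetch resource_count items.

    batch_size=1 models the v1.x.x behaviour described above (one call per
    resource). Larger values are purely illustrative.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(resource_count / batch_size)


per_resource = calls_needed(2500)            # one call per resource
batched = calls_needed(2500, batch_size=100)  # assumed batch size of 100
print(per_resource, batched)
```

Under these assumptions the request count drops from 2500 to 25, which is the kind of reduction that keeps a scan under a rate limit.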