workload-discovery-on-aws
"Import is already running."
Describe the bug
I followed the deployment guide on the web, and all seemed ok, but about 24 hours ago discovery abruptly began failing on all my target accounts. The error message is cryptic but reasonably descriptive:
{ "message": "Stopping import as import is already running.", "level": "info", "timestamp": "2021-09-23T15:16:25.503Z" }
Deleting and re-importing the accounts and regions did not help.
This seems to stop the discovery process cold. I don't know what caused it, so I'm not sure how to reproduce it.
This happens across multiple browsers; I don't believe it's browser-related. aws-perspective-error-log.txt
I'm just getting started with Perspective, so if the answer is "tear it down and do it again," that's ok.
When you go into the ECS console, is there a task still running? The discovery process does a check at the start and won't run if another one is still in progress. Also, can you give me a rough idea of how many resources are in these accounts? Did any resources get discovered?
When I look at the cluster in the ECS console, there is one Fargate task running, but it started this morning at 8:30.
I was successful in discovering maybe 1500 resources. I only tried to discover resources from a couple of target accounts.
I'm sure I interrupted some running process, I'm just not sure what.
Did you add any new accounts just before it stopped working? The reason I ask is that we have seen issues with rate limiting from AWS Config when multiple accounts are added at the same time. It seems that if Perspective tries to add around 3,000 resources all at once, the rate limiting kicks in.
I may have; I added one new account but am not sure of the exact time.
I am almost sure it's this issue. Can you remove the last account you added, kill the ECS task, and then see if it runs again? Don't uninstall any of the CloudFormation you deployed (i.e., global-resources.template and regional-resources.template); just remove the account/region from the UI in Perspective.
OK, I removed the account and stopped the task. It's moved to "pending" now.
Same error message.
Are there errors in the logs for the ECS task that is still running or has it just hung?
The ECS task is running, but the only log entry is identical to the one I originally supplied above. I'm not sure which check for something already running is failing.
Ah sorry, I thought there was one task running that was working and others were failing with that error. So are you saying that the task logs that error but continues to run? I've not seen this before. The discovery process makes a call to the ECS listTasks API to see if any other tasks are running and fails if there's already a copy running, so I don't understand how this check is getting triggered if the task producing the error is the only one running. What output do you get if you run the following CLI command:
aws ecs list-tasks --cluster aws-perspective-<account-number>-cluster
$ aws ecs list-tasks --cluster aws-perspective-300128025978-cluster
{
    "taskArns": [
        "arn:aws:ecs:us-east-1:300128025978:task/aws-perspective-300128025978-cluster/4c15c0cd0e2845f7831041450ad0842f",
        "arn:aws:ecs:us-east-1:300128025978:task/aws-perspective-300128025978-cluster/760ea23e97054676ae8dd79067f32a5f"
    ]
}
So there do appear to be two tasks running.
Maybe just go into the ECS console and click Stop All; that should catch any that are still running.
Also, if you alter the task definition to add an environment variable called LOG_LEVEL and set it to DEBUG, we can get more info the next time it runs.
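If it's easier to do from the CLI than the console, something along these lines should work; the task definition family name below is just a placeholder, so substitute whatever yours is actually called:
# Pull down the current task definition (family name is a placeholder)
aws ecs describe-task-definition --task-definition <discovery-task-definition-family> \
    --query taskDefinition > task-def.json
# Edit task-def.json: add {"name": "LOG_LEVEL", "value": "DEBUG"} to
# containerDefinitions[0].environment and remove the read-only fields
# (taskDefinitionArn, revision, status, requiresAttributes, compatibilities,
# registeredAt, registeredBy), then register it as a new revision
aws ecs register-task-definition --cli-input-json file://task-def.json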
I stopped all, but it only reported on stopping the second one.
So is the other one still running? Maybe try to kill them using the CLI?
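Using the ARNs from your list-tasks output above, something like this should do it:
aws ecs stop-task --cluster aws-perspective-300128025978-cluster \
    --task arn:aws:ecs:us-east-1:300128025978:task/aws-perspective-300128025978-cluster/4c15c0cd0e2845f7831041450ad0842f
aws ecs stop-task --cluster aws-perspective-300128025978-cluster \
    --task arn:aws:ecs:us-east-1:300128025978:task/aws-perspective-300128025978-cluster/760ea23e97054676ae8dd79067f32a5f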
So both tasks ended up getting stopped. The restarted task ran successfully (as far as I can tell; the "last scanned" column in the web UI region list remains "N/A") at 11:45 and 12:00.
However, the 12:15 run reverted to the same error message, though there is only one task running according to the cli.
I was not at the computer then, so I know I didn't change any of the parameters. :-)
This is very strange, what region are you running Perspective in?
us-east-1
OK. I deleted the entire stack (which left the old ECS task definition in place, just so you know -- you have to deregister it after you reinstall and a new task definition is created), reinstalled it, and tried to add discovery for resources from three AWS accounts. That failed after a few discovery cycles; it seems that the throttling phenomenon you suspected was on the mark.
I killed all the running ECS tasks, removed the regions from all but one AWS account, and the next discovery ran without reporting errors in cloudwatch.
It was our sandbox account, without that many resources (337). If that runs successfully the rest of today, I'll add one more account and see what happens there.
Yeah, ~300 resources is fine; I normally test in an account with ~2k, and it's when you get to around double that that this throttling stuff seems to happen.
If that number of resources is fine, then there is some problem with state management or error handling.
This just happened this morning. I cleaned things up, killed running tasks, etc. The first discovery after I did that interrogated one account for two regions. The region with (many) fewer resources finished successfully after about three minutes:
{ "level": "info", "message": "<<<<<<<<<< import of config complete for account: 139212093688 region: us-east-2 finished!", "timestamp": "2021-09-28T13:48:55.060Z" }
The region with the bulk of the resources threw a few errors like this:
"(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict
(see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 7)"
"(node:1) UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer, ArrayBuffer, or Array or an Array-like Object. Received undefined"
resulting in
{ "level": "error", "message": "Error Message: You have specified a resource that is either unknown or has not been discovered.", "timestamp": "2021-09-28T13:48:51.890Z" }
and then goes off into never-never land.
That task is still running, and nothing else runs while it is.
After some time, you just get a stream of error messages:
{ "level": "error", "message": "Error Message: The security token included in the request is expired", "timestamp": "2021-09-28T15:13:37.497Z" }
maybe once every 45 seconds to a minute.
Finally, after about two hours, the first region completes:
{ "level": "info", "message": "<<<<<<<<<< import of config complete for account: 139212093688 region: us-east-1 finished!", "timestamp": "2021-09-28T15:46:55.303Z" }
It then attempts to go off and process data from other accounts -- even though I removed them from the discovered regions using the Web UI -- removes them from the ElasticSearch store, and finishes about 15 minutes later.
The discovery process eventually recovers, in the sense that another task starts running. I guess it's working, in that the lengthy discovery processes eventually complete with (at this moment) 854 unique resources -- including resources that don't belong to the account being discovered. I guess the discovery process didn't remove them from the database?
There's still something weird going on here; it shouldn't take 2 hours to discover 850 resources. Regarding the other accounts that have been removed in the UI, I wonder if there are still resources hanging around in the Config aggregator (it will be called aws-perspective-<region>-<account-no>-aggregator). Can you check in the Config console and see if it still has the other accounts? Maybe the call to Config to remove them failed.
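If the console is a pain, you should be able to see the same information from the CLI (aggregator name as above):
aws configservice describe-configuration-aggregators
aws configservice describe-configuration-aggregator-sources-status \
    --configuration-aggregator-name aws-perspective-<region>-<account-no>-aggregator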
Yes, the aggregator "knows" about three accounts and it's going off and trying to update resources in a ton of regions: me-south-1, eu-west-3, us-west-1, etc. I have no idea why it's trying those regions.
So, looking at the cfn that sets up the aggregator in the account where Perspective lives: https://github.com/awslabs/aws-perspective/blob/8a75b2e5314f57a22556df51b5dd9191574e68b3/source/cfn/templates/zoom-import-and-aggregator.yaml#L18 it looks like the AllRegions option is enabled, but that doesn't explain why it's enabled for the other accounts. It should only affect the Perspective account.
If you edit the aggregator to remove the extra regions, that should definitely help with the throttling, especially if you have a bunch of resources in those other regions you don't care about. Also, we can look at changing that cfn template to only use the region Perspective is installed in initially.
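For reference, the CLI equivalent would be something like the following; put-configuration-aggregator replaces the aggregator's sources, so the account ID and regions below are placeholders for the ones you actually want to keep:
aws configservice put-configuration-aggregator \
    --configuration-aggregator-name aws-perspective-<region>-<account-no>-aggregator \
    --account-aggregation-sources '[{"AccountIds": ["<account-id>"], "AwsRegions": ["us-east-1", "us-east-2"], "AllAwsRegions": false}]'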
I rebuilt with only one account, two regions, removed the rest of the unused regions from the aggregator, and a discovery import now takes about 30-45 minutes to discover approximately 350 resources.
I get a ton of error messages from the task that look like
{ "level": "error", "message": "Error linking 9b140fac7eab812d19be7a6b312938ee to e045b5fa500e03bbc628012018710c2e with label AWS::TAGS::TAG", "timestamp": "2021-09-29T16:38:03.723Z" }
so that might be extending the time it takes.
Can you grab the whole logs for that discovery task that took 30-45 mins and attach them here? Maybe something will jump out at me. You'll probably have to get them directly from the CloudWatch Logs console rather than the ECS one.
It ran from 12:01:16 to 12:38:07. Is there an easy way to download all events from that one log stream?
There are 3900 events in less than a minute of elapsed time, and I got tired of hitting "load newer events".
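You should be able to pull the whole stream with the CLI rather than paging through the console; the log group and stream names below are placeholders for whatever the discovery task actually writes to:
# filter-log-events pages through the whole stream automatically
aws logs filter-log-events --log-group-name <discovery-log-group> \
    --log-stream-names <log-stream-name> \
    --output text > discovery-run.log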