workload-discovery-on-aws

"Import is already running."

Open cpr-case opened this issue 3 years ago • 38 comments

Describe the bug

I followed the deployment guide on the web, and all seemed ok, but about 24 hours ago discovery abruptly began failing on all my target accounts. The error message is cryptic but reasonably descriptive:

{ "message": "Stopping import as import is already running.", "level": "info", "timestamp": "2021-09-23T15:16:25.503Z" }

Deleting and re-importing the accounts and regions did not help.

This seems to stop the discovery process cold. I don't know what caused it, so I'm not sure how to reproduce it.

This happens across multiple browsers; I don't believe it's browser-related. aws-perspective-error-log.txt

I'm just getting started with Perspective, so if the answer is "tear it down and do it again," that's ok.

cpr-case avatar Sep 24 '21 13:09 cpr-case

When you go into the ECS console, is there a task still running? The discovery process does a check at the start and won't run if another one is still in progress. Also, can you give me a rough idea of how many resources are in these accounts? Did any resources get discovered?
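Roughly speaking, that startup check is equivalent to the following (a sketch, not the actual discovery code; the cluster name pattern is the same one used later in this thread):

# Count the tasks running in the discovery cluster and bail out if another
# import is already in progress.
RUNNING=$(aws ecs list-tasks \
  --cluster aws-perspective-<account-number>-cluster \
  --desired-status RUNNING \
  --query 'length(taskArns)' --output text)

if [ "$RUNNING" -gt 1 ]; then
  echo "Stopping import as import is already running."
fi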

svozza avatar Sep 24 '21 13:09 svozza

When I look at the cluster in the ECS console, there is one fargate task running, but it started this morning at 8:30.

I was successful in discovering maybe 1500 resources. I only tried to discover resources from a couple of target accounts.

I'm sure I interrupted some running process, I'm just not sure what.

cpr-case avatar Sep 24 '21 13:09 cpr-case

Did you add any new accounts just before it stopped working? The reason I ask is that we have seen issues with rate limiting from AWS Config if you add multiple accounts at the same time. It seems that if Perspective tries to add around 3,000 resources all at once, the rate limiting kicks in.

svozza avatar Sep 24 '21 13:09 svozza

I may have; I added one new account but am not sure of the exact time.

cpr-case avatar Sep 24 '21 13:09 cpr-case

I am almost sure it's this issue. Can you remove the last account you added, kill the ECS task, and then see if it runs again? Don't uninstall any of the CloudFormation you installed (i.e., global-resources.template and regional-resources.template); just remove the account/region from the UI in Perspective.

svozza avatar Sep 24 '21 13:09 svozza

OK, I removed the account and stopped the task. It's moved to "pending" now.

cpr-case avatar Sep 24 '21 14:09 cpr-case

Same error message.

cpr-case avatar Sep 24 '21 14:09 cpr-case

Are there errors in the logs for the ECS task that is still running or has it just hung?

svozza avatar Sep 24 '21 14:09 svozza

The ecs task is running, but the only log entry is identical to the one I originally supplied above. I'm not sure which check for something already running is failing.

cpr-case avatar Sep 24 '21 14:09 cpr-case

Ah sorry, I thought there was a task running that was working and other ones were failing with that error. So are you saying that the task has that error but continues to run? I've not seen this before; the discovery process makes a call to ECS ListTasks to see if any other tasks are running and fails if there's already a copy running. I don't understand how this check is getting triggered if the task producing the error is the only one running. What output do you get if you run the following CLI command:

aws ecs list-tasks --cluster aws-perspective-<account-number>-cluster

svozza avatar Sep 24 '21 15:09 svozza

$ aws ecs list-tasks --cluster aws-perspective-300128025978-cluster
{
    "taskArns": [
        "arn:aws:ecs:us-east-1:300128025978:task/aws-perspective-300128025978-cluster/4c15c0cd0e2845f7831041450ad0842f",
        "arn:aws:ecs:us-east-1:300128025978:task/aws-perspective-300128025978-cluster/760ea23e97054676ae8dd79067f32a5f"
    ]
}

So there do appear to be two tasks running.

cpr-case avatar Sep 24 '21 15:09 cpr-case

Maybe just go into the ECS console and click Stop All; that should catch any tasks that are still running.

svozza avatar Sep 24 '21 15:09 svozza

Also, if you alter the task definition to add an environment variable called LOG_LEVEL and set it to DEBUG we can get more info the next time it runs.
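If you'd rather do that from the CLI than the console, the usual dance is roughly the following (the task definition family name here is a placeholder; check the ECS console for the real one):

# dump the active task definition to a file
aws ecs describe-task-definition \
  --task-definition <discovery-task-family> \
  --query 'taskDefinition' > taskdef.json

# edit taskdef.json: add {"name": "LOG_LEVEL", "value": "DEBUG"} to
# containerDefinitions[0].environment, and delete the read-only fields
# (taskDefinitionArn, revision, status, requiresAttributes, compatibilities, etc.)

# register the edited definition as a new revision
aws ecs register-task-definition --cli-input-json file://taskdef.json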

svozza avatar Sep 24 '21 15:09 svozza

I stopped all, but it only reported on stopping the second one.

cpr-case avatar Sep 24 '21 15:09 cpr-case

So is the other one still running? Maybe try to kill them using the cli?
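Something along these lines should clear out everything in the cluster (cluster name taken from your earlier output):

CLUSTER=aws-perspective-300128025978-cluster
# stop every task currently in the cluster
for TASK in $(aws ecs list-tasks --cluster "$CLUSTER" --query 'taskArns[]' --output text); do
  aws ecs stop-task --cluster "$CLUSTER" --task "$TASK" --reason "clearing stuck discovery run"
done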

svozza avatar Sep 24 '21 16:09 svozza

So both tasks ended up getting stopped. The restarted task ran successfully (as far as I can tell; the "last scanned" column in the web UI region list remains "N/A") at 11:45 and 12:00.

However, the 12:15 run reverted to the same error message, though there is only one task running according to the cli.

I was not at the computer then, so I know I didn't change any of the parameters. :-)

cpr-case avatar Sep 24 '21 17:09 cpr-case

This is very strange, what region are you running Perspective in?

svozza avatar Sep 24 '21 18:09 svozza

us-east-1

cpr-case avatar Sep 24 '21 18:09 cpr-case

OK. I deleted the entire stack (which left the ECS task definition provisioned, just so you know -- you have to deregister the old one after you reinstall and a new task definition is created), reinstalled it, and tried to add discovery for resources from three AWS accounts. That failed after a few discovery cycles; it seems the throttling phenomenon you suspected was on the mark.
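(For reference, cleaning up that stale revision is presumably just something like the following, with the family and revision filled in from the ECS console:)

# deregister the leftover task definition revision from the old install
aws ecs deregister-task-definition --task-definition <old-task-family>:<revision>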

I killed all the running ECS tasks, removed the regions from all but one AWS account, and the next discovery ran without reporting errors in cloudwatch.

It was our sandbox account, without that many resources (337). If that runs successfully the rest of today, I'll add one more account and see what happens there.

cpr-case avatar Sep 27 '21 20:09 cpr-case

Yeah, ~300 resources is fine. I normally test in an account with ~2k; it's when you get to roughly double that that this throttling stuff seems to happen.

svozza avatar Sep 28 '21 10:09 svozza

If that number of resources is fine, then there is some problem with state management or error handling.

This just happened this morning. I cleaned things up, killed running tasks, etc. The first discovery after I did that interrogated one account for two regions. The region with (many) fewer resources finished successfully after about three minutes:

{ "level": "info", "message": "<<<<<<<<<< import of config complete for account: 139212093688 region: us-east-2 finished!", "timestamp": "2021-09-28T13:48:55.060Z" }

The region with the bulk of the resources threw a few errors like this:

"(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 7)"

"(node:1) UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer, ArrayBuffer, or Array or an Array-like Object. Received undefined"

resulting in

{ "level": "error", "message": "Error Message: You have specified a resource that is either unknown or has not been discovered.", "timestamp": "2021-09-28T13:48:51.890Z" }

and then the task goes off into never-never land.

That task is still running, and nothing else runs while it is.

cpr-case avatar Sep 28 '21 15:09 cpr-case

After some time, you just get a stream of error messages:

{ "level": "error", "message": "Error Message: The security token included in the request is expired", "timestamp": "2021-09-28T15:13:37.497Z" }

maybe once every 45 seconds to a minute.

cpr-case avatar Sep 28 '21 15:09 cpr-case

Finally, after about two hours, the first region completes:

{ "level": "info", "message": "<<<<<<<<<< import of config complete for account: 139212093688 region: us-east-1 finished!", "timestamp": "2021-09-28T15:46:55.303Z" }

It then attempts to go off and process data from other accounts -- even though I removed them from the discovered regions using the Web UI -- removes them from the ElasticSearch store, and finishes about 15 minutes later.

The discovery process eventually recovers, in the sense that another task starts running. I guess it's working, in that the lengthy discovery processes eventually complete with (at this moment) 854 unique resources -- including resources that don't belong to the account being discovered. I guess the discovery process didn't remove them from the database?

cpr-case avatar Sep 28 '21 18:09 cpr-case

There's still something weird going on here; it shouldn't take 2 hours to discover 850 resources. Regarding the other accounts that have been removed in the UI, I wonder if there are still resources hanging around in the Config aggregator (it will be called aws-perspective-<region>-<account-no>-aggregator). Can you check in the Config console and see if it still has the other accounts? Maybe the call to Config to remove them failed.
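If it's quicker than clicking around the console, the CLI should show what the aggregator still thinks it's tracking, something like:

# list the aggregators and the accounts/regions they are configured to pull from
aws configservice describe-configuration-aggregators \
  --query 'ConfigurationAggregators[].{Name:ConfigurationAggregatorName,Sources:AccountAggregationSources}'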

svozza avatar Sep 28 '21 18:09 svozza

Yes, the aggregator "knows" about three accounts and it's going off and trying to update resources in a ton of regions: me-south-1, eu-west-3, us-west-1, etc. I have no idea why it's trying those regions.

cpr-case avatar Sep 28 '21 18:09 cpr-case

So, looking at the cfn that sets up the aggregator in the account where Perspective lives: https://github.com/awslabs/aws-perspective/blob/8a75b2e5314f57a22556df51b5dd9191574e68b3/source/cfn/templates/zoom-import-and-aggregator.yaml#L18 it looks like the AllRegions option is enabled but that doesn't explain why it's enabled for the other accounts. It should only affect the Perspective account.

svozza avatar Sep 28 '21 18:09 svozza

If you edit the aggregator to remove the extra regions that should definitely help with the throttling, especially if you've a bunch of resources in those other regions you don't care about. Also, we can look at changing that cfn template to only use the region Perspective is installed in initially.
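If you want to script it rather than edit it in the console, put-configuration-aggregator replaces the aggregator's source configuration wholesale; something like this (the aggregator name follows the pattern mentioned above, and the account IDs and regions here are just examples -- adjust them to what you actually use):

aws configservice put-configuration-aggregator \
  --configuration-aggregator-name aws-perspective-us-east-1-300128025978-aggregator \
  --account-aggregation-sources '[{"AccountIds":["300128025978","139212093688"],"AllAwsRegions":false,"AwsRegions":["us-east-1","us-east-2"]}]'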

svozza avatar Sep 29 '21 11:09 svozza

I rebuilt with only one account, two regions, removed the rest of the unused regions from the aggregator, and a discovery import now takes about 30-45 minutes to discover approximately 350 resources.

I get a ton of error messages from the task that look like

{ "level": "error", "message": "Error linking 9b140fac7eab812d19be7a6b312938ee to e045b5fa500e03bbc628012018710c2e with label AWS::TAGS::TAG", "timestamp": "2021-09-29T16:38:03.723Z" }

so that might be extending the time it takes.

cpr-case avatar Sep 29 '21 18:09 cpr-case

Can you grab the whole logs for that discovery task that took 30-45 mins and attach them here? Maybe something will jump out at me. You'll probably have to get them directly from the CloudWatch Logs console rather than the ECS one.

svozza avatar Sep 29 '21 18:09 svozza

It ran from 12:01:16 to 12:38:07. Is there an easy way to download all events from that one log stream?

There are 3900 events in less than a minute of elapsed time, and I got tired of hitting "load newer events".
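(Maybe something like this would do it? The log group and stream names below are placeholders for whatever the discovery task actually writes to.)

# dump every event from one stream; filter-log-events paginates automatically
aws logs filter-log-events \
  --log-group-name <discovery-task-log-group> \
  --log-stream-names <log-stream-name> \
  --output json > discovery-run-log.json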

cpr-case avatar Sep 29 '21 19:09 cpr-case