terraform-aws-control_tower_account_factory

AFT-Invoke-Customizations concurrency issue

Open sl-ediquas opened this issue 2 years ago • 3 comments

Terraform Version 1.2.4

AFT Version: 1.5.2

Bug Description There's a concurrency issue in the step function [aft-invoke-customizations]. When multiple accounts are invoked, the 'Invoke Provisioning Framework' step runs with a concurrency of 25, but the Python Lambda it calls doesn't seem to handle API call retries properly (probably a configuration issue with the botocore session), and that crashes the whole process. To work around it I had to set the concurrency to 1, which makes the process very slow.
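For readers unfamiliar with the retry behavior described above, here is a minimal sketch of retrying a throttled call with exponential backoff and full jitter. It is not AFT's code: `ThrottledError` and `call_with_backoff` are hypothetical names, and the delays are illustrative defaults. botocore also exposes its own retry settings (`max_attempts` and `mode` under `retries` in `botocore.config.Config`), which may be the simpler route in practice.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as TooManyRequestsException."""


def call_with_backoff(func, max_attempts=8, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Call func(), retrying on ThrottledError with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the throttling error
            # Cap the exponential delay, then pick a random point below the cap
            # so that concurrent workers do not all retry at the same moment.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(cap * rng())
```

The jitter matters here because 25 concurrent workers retrying on the same fixed schedule would tend to collide again on every retry.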

To Reproduce Steps to reproduce the behavior:

  1. Prerequisites: several accounts must have been provisioned in AFT (20+ on our side)
  2. Go to the step function 'aft-invoke-customizations'
  3. Click on 'New Execution'
  4. Enter the input {"include": [{"type": "all"}]} to run against all accounts
  5. See error
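The reproduction steps above can also be scripted. The following is a rough sketch, assuming boto3 is installed and credentials are configured; the state machine ARN is a placeholder you would need to supply, and both function names are hypothetical.

```python
import json


def build_invoke_input(target_type="all"):
    """Return the execution input from step 4 as a JSON string."""
    return json.dumps({"include": [{"type": target_type}]})


def start_customizations(state_machine_arn, execution_input):
    """Start one execution of the state machine; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so build_invoke_input works without boto3

    sfn = boto3.client("stepfunctions")
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,  # placeholder: ARN of aft-invoke-customizations
        input=execution_input,
    )
    return response["executionArn"]
```

Calling `start_customizations(arn, build_invoke_input())` would correspond to steps 2 to 4.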

Expected behavior

Related Logs Provide any related logs or error messages to help explain your problem.

Additional context
{
  "errorMessage": "An error occurred (TooManyRequestsException) when calling the ListAccounts operation (reached max retries: 4): AWS Organizations can't complete your request because another request is already in progress. Try again later.",
  "errorType": "TooManyRequestsException",
  "stackTrace": [
    " File \"/var/task/aft_account_provisioning_framework_get_account_info.py\", line 34, in lambda_handler\n return get_account_info(\n",
    " File \"/opt/python/lib/python3.8/site-packages/aft_common/account_provisioning_framework.py\", line 250, in get_account_info\n account_id = utils.get_account_id_from_email(ct_management_session, email)\n",
    " File \"/opt/python/lib/python3.8/site-packages/aft_common/aft_utils.py\", line 169, in get_account_id_from_email\n for page in paginator.paginate():\n",
    " File \"/opt/python/lib/python3.8/site-packages/botocore/paginate.py\", line 253, in __iter__\n response = self._make_request(current_kwargs)\n",
    " File \"/opt/python/lib/python3.8/site-packages/botocore/paginate.py\", line 332, in _make_request\n return self._method(**current_kwargs)\n",
    " File \"/opt/python/lib/python3.8/site-packages/botocore/client.py\", line 415, in _api_call\n return self._make_api_call(operation_name, kwargs)\n",
    " File \"/opt/python/lib/python3.8/site-packages/botocore/client.py\", line 745, in _make_api_call\n raise error_class(parsed_response, operation_name)\n"
  ]
}
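The stack trace shows the throttled call coming from a paginated ListAccounts loop inside get_account_id_from_email. One general way to reduce such calls is to page through the accounts once and answer later lookups from memory. The sketch below illustrates that idea only; it is not AFT's implementation, `AccountIndex` and `fetch_pages` are hypothetical names, and a cache like this only helps for lookups made within the same process.

```python
class AccountIndex:
    """Builds an email -> account ID map from a single pass over ListAccounts pages."""

    def __init__(self, fetch_pages):
        # fetch_pages is a hypothetical callable returning an iterable of pages,
        # each shaped like a ListAccounts response: {"Accounts": [{"Id": ..., "Email": ...}]}
        self._fetch_pages = fetch_pages
        self._by_email = None

    def account_id_for(self, email):
        """Return the account ID for an exact email match, or None if unknown."""
        if self._by_email is None:
            # First lookup: page through all accounts once and keep the result.
            self._by_email = {
                account["Email"]: account["Id"]
                for page in self._fetch_pages()
                for account in page["Accounts"]
            }
        return self._by_email.get(email)
```

With boto3, `fetch_pages` could be something like `lambda: org_client.get_paginator("list_accounts").paginate()`, where `org_client` is an Organizations client you create yourself.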

sl-ediquas Aug 17 '22 07:08

Hi there, thanks for bringing this to our attention. We've got a backlog item that we're working through internally to address this issue.

adam-daily Aug 18 '22 16:08

Hi team, any update on this one ?

sl-ediquas Sep 05 '22 08:09

Hi @sl-ediquas,

We don't have an ETA for this bug fix at this time.

stumins Sep 09 '22 21:09

Is it possible to raise this concurrency beyond the hard-coded 25? Or rather, set it to 0 and let SFN go as fast as it can?

Executing baselines on 200+ accounts takes many hours at this stage, even for what is a no-op from a customizations perspective.
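For context on the "set it to 0" suggestion: in Amazon States Language, a Map state's MaxConcurrency of 0 places no limit on concurrency. The sketch below builds a minimal Map state to show where that field sits. The state name and ItemsPath are hypothetical and do not reflect AFT's actual state machine definition.

```python
import json


def map_state(max_concurrency):
    """Return a minimal Amazon States Language Map state as a dict."""
    return {
        "Type": "Map",
        "ItemsPath": "$.accounts",          # hypothetical input path
        "MaxConcurrency": max_concurrency,  # 0 places no limit on concurrency
        "Iterator": {
            "StartAt": "InvokeProvisioningFramework",  # hypothetical state name
            "States": {
                "InvokeProvisioningFramework": {"Type": "Pass", "End": True}
            },
        },
        "End": True,
    }


print(json.dumps(map_state(0), indent=2))
```

Note that lifting the limit would presumably increase the number of simultaneous ListAccounts calls, so it may make the throttling described in this issue more likely unless the calls themselves are reduced.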

kadrach Sep 28 '22 10:09

Just hit the same brick wall. I like the suggestion by @kadrach. We are only at 130ish accounts.

pmmalinov01 Oct 06 '22 06:10

Dealing with the same issue with 35+ accounts.

alvarado-fabian Nov 10 '22 14:11

Hi team,

I'd like to bring your attention to this bug and share my experience when we want to make a deployment.

We typically deploy in batches per environment (dev, pre and pro) to reduce the impact in case something goes wrong, and in some situations we deploy to all accounts in the organization. As the number of accounts in our organization started to grow, we hit this bug when deploying to all accounts and managed to work around it by retrying the SF invocation. The number of accounts has kept growing, and we have reached a point where we can't even deploy per environment because it always fails.
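The batching approach described above could be sketched as follows: split the account IDs into fixed-size groups and build one execution input per group. This assumes the input accepts an "accounts" target type with a "target_value" list of account IDs; please verify that shape against the AFT documentation for your version. `batch_inputs` is a hypothetical helper.

```python
import json


def batch_inputs(account_ids, batch_size):
    """Yield one execution input (JSON string) per batch of account IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(account_ids), batch_size):
        batch = account_ids[start:start + batch_size]
        # Assumed input shape: "accounts" type with a list of account IDs.
        yield json.dumps({"include": [{"type": "accounts", "target_value": batch}]})
```

Each yielded string would be passed as the input of a separate execution, ideally waiting for one execution to finish before starting the next.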

Thanks, Francisco

fjromerom Nov 30 '22 12:11

Hey @fjromerom,

Thanks for continuing to raise attention here. While we don't have a firm ETA at this time, we're aware of the issue and are investigating fixes.

stumins Dec 02 '22 22:12

Hi all,

Thank you for your patience regarding this issue. We've just released 1.7.0, which optimizes the network calls made by the aft-account-provisioning-framework step function to address the reported throttling errors. Everything looks good in internal testing - however, as every environment is different, we're leaving this issue open to collect additional feedback.

If you were experiencing this problem, please upgrade your environment to 1.7.0 and let us know via this issue if you continue to face any errors during concurrent customization executions.

stumins Dec 13 '22 18:12

We haven't received any reports of continuing concurrency issues, so I'm going to go ahead and close this issue as completed. Please feel free to open new bug reports as required.

stumins Jan 07 '23 02:01