terraform-aws-control_tower_account_factory
AFT-Invoke-Customizations concurrency issue
Terraform Version 1.2.4
AFT Version: 1.5.2
Bug Description There's a concurrency issue in the step function [aft-invoke-customizations]. When multiple accounts are invoked, the step 'Invoke Provisioning Framework' runs with a concurrency of 25, but the Python Lambda behind it doesn't seem to handle retries of the API call properly (probably a configuration issue with the botocore session), and this crashes the whole process. To work around it I had to set the concurrency to 1, which makes the process very slow.
To Reproduce Steps to reproduce the behavior:
- Prerequisites: several accounts must have been provisioned in AFT (on our side 20+)
- Go to step function 'aft-invoke-customizations'
- Click on 'New Execution'
- Type input : {"include" : [{"type" : "all"}]} to run all accounts
- See error
Expected behavior
Related Logs Provide any related logs or error messages to help explain your problem.
Additional context {"errorMessage":"An error occurred (TooManyRequestsException) when calling the ListAccounts operation (reached max retries: 4): AWS Organizations can't complete your request because another request is already in progress. Try again later.","errorType":"TooManyRequestsException","stackTrace":[" File "/var/task/aft_account_provisioning_framework_get_account_info.py", line 34, in lambda_handler\n return get_account_info(\n"," File "/opt/python/lib/python3.8/site-packages/aft_common/account_provisioning_framework.py", line 250, in get_account_info\n account_id = utils.get_account_id_from_email(ct_management_session, email)\n"," File "/opt/python/lib/python3.8/site-packages/aft_common/aft_utils.py", line 169, in get_account_id_from_email\n for page in paginator.paginate():\n"," File "/opt/python/lib/python3.8/site-packages/botocore/paginate.py", line 253, in iter\n response = self._make_request(current_kwargs)\n"," File "/opt/python/lib/python3.8/site-packages/botocore/paginate.py", line 332, in _make_request\n return self._method(**current_kwargs)\n"," File "/opt/python/lib/python3.8/site-packages/botocore/client.py", line 415, in _api_call\n return self._make_api_call(operation_name, kwargs)\n"," File "/opt/python/lib/python3.8/site-packages/botocore/client.py", line 745, in _make_api_call\n raise error_class(parsed_response, operation_name)\n"]}
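For illustration, the kind of retry handling the trace suggests is missing can be sketched as a wrapper that retries a throttled call with exponential backoff and full jitter. This is a minimal sketch, not AFT's actual code: `ThrottledError` and `call_with_backoff` are hypothetical names, and the delay values are assumptions.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as TooManyRequestsException."""


def call_with_backoff(func, max_attempts=8, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep):
    """Call func(), retrying on ThrottledError with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts:
                # Out of attempts: surface the throttling error to the caller.
                raise
            # The upper bound doubles on each attempt and is capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time in [0, cap] so that concurrent
            # callers do not all retry at the same instant.
            sleep(random.uniform(0, cap))
```

In a boto3-based Lambda, a similar effect can usually be obtained without custom code by passing `botocore.config.Config(retries={"max_attempts": ..., "mode": ...})` when creating the client. Whether that alone is enough at a concurrency of 25 is not something this thread establishes.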
Hi there, thanks for bringing this to our attention. We've got a backlog item that we're working through internally to address this issue.
Hi team, any update on this one ?
Hi @sl-ediquas,
We don't have an ETA for this bug fix at this time.
Is it possible to raise this concurrency past the hard-coded 25? Or rather, set it to 0 and let SFN go as fast as it can?
Executing baselines on 200+ accounts takes many hours at this stage even for what is a no-op from customizations perspective.
Just hit the same brick wall, I like the suggestion by @kadrach. We are only at 130ish accounts
Dealing with the same issue with 35+ accounts
Hi team,
I'd like to bring your attention to this bug and share my experience when we want to make a deployment.
We typically deploy in batches per environment (dev, pre and pro) to reduce the impact in case something goes wrong, and in some situations we deploy to all accounts in the organization. As the number of accounts in our organization grew, we hit this bug when deploying to all accounts and worked around it by retrying the SF invocation. The number of accounts has kept growing, and we have reached a situation where we can't even deploy per environment because it always fails.
Thanks, Francisco
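As a rough sketch of the batching workaround described above, the step function inputs could be generated per batch instead of using `{"type" : "all"}`. This assumes the step function accepts an `accounts` include type with a `target_value` list of account IDs; that shape is not shown in this thread and should be checked against the AFT documentation. `build_batched_inputs` is a hypothetical helper.

```python
import json


def build_batched_inputs(account_ids, batch_size):
    """Split account IDs into batches and build one step function input per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = [
        account_ids[i:i + batch_size]
        for i in range(0, len(account_ids), batch_size)
    ]
    # ASSUMPTION: the "accounts" type and "target_value" key are not confirmed
    # by this thread; only {"type": "all"} appears in the reproduction steps.
    return [
        json.dumps({"include": [{"type": "accounts", "target_value": batch}]})
        for batch in batches
    ]
```

Each returned string could then be pasted as the input of a separate execution, run one after another, so that fewer accounts hit AWS Organizations at the same time.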
Hey @fjromerom,
Thanks for continuing to raise attention here. While we don't have a firm ETA at this time, we're aware of the issue and are investigating fixes.
Hi all,
Thank you for your patience regarding this issue. We've just released 1.7.0, which optimizes the network calls made by the aft-account-provisioning-framework step function to address the reported throttling errors. Everything looks good in internal testing - however, as every environment is different, we're leaving this issue open to collect additional feedback.
If you were experiencing this problem, please upgrade your environment to 1.7.0 and let us know via this issue if you continue to face any errors during concurrent customization executions.
We haven't received any reports of continuing concurrency issues, so I'm going to go ahead and close this issue as completed. Please feel free to open new bug reports as required.