amplify-cli

Missing Feature Flag & Error: Cannot exceed quota for PoliciesPerRole: 10

Open jakemitchellxyz opened this issue 2 years ago • 8 comments

Before opening, please confirm:

  • [X] I have installed the latest version of the Amplify CLI (see above), and confirmed that the issue still persists.
  • [X] I have searched for duplicate or closed issues.
  • [X] I have read the guide for submitting bug reports.
  • [X] I have done my best to include a minimal, self-contained set of instructions for consistently reproducing the issue.

How did you install the Amplify CLI? npm -g

If applicable, what version of Node.js are you using? irrelevant

Amplify CLI Version 5.4.0

What operating system are you using? issue has been replicated and occurs on all platforms

Amplify Categories function, api

Amplify Commands push

Describe the bug

TL;DR: We need a feature flag added to disable some broken logic that was added in v4.50.0. It was considered at the time of adding the logic, but y'all chose (inappropriately) to not include it when merging to master. We need the feature flag added ASAP because amplify push is now broken for any "large" REST APIs, and as you'll see below, you already have the code for this particular feature flag. There still needs to be discussion on the correct permanent fix, but a feature flag will at least make amplify push stable again.

There was an old issue, #6846, that was closed without a fix. This issue reopens it as a true flaw in the system: the solution that closed #6846 is not sufficient for Amplify to be considered stable.

Intro: This is a critical issue that makes it impossible for us to deploy safely (running amplify push will crash perpetually for our project and others of its size until this is fixed). We have spent quite some time analyzing the root cause and have brainstormed some longer-term solutions for your consideration in addition to the necessary hotfix of a feature flag. This issue was introduced in Amplify CLI v4.50.0, when a critical flaw was patched in amplify-provider-awscloudformation. This patch addressed issue #2084. The root cause of this new issue comes from PR #6904, the very same PR that solved #2084. Unfortunately, the solution implemented is not entirely sufficient and actually makes the situation slightly worse, and I'll explain precisely why.

In order to understand the new problem, we'll first need to understand the original problem and the strategy that @cjihrig implemented to fix that problem:

Original Problem (#2084): When you create "a lot" (>5) of Lambda functions that are exposed on the same API Gateway REST API, the permissions that Amplify attaches to the API Gateway IAM Policy are so verbose that we hit AWS's Maximum Policy Size limit of 10240 bytes. It is so verbose because Amplify generates a block of ~30 lines of code for each method enabled on each endpoint provisioned, and it does this twice: once for the endpoint, and once with a trailing wildcard for proxying sub-routes. In other words, if you expose 1 Lambda function on 1 endpoint of an API, 10 permission blocks are generated: one for each method [get, post, put, patch, delete], and then another for each of them with a trailing /*. This translates to ~300 lines of code (including plenty of whitespace) for each endpoint added. It's pretty easy to see how this blocks a user from creating even a medium-sized API.
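To make the arithmetic above concrete, here is a minimal TypeScript sketch that enumerates the permission entries for a single path. The `permissionEntries` helper and the simplified `METHOD/path` entry format are hypothetical and are not Amplify's actual code; the method list and the ~30-lines-per-block figure are taken from the description above.

```typescript
// Illustrative only: `permissionEntries` is a hypothetical helper, not Amplify's code.
const METHODS = ["GET", "POST", "PUT", "PATCH", "DELETE"];

// Returns one entry per method for the path itself, plus one per method
// with a trailing "/*" for proxied sub-routes.
function permissionEntries(path: string): string[] {
  const entries: string[] = [];
  for (const method of METHODS) {
    entries.push(`${method}${path}`);
    entries.push(`${method}${path}/*`);
  }
  return entries;
}

const LINES_PER_BLOCK = 30; // rough figure quoted in this issue
const blocksPerEndpoint = permissionEntries("/items").length;
console.log(blocksPerEndpoint); // 10
console.log(blocksPerEndpoint * LINES_PER_BLOCK); // 300
```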

Solution to the Original Problem (#6904): The solution that @cjihrig implemented is clever, but unfortunately not a perfect fix in its current implementation. At deploy time (right after the user says y to confirm an amplify push), it loops over the permissions, consolidates them into a dedicated APIGatewayAuthStack, and splits them into separate ManagedPolicies attached to this new nested stack (which attaches them to the AuthRole). The logic for this is in a file called consolidate-apigw-policies.ts. It basically just checks whether adding a block to the policy would cause it to surpass the 10240 byte limit, and if so, it provisions a new ManagedPolicy and injects the block into that new policy instead. This solution avoids ever hitting the byte limit, but it completely overlooks the fact that you are only allowed to attach 10 ManagedPolicies to a single Role. Now let's discuss why this is so problematic.
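The splitting behavior described above can be sketched roughly as follows. This is not the code from consolidate-apigw-policies.ts; `packStatements` is a hypothetical stand-in that assumes each permission block is an ASCII string whose length equals its size in bytes. The two limits are the ones quoted in this issue.

```typescript
// Rough sketch of the described behavior, NOT Amplify's implementation.
const MAX_POLICY_BYTES = 10240;   // policy size limit quoted in this issue
const MAX_POLICIES_PER_ROLE = 10; // PoliciesPerRole quota quoted in this issue

// Greedily fill a policy; start a new one when the next block would overflow.
// Assumes ASCII blocks, so string length is used as the byte size.
function packStatements(blocks: string[], maxBytes: number = MAX_POLICY_BYTES): string[][] {
  const policies: string[][] = [];
  let current: string[] = [];
  let currentBytes = 0;
  for (const block of blocks) {
    if (current.length > 0 && currentBytes + block.length > maxBytes) {
      policies.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(block);
    currentBytes += block.length;
  }
  if (current.length > 0) policies.push(current);
  return policies;
}

// Demo: 120 blocks of 1000 ASCII characters each (12 endpoints x 10 blocks).
const demo: string[] = [];
for (let i = 0; i < 120; i++) demo.push(new Array(1001).join("x")); // 1000 "x" characters

const policies = packStatements(demo);
console.log(policies.length);                         // 12
console.log(policies.length > MAX_POLICIES_PER_ROLE); // true: the byte limit is respected, the role quota is not
```

Nothing in this loop ever looks at how many policies have been produced, which is the gap this issue describes.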

The New Problem: Let's say we have 15 endpoints exposed on an API (we are hitting this issue with 12 endpoints, each pointing to a different Lambda function). Before @cjihrig's fix, this would have failed to push with "Error: Maximum policy size of 10240 bytes exceeded for role ***". After @cjihrig's fix, Amplify intelligently avoids this error, just to fail with a different error: "Error: Cannot exceed quota for PoliciesPerRole: 10".

Unfortunately, this is slightly worse than the original issue for one precise reason: the IAM permissions are generated AFTER confirming y to amplify push. This leaves absolutely no room to correct them manually. The original issue could be fixed by hand by editing amplify/backend/api/APIGatewayAuthStack.json and pushing again, but now that this file is overwritten at the time of deployment, manual edits are trashed by the CLI. The only possible way for us to continue deploying is described in the "Additional Information" section below, and it is not reasonable. We can get back to a stable state by simply adding a feature flag that disables this logic from running, which will allow us to keep any manual changes to amplify/backend/api/APIGatewayAuthStack.json.

The PoliciesPerRole limit can be solved on one account by requesting a service limit increase, but this is not a scalable solution, as it only solves it for one AWS account. The permanent solution to this issue is unclear, but some potential fixes and caveats are listed below to help y'all start the conversation.

Expected behavior

The Proposed Solution # 1 below (adding a feature flag) should have been implemented with the original solution. Turns out, it actually was, and then y'all decided to revert it for some unknown reason. @cjihrig's PR included the necessary Feature Flag until @attilah left a comment, mentioning @renebrandel, and effectively scaring @cjihrig into reverting the Feature Flag in a new commit. The documentation of it was also reverted.

I have no idea what conversation went on between those 3 people, but for future changes like this, if a feature flag is even considered, KEEP IT. It's extremely easy as a consumer of the Amplify CLI to watch the changelog and look at the feature flag documentation and enable things. It is extremely time consuming and painstaking to look through y'all's codebase and PRs to track down fundamental issues like this. This has screwed our progress for days, as you can clearly tell from the amount of information I've needed to aggregate to communicate this issue. Not to mention the fact that amplify push is completely broken for us at present without the feature flag. Amplify is too big now to be skipping feature flags on fragile new features. They exist for a reason.

In any case, if you prefer it to be default behavior (which is fine for tiny projects), then just reverse the feature flag, so that we can opt out of the behavior with the feature flag instead of enabling the behavior with it. But we need the ability to opt out: this logic is not ready for large projects, and y'all should consider the other proposed solutions (and many more) for a more permanent fix to the PoliciesPerRole limit after implementing the feature flag.
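As a sketch of the opt-out semantics requested above: the flag name below is made up for illustration and is not an existing Amplify CLI feature flag, and `shouldConsolidate` is a hypothetical guard rather than anything in the CLI codebase.

```typescript
// Hypothetical flag shape: nothing here is an existing Amplify CLI flag or API.
interface FeatureFlags {
  [name: string]: boolean | undefined;
}

const HYPOTHETICAL_FLAG = "consolidateApiGatewayPolicies"; // invented name

// Opt-out semantics: consolidation stays the default and is skipped only
// when the flag is explicitly set to false.
function shouldConsolidate(flags: FeatureFlags): boolean {
  return flags[HYPOTHETICAL_FLAG] !== false;
}

console.log(shouldConsolidate({}));                                       // true  (default behavior kept)
console.log(shouldConsolidate({ consolidateApiGatewayPolicies: false })); // false (opted out)
```

With a guard like this, small projects keep the current behavior without changing anything, while larger projects can disable the step and keep their hand-edited APIGatewayAuthStack.json.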

Reproduction steps

  1. Create a REST API
  2. Create 15+ Lambda functions
  3. Expose each Lambda function on their own endpoints of the API
  4. Run amplify push and see the new policies get added to amplify/backend/api/APIGatewayAuthStack.json
  5. Add more Lambda functions as endpoints until amplify push fails with "Error: Cannot exceed quota for PoliciesPerRole: 10"

Possible Solutions to the New Problem:

Reminder: These are just cursory ideas, but hopefully it can help get the ball rolling on a permanent solution, perhaps by combining some of them together

  1. !! Introduce a Feature Flag that opts out of the consolidation logic execution
     • Value Add: Allows the user to opt out of ever running the consolidation logic entirely (this is what our team would prefer)
     • Caveat: Just a way to allow manual adjustments again, and doesn't solve the problem at the root
  2. Execute the consolidation logic at the time of updating the API with amplify update api instead of amplify push
     • Value Add: Re-enables the ability to manually adjust the APIGatewayAuthStack.json without being overwritten invalidly at the time of deployment
     • Value Add: Does not force the user to oversimplify their permissions because it isn't a programmatic fix like some of the ideas below
     • Caveat: Any adjustments to the API endpoints using amplify update api would override manual changes (but this is the responsibility of the developer to resolve, not the CLI)
     • Caveat: Any manual adjustments to the API Gateway CFN Templates would not trigger the consolidation logic
  3. Add an opt-in y/n question to run the consolidation logic at the time of running amplify push
     • Value Add: Doesn't force the user to use the consolidation logic and override their manual APIGatewayAuthStack.json
     • Value Add: Solves the caveat identified in solution # 2 by allowing the user to still run it at the time of deployment if they need to, without forcing them to like the current solution does.
     • Caveat: The question would get annoying for users who never hit this issue, but it must be asked every time for it to work
  4. Intelligently group the permission blocks with wildcards if all 5 methods are enabled with the same auth rules (reducing the lines of code generated per endpoint from ~300 to ~60)
     • Value Add: reduces the number of ManagedPolicies that are needed to accommodate the bytes necessary, increasing the number of functions needed before hitting the ManagedPolicies limit on the Role.
     • Value Add: "lossless" permission compression because it first checks that the methods have the same auth before consolidating them into a wildcard
     • Caveat: does not actually solve the issue, just delays it until the API is larger
  5. When the consolidation logic runs, if it detects that all functions attached to the API have the same permissions and methods, consolidate into wildcards for both methods and endpoints
     • Value Add: permanently fixes the issue for any APIs that use the same permissions and method access for all endpoints
     • Value Add: saves the user from needing to manually adjust APIGatewayAuthStack.json after every single update to the API (which is currently required)
     • Caveat: does nothing to help APIs that have a lot of endpoints with varying permissions
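The "lossless" method grouping from idea 4 above could look something like the sketch below. The `RouteAuth` data model, the auth-rule labels, and the simplified `METHOD/path` entry strings are all assumptions made for illustration; Amplify's real templates are structured differently.

```typescript
// Hypothetical data model: each route maps HTTP methods to an auth-rule label.
type RouteAuth = { path: string; methods: { [method: string]: string } };

const ALL_METHODS = ["GET", "POST", "PUT", "PATCH", "DELETE"];

// Collapse a route into method-wildcard entries only when all five methods
// are enabled with the same auth rule; otherwise keep one pair per method.
function compressRoute(route: RouteAuth): string[] {
  const rules: string[] = [];
  for (const method of ALL_METHODS) {
    const rule = route.methods[method];
    if (rule === undefined) return expand(route); // a method is not enabled
    rules.push(rule);
  }
  for (const rule of rules) {
    if (rule !== rules[0]) return expand(route); // auth rules differ
  }
  return [`*${route.path}`, `*${route.path}/*`];
}

// Uncompressed form: the path itself plus a trailing "/*" for every enabled method.
function expand(route: RouteAuth): string[] {
  const out: string[] = [];
  for (const method of Object.keys(route.methods)) {
    out.push(`${method}${route.path}`, `${method}${route.path}/*`);
  }
  return out;
}

const uniform: RouteAuth = {
  path: "/items",
  methods: { GET: "auth", POST: "auth", PUT: "auth", PATCH: "auth", DELETE: "auth" },
};
console.log(compressRoute(uniform).length); // 2 entries instead of 10
```

At ~30 lines per block, 2 entries instead of 10 matches the ~300 to ~60 reduction estimated in idea 4. Routes with mixed rules or missing methods fall through to `expand` unchanged, which is what keeps the compression lossless.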

Additional information

The only way that we are able to push currently is by running amplify push, and then the moment after saying y to "Are you sure you want to push?", we have to very quickly open the APIGatewayAuthStack.json file, which just got overwritten by the push command, and paste the custom version into it. We save the file, and if we did it quickly enough, Amplify will correctly use our version instead of the broken code that it autogenerated. We have to do this every single time we push, making it literally impossible to use any CI/CD pipeline, and we cannot trust all of our devs to remember to do it every time, so we have to limit who is doing the deployments, to make sure they do this janky process every time. We need to disable this logic to continue using Amplify. You already have the code that solves our problem. We need that feature flag badly.

jakemitchellxyz avatar Sep 06 '21 16:09 jakemitchellxyz

Hi @jakemitchellxyz thanks for writing up this detailed report and sorry for the inconvenience it's caused. The team will evaluate adding the feature flag you suggested and will also consider your other suggestions for more long term fixes.

edwardfoyle avatar Sep 16 '21 19:09 edwardfoyle

Are there any updates on this? The "Cannot exceed quota for PoliciesPerRole: 10" error is absolutely critical and will potentially block production apps from scaling correctly.

We just put in a limit increase for 20, but that is a bandaid solution to this issue. Amplify needs to manage IAM better than it currently does, because not only does it fail to manage roles correctly, but it strips developers of all capability of manually fixing the problems themselves.

ethoman avatar Jan 16 '22 03:01 ethoman

Hi there, are there any updates on this? As ethoman stated, this problem is critical, and it is currently completely blocking for us.

ldaudet avatar Feb 01 '22 14:02 ldaudet

Having the same problem and limit with:

  • 2 API Gateway
  • GraphQL (55 tables)

In graphql auth, we have on all tables: {allow: private, provider: iam, operations: [read, create, update, delete]}. Amplify is creating 10 policies (splitting table access across them), and they cannot be attached to the authRole because there are other policies to attach too...

I think the problem is that graphql-auth-transformer.ts is not checking if there are already other attached policies to the auth user. In my case, one api gateway is already attaching a policy to the auth user.
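A minimal sketch of the kind of check suggested above, counting policies already attached to the role before attaching generated ones. The function names are hypothetical and nothing here is taken from graphql-auth-transformer.ts.

```typescript
// Hypothetical helpers: not part of Amplify or graphql-auth-transformer.ts.
function remainingPolicySlots(quota: number, alreadyAttached: number): number {
  return Math.max(0, quota - alreadyAttached);
}

// True only if every generated policy still fits under the role's quota.
function canAttach(quota: number, alreadyAttached: number, generated: number): boolean {
  return generated <= remainingPolicySlots(quota, alreadyAttached);
}

console.log(canAttach(10, 1, 10)); // false: 10 generated + 1 existing exceeds a quota of 10
console.log(canAttach(10, 1, 9));  // true
```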

SebSchwartz avatar Feb 08 '22 15:02 SebSchwartz

I asked for quota update and now I have this error: Cannot exceed quota for PoliciesPerRole: 20

So I have a project where I cannot do anything: I cannot update my version and our production is stuck...

@ammarkarachi @lazpavel

SebSchwartz avatar Feb 10 '22 08:02 SebSchwartz

Just got this error trying to upgrade to graphqltransformer v2. Now, we're dead in the water. This is so frustrating.

CodySwannGT avatar Apr 21 '22 22:04 CodySwannGT

Same as @CodySwannGT: we upgraded to graphqltransformer v2 and are now stuck with the error Cannot exceed quota for PoliciesPerRole: 10. When I check the role in the CloudFormation template (amplify/backend/api/XXXXX/build/cloudformation-template.json), one of my roles contains more than 43 resources:

      "PolicyDocument": {
          "Statement": [
            {
              "Action": "appsync:GraphQL",
              "Effect": "Allow",
              "Resource": [

thibaultdalban avatar Apr 29 '22 17:04 thibaultdalban

Any updates on this issue?

ArturoTorresMartinez avatar Sep 16 '22 18:09 ArturoTorresMartinez

Hey folks :wave: this issue has since been fixed. Closing 🙂

josefaidt avatar Sep 22 '22 16:09 josefaidt

@josefaidt Was this fixed only for APIs using APIGateway?

Our team is running into the exact same problem as @CodySwannGT and @thibaultdalban when migrating to GraphQL transformer v2 due to the service quota having a maximum of 20 (even after requested increase).

We are using version 10.5.1 and have around 40 models with multiple auth rules per model.

While we might be able to work around this by removing/consolidating auth rules and removing queries/mutations, we would be pretty much completely blocked from scaling our application if we were to migrate to v2.

antantonton avatar Dec 02 '22 16:12 antantonton