aws-cdk

(stepfunctions): CDK generated stepfunction roles breaking inflight stepfunction executions with versioned lambdas

Open nsaman opened this issue 3 years ago • 8 comments

What is the problem?

We use the auto-generated Step Functions roles and also use versioned Lambdas in our step functions. On deployment, the Step Functions role is updated with the new Lambda version. This causes lambda:InvokeFunction permission failures in in-flight executions: their execution definition still references the previous Lambda version, but the role now only allows the newer Lambda version.

Is there a way to have the auto-generated Step Functions roles not include the Lambda version?

Reproduction Steps

Create a step function that invokes a Lambda version. The generated Step Functions role will contain that Lambda version.

What did you expect to happen?

Step functions should not fail in-flight executions during a deployment.

What actually happened?

Step function lambda:invoke errors on mismatched Lambda versions:

Error

Lambda.AWSLambdaException

Cause

User: arn:aws:sts::335321747591:assumed-role/TidewaterWorkflowsCreateJ-CreateJournalStateMachin-184QJ29APKE3O/VAqgLpXDrcGwUULKzfuDBGJmuwiKLfzI is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-west-2:335321747591:function:LogResources:28 because no identity-based policy allows the lambda:InvokeFunction action (Service: AWSLambda; Status Code: 403; Error Code: AccessDeniedException; Request ID: 6ccb7c61-369f-4826-9fc6-113954ec38c8; Proxy: null)

CDK CLI Version

1.130.0 (build 9c094ae)

Framework Version

No response

Node.js Version

12

OS

macOS 10.15.7

Language

TypeScript

Language Version

No response

Other information

No response

nsaman avatar Nov 15 '21 20:11 nsaman

Hey @nsaman, how exactly are you going about this in your code?

Are you making use of the LambdaInvoke construct?

peterwoodworth avatar Nov 15 '21 22:11 peterwoodworth

Yes, we are creating a LambdaInvoke on lambdaConstruct.currentVersion.functionArn

nsaman avatar Nov 15 '21 23:11 nsaman

What exactly do you mean by this, @nsaman? A snippet of the relevant parts of your code would be helpful.

peterwoodworth avatar Nov 16 '21 00:11 peterwoodworth

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

github-actions[bot] avatar Nov 20 '21 20:11 github-actions[bot]

So LambdaInvoke generates policies from the Lambda function ARN. The function ARN is versioned, and we want the specific versions to be executed. Suppose our step function is long-running and points to Lambda version 1. The role has permission to invoke v1. A CDK deployment then updates the Lambda version to 2, and the Step Functions role is updated to allow invoking v2 only. When an execution that is already running invokes Lambda v1, it fails because the permissions now cover only v2.
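The sequence described above can be sketched in plain TypeScript. This is a simplified model, not CDK or IAM code: `RolePolicy`, `canInvoke`, the function name, and the account ID are all placeholders invented for illustration, and the policy is assumed to list exact ARNs with no wildcards.

```typescript
// Simplified model of an identity-based policy that lists exact function ARNs.
// All names and ARNs below are placeholders, not values from this issue.
type RolePolicy = { allowedResources: string[] };

// With no wildcards in the policy, an entry matches only the identical ARN string.
function canInvoke(policy: RolePolicy, functionArn: string): boolean {
  return policy.allowedResources.some((resource) => resource === functionArn);
}

const baseArn = 'arn:aws:lambda:us-west-2:111122223333:function:MyFunction';

// Deployment 1: the state machine definition and the role both reference version 1.
let rolePolicy: RolePolicy = { allowedResources: [`${baseArn}:1`] };

// An execution that starts now keeps pointing at version 1 until it finishes.
const inflightTarget = `${baseArn}:1`;
const allowedBeforeDeploy = canInvoke(rolePolicy, inflightTarget);

// Deployment 2: the role is rewritten to reference version 2 only.
rolePolicy = { allowedResources: [`${baseArn}:2`] };

// The in-flight execution still invokes version 1 and is now denied,
// while executions started after the deployment invoke version 2 and succeed.
const allowedAfterDeploy = canInvoke(rolePolicy, inflightTarget);
const newExecutionAllowed = canInvoke(rolePolicy, `${baseArn}:2`);
```

In this model `allowedBeforeDeploy` and `newExecutionAllowed` are true while `allowedAfterDeploy` is false, which corresponds to the AccessDeniedException in the report.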

There could be multiple solutions. For example, you could add a function addTaskPolicy to update the read-only property taskPolicy in LambdaInvoke.

mohitpali avatar Nov 24 '21 18:11 mohitpali

The solution for this will be to generate a policy that looks like:

Resource: [
  'arn:aws:lambda:....:MyFunction',
  'arn:aws:lambda:....:MyFunction:*',
]

It will go well with a change Lambda is about to make where invocations that involve Qualifiers also need to have the qualified ARN in the policy.
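As a rough sketch of how such a resource list could be derived from a versioned ARN, the plain TypeScript below strips the qualifier and emits both the unqualified ARN and a `:*` pattern. The helper names (`unqualifiedArn`, `invokeResources`, `matchesPattern`, `isAllowed`) and the example ARN are hypothetical, and the matcher handles only the `*` wildcard, so it is a simplification of real IAM resource matching rather than a reimplementation of it.

```typescript
// Lambda function ARNs have the form
//   arn:aws:lambda:<region>:<account>:function:<name>[:<qualifier>]
// so an unqualified ARN has 7 colon-separated parts.
function unqualifiedArn(functionArn: string): string {
  const parts = functionArn.split(':');
  return parts.length > 7 ? parts.slice(0, 7).join(':') : functionArn;
}

// Build the two-entry resource list: the function itself plus any qualifier.
function invokeResources(functionArn: string): string[] {
  const base = unqualifiedArn(functionArn);
  return [base, `${base}:*`];
}

// Minimal matcher that supports only "*" (assumption: enough for this sketch).
function matchesPattern(pattern: string, value: string): boolean {
  const escaped = pattern
    .split('*')
    .map((piece) => piece.replace(/[.+?^${}()|[\]\\]/g, '\\$&'));
  return new RegExp(`^${escaped.join('.*')}$`).test(value);
}

function isAllowed(resources: string[], functionArn: string): boolean {
  return resources.some((resource) => matchesPattern(resource, functionArn));
}

// Placeholder ARN for illustration only.
const exampleArn = 'arn:aws:lambda:us-west-2:111122223333:function:MyFunction';
const resources = invokeResources(`${exampleArn}:2`);

const oldVersionAllowed = isAllowed(resources, `${exampleArn}:1`);
const newVersionAllowed = isAllowed(resources, `${exampleArn}:2`);
const unqualifiedAllowed = isAllowed(resources, exampleArn);
const otherFunctionAllowed = isAllowed(resources, `${exampleArn}Other:1`);
```

With this shape, a deployment that moves the definition from one version to the next does not change the resource list, so in-flight executions on the older version keep their permission. The trade-off is that every version of the function becomes invocable by the role.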

rix0rrr avatar Feb 24 '22 09:02 rix0rrr

Hi @nsaman and @mohitpali, apologies for the long delay on this. We've been looking at this recently and came to the conclusion that just providing additional permissions is not the right approach. When using versions in this scenario, there are quite a few things to consider, and none of them work automatically. The tl;dr is that when a new version is created, the previous version ceases to be managed by CDK/CFN. Deletion can be avoided by setting removal policies, but permissions would either have to be widely scoped (insecure) or maintained by hand (annoying). That said, the permission part is currently not easy to do.

Now to my actual question: the idiomatic way to do this in AWS is to use an Alias. Permissions are granted to the step function to invoke the alias, and when a new version is published the Alias gets updated. The step function will always run the latest version and have the correct permissions.

Are there any reasons Aliases would not work in your scenario?

PS: We are still considering addTaskPolicy and other options to open up the generated policies.

mrgrain avatar Aug 08 '22 12:08 mrgrain

Downgrading this to a p2. To provide access to all versions of a Lambda, one can do:

declare const stepLambda: lambda.Function;
declare const sfn: stepfunctions.StateMachine;

stepLambda.grantInvoke(sfn);

This is very idiomatic. For tightly scoped permissions, Lambda Alias should be used.

mrgrain avatar Aug 09 '22 17:08 mrgrain

As far as I can see, the idea behind this (and, IMHO, the reason aliases do not work here) is to make sure that an SFN execution that started with a reference to Lambda v1 always uses that code version, and not a newer version of the Lambda function that might be used by executions started later.

The reasoning is that the code might have a breaking change that does not work with the inputs in a previous step function definition.

hoegertn avatar Apr 01 '23 00:04 hoegertn

Agreed. Aliases seem like the right solution at first glance from a permissions perspective, but they play havoc with your Step Function executions in the long term. With a Lambda alias, a Step Function that is repeatedly invoking a Lambda will be shifted from running version X to version Y mid-execution, causing all kinds of backwards-compatibility issues. Unfortunately, this is not well documented in the Step Functions documentation, but it is a common occurrence.

The way to get around it is to always have Step Function definitions point to a specific version of a function, so that they keep invoking that version until the Step Function execution has completed. The problem with this is that the generated policies hard-code the version.
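The difference between the two approaches can be illustrated with a small in-memory model in plain TypeScript. It is not the Lambda or Step Functions API: `aliasTable`, `resolveVersion`, `runExecution`, and the alias name `live` are invented for this sketch, which only assumes that an alias is resolved at invoke time while a numeric qualifier always names the same version.

```typescript
// Hypothetical model: which version does each step of one execution invoke?
const aliasTable = new Map<string, number>();

function resolveVersion(qualifier: string): number {
  const aliased = aliasTable.get(qualifier);
  if (aliased !== undefined) {
    return aliased; // alias: resolved at the moment of the call
  }
  return Number(qualifier); // numeric qualifier: always the same version
}

// Run one multi-step execution; a deployment that publishes version 2 and
// repoints the alias lands right after step `deployAfterStep`.
function runExecution(qualifier: string, steps: number, deployAfterStep: number): number[] {
  aliasTable.set('live', 1);
  const versionsInvoked: number[] = [];
  for (let step = 1; step <= steps; step++) {
    versionsInvoked.push(resolveVersion(qualifier));
    if (step === deployAfterStep) {
      aliasTable.set('live', 2);
    }
  }
  return versionsInvoked;
}

// Same execution shape, once through the alias and once pinned to version 1.
const viaAlias = runExecution('live', 3, 1);
const pinned = runExecution('1', 3, 1);
```

In this model the alias-based execution runs version 1 for its first step and version 2 afterwards, while the pinned execution stays on version 1 throughout, which is why pinning needs the role to keep permission for the older version.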

I feel like there should be a broader discussion around the idiomatic ways that Step Functions recommends Lambdas should be used. As I mentioned, the documentation is silent about this as a best practice or even that this is an issue that customers need to deal with. If there is agreement on 'the best way to invoke lambdas', maybe that can be documented and incorporated into CDK.

gerritmaritz avatar Jun 16 '23 18:06 gerritmaritz

I just discovered this issue because I tried to use StepFunction Alias's deployment preference to implement traffic shaping.

const lambdaFunction = createFunction()
const lambdaInvoke = new LambdaInvoke(this, "Invoke", { lambdaFunction: lambdaFunction.currentVersion })
const stateMachine = new StateMachine(this, "StateMachine", { definitionBody: DefinitionBody.fromChainable(lambdaInvoke) })

const stateMachineVersion = new CfnStateMachineVersion(scope, "StateMachineVersion", {
  stateMachineArn: stateMachine.stateMachineArn,
  stateMachineRevisionId: stateMachine.stateMachineRevisionId,
});

const alias = new CfnStateMachineAlias(scope, "StateMachineAlias", {
  name: "active",
  deploymentPreference: {
    stateMachineVersionArn: stateMachineVersion.attrArn,
    type: "CANARY",
    interval: Duration.hours(2).toMinutes(),
    percentage: 10,
  },
});

However, during deployments the state machine's policy is automatically updated to grant invoke on the Lambda's current version, while 90% of traffic still uses the previous state machine version, which invokes the previous Lambda version.

Maybe the ideal solution is for the invocation role and policy to be versioned along with the State Machine, but that goes beyond a CDK feature request.

In my case, I'm happy to allow the state machine to invoke any version of the Lambdas. Either lambda-arn:* or lambda-arn should work. I think this could be made to work with a property on Function, Version, or LambdaInvoke that configures whether qualified or unqualified permission is granted.

everett1992 avatar Apr 10 '24 21:04 everett1992