aws-cdk
(stepfunctions): CDK generated stepfunction roles breaking inflight stepfunction executions with versioned lambdas
What is the problem?
We let CDK auto-generate the stepfunction role and also use versioned lambdas in the step functions. On deployment, the stepfunction role is updated with the new lambda version. This causes lambda:InvokeFunction authorization failures in in-flight stepfunction executions: their execution definition still references the previous lambda version, but the stepfunction role now only allows the newer one.
Is there a way to have the auto-generated stepfunction role not include the lambda version?
Reproduction Steps
Create a stepfunction that invokes a lambda version. The generated stepfunction role will contain that lambda version.
What did you expect to happen?
Stepfunctions to not fail on inflight executions during a deployment
What actually happened?
Stepfunction lambda:invoke errors on mismatched lambda versions:
Error: Lambda.AWSLambdaException
Cause:
User: arn:aws:sts::335321747591:assumed-role/TidewaterWorkflowsCreateJ-CreateJournalStateMachin-184QJ29APKE3O/VAqgLpXDrcGwUULKzfuDBGJmuwiKLfzI is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-west-2:335321747591:function:LogResources:28 because no identity-based policy allows the lambda:InvokeFunction action (Service: AWSLambda; Status Code: 403; Error Code: AccessDeniedException; Request ID: 6ccb7c61-369f-4826-9fc6-113954ec38c8; Proxy: null)
CDK CLI Version
1.130.0 (build 9c094ae)
Framework Version
No response
Node.js Version
12
OS
macOS 10.15.7
Language
TypeScript
Language Version
No response
Other information
No response
Hey @nsaman, how exactly are you going about this in your code?
Are you making use of the LambdaInvoke construct?
Yes, we are creating a LambdaInvoke on lambdaConstruct.currentVersion.functionArn
What exactly do you mean by this @nsaman? A snippet of the relevant parts of your code would be helpful
So LambdaInvoke generates policies from the lambda function ARN. The function ARN is versioned because we want specific versions to be executed. Suppose our step function is long-running and points to lambda version 1. The role has permission to invoke v1. A CDK deployment updates the lambda version to 2, and the Step Function role is updated to allow invoking v2 instead. When an execution that is already running invokes lambda v1, it fails because the permissions were changed to v2.
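To make the sequence concrete, here is a minimal sketch that models the role before and after the deployment. It is not how IAM actually evaluates policies: `matchesResource` and `canInvoke` are hypothetical helpers that only imitate `*` wildcard matching on resource ARNs, and the ARN is a placeholder.

```typescript
// Hypothetical, simplified model of IAM resource matching: '*' matches any
// run of characters. Real IAM policy evaluation involves much more than this.
function matchesResource(pattern: string, arn: string): boolean {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`).test(arn);
}

// True if any resource entry in the (modelled) role policy covers the ARN.
function canInvoke(policyResources: string[], arn: string): boolean {
  return policyResources.some((pattern) => matchesResource(pattern, arn));
}

// Placeholder ARN for illustration, not taken from a real account.
const fn = "arn:aws:lambda:us-west-2:111122223333:function:MyFunction";

// Role as generated before the deployment: scoped to version 1.
const roleBefore = [`${fn}:1`];
// Role after the deployment: regenerated for version 2 only.
const roleAfter = [`${fn}:2`];

// An execution that started before the deployment still targets version 1.
const inFlightTarget = `${fn}:1`;

const allowedBefore = canInvoke(roleBefore, inFlightTarget);
const allowedAfter = canInvoke(roleAfter, inFlightTarget);
console.log({ allowedBefore, allowedAfter });
```

In this model the in-flight execution is allowed before the deployment and denied afterwards, which matches the AccessDeniedException reported above.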
There could be multiple solutions. One would be to add a function addTaskPolicy to update the read-only property taskPolicy in LambdaInvoke.
The solution for this will be to generate a policy that looks like:
Resource: [
'arn:aws:lambda:....:MyFunction',
'arn:aws:lambda:....:MyFunction:*',
]
It will go well with a change Lambda is about to make where invocations that involve Qualifiers also need to have the qualified ARN in the policy.
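As a rough sketch of how that resource list could be derived, the hypothetical helper below (`invokeResources` is not a CDK API) takes a concrete function ARN, strips any version or alias qualifier, and returns the unqualified ARN plus the `:*` form. It assumes a plain string ARN, so it would not work on unresolved CDK tokens, and the ARN shown is a placeholder.

```typescript
// Hypothetical helper, not part of aws-cdk: build the two policy resources
// (unqualified ARN and "all qualifiers" ARN) from a Lambda function ARN.
// Expected shape: arn:<partition>:lambda:<region>:<account>:function:<name>[:<qualifier>]
function invokeResources(functionArn: string): string[] {
  const parts = functionArn.split(":");
  const looksLikeFunctionArn =
    (parts.length === 7 || parts.length === 8) &&
    parts[0] === "arn" &&
    parts[2] === "lambda" &&
    parts[5] === "function";
  if (!looksLikeFunctionArn) {
    throw new Error(`Not a Lambda function ARN: ${functionArn}`);
  }
  // Drop the optional qualifier (version number or alias name).
  const unqualified = parts.slice(0, 7).join(":");
  return [unqualified, `${unqualified}:*`];
}

// Placeholder ARN for illustration only.
const resources = invokeResources(
  "arn:aws:lambda:us-west-2:111122223333:function:MyFunction:28"
);
console.log(resources);
```

A list like this could then be used as the `Resource` of a statement allowing `lambda:InvokeFunction`, so that both qualified and unqualified invocations stay authorized across deployments.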
Hi @nsaman and @mohitpali, apologies for the long delay on this. We've been looking at this recently and came to the conclusion that just providing additional permissions is not the right approach. When using versions in this scenario, there are quite a few things to consider, and none of them work automatically. The tl;dr is that when a new version is created, the previous version ceases to be managed by CDK/CFN. Deletion can be avoided by setting removal policies, but permissions would either have to be widely scoped (insecure) or maintained by hand (annoying). That said, the permission part is currently not easy to do.
Now to my actual question: The idiomatic way to do this in AWS is using Alias. Permissions are granted to the StepFunction to invoke the alias and when a new version is published the Alias gets updated. The StepFunction will always run the latest version and have the correct permissions.
Are there any reasons Aliases would not work in your scenario?
PS: We are still considering addTaskPolicy and other options to open up the generated policies.
Downgrading this to a p2. To provide access to all versions of a Lambda, one can do:
declare stepLambda: lambda.Function;
declare sfn: stepfunctions.StateMachine;
stepLambda.grantInvoke(sfn);
This is very idiomatic. For tightly scoped permissions, Lambda Alias should be used.
As far as I can see, the idea behind this (and the reason aliases do not work here, IMHO) is to make sure that an SFN execution that started with a reference to Lambda v1 will always use that code version, and not a newer version of the Lambda function that executions started later might use.
The reasoning is that the code might have a breaking change that does not work with the inputs in a previous step function definition.
Agreed, Aliases seem like the right solution at first glance from a permissions perspective but play havoc in the long term with your Step Function executions. Using a Lambda alias, a Step Function that is repeatedly invoking a Lambda will be shifted from running version X to version Y, causing all kinds of issues with backwards compatibility. This is unfortunately not documented well in the Step Function documentation but a common occurrence.
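The version shift can be illustrated with a small simulation. This is only a toy model under stated assumptions: `runExecution` is a hypothetical function, the alias name `live` is made up, and a "deployment" is modelled as repointing the alias from version 1 to version 2 after a given step.

```typescript
// Toy model: an execution invokes the same Lambda once per step. A deployment
// that repoints the hypothetical alias "live" from version 1 to version 2
// happens after `deployAfterStep`. Returns the version each step actually ran.
function runExecution(
  qualifier: string,
  steps: number,
  deployAfterStep: number
): number[] {
  const aliases = new Map<string, number>([["live", 1]]);
  const seen: number[] = [];
  for (let step = 1; step <= steps; step++) {
    // A numeric qualifier is a pinned version; anything else is an alias.
    const pinned = Number(qualifier);
    const version = Number.isInteger(pinned) ? pinned : aliases.get(qualifier);
    if (version === undefined) {
      throw new Error(`Unknown qualifier: ${qualifier}`);
    }
    seen.push(version);
    if (step === deployAfterStep) {
      aliases.set("live", 2); // the deployment moves the alias
    }
  }
  return seen;
}

const viaAlias = runExecution("live", 3, 1);
const viaVersion = runExecution("1", 3, 1);
console.log({ viaAlias, viaVersion });
```

In this model the alias-based execution runs version 1 for its first step and version 2 for the rest, while the pinned execution stays on version 1 throughout.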
The way to get around it is to always have Step Function definitions point to a specific version of a function so that they keep on invoking that until the Step Function is completed. The problem with this is that the generated policies hard code the version.
I feel like there should be a broader discussion around the idiomatic ways that Step Functions recommends Lambdas should be used. As I mentioned, the documentation is silent about this as a best practice or even that this is an issue that customers need to deal with. If there is agreement on 'the best way to invoke lambdas', maybe that can be documented and incorporated into CDK.
I just discovered this issue because I tried to use StepFunction Alias's deployment preference to implement traffic shaping.
const lambdaFunction = createFunction()
const lambdaInvoke = new LambdaInvoke(this, "Invoke", { lambdaFunction: lambdaFunction.currentVersion })
const stateMachine = new StateMachine(this, "StateMachine", { definitionBody: DefinitionBody.fromChainable(lambdaInvoke) })
const stateMachineVersion = new CfnStateMachineVersion(scope, "StateMachineVersion", {
stateMachineArn: stateMachine.stateMachineArn,
stateMachineRevisionId: stateMachine.stateMachineRevisionId,
});
const alias = new CfnStateMachineAlias(scope, "StateMachineAlias", {
name: "active",
deploymentPreference: {
stateMachineVersionArn: stateMachineVersion.attrArn,
type: "CANARY",
interval: Duration.hours(2).toMinutes(),
percentage: 10,
},
});
However, during deployments the state machine's policy is automatically updated to grant invoke on the lambda's current version, while 90% of traffic still uses the previous state machine version, which invokes the previous lambda version.
Maybe the ideal solution is for the invocation role and policy to be versioned along with the State Machine, but that goes beyond a CDK feature request.
In my case I'm happy to allow the state machine to invoke any version of the Lambdas. lambda-arn:* or lambda-arn should work. I think this could be made to work with a property on Function, Version, or LambdaInvoke to configure when qualified or unqualified permission is granted.