Bug: Docker image-based Lambda failures
Description:
sam deploy intermittently fails while creating Docker image-based Lambdas.
Steps to reproduce:
```shell
sam deploy \
  --stack-name ${STACK_NAME} \
  --capabilities CAPABILITY_IAM \
  --no-fail-on-empty-changeset \
  --resolve-s3 \
  --parameter-overrides REDACTED \
  --image-repositories bigDumperLambda=${bigDumperRepoUri} \
  --image-repositories bqLoaderLambda=${bqLoaderRepoUri} \
  --image-repositories littleCheckerLambda=${littleCheckerRepoUri} \
  --image-repositories littleDumperLambda=${littleDumperRepoUri} \
  --image-repositories publisherLambda=${publisherRepoUri} \
  --image-repositories jobCheckerLambda=${jobCheckerRepoUri} \
  --image-repositories tableMakerLambda=${tableMakerRepoUri} \
  --tags exd_version=${EXD_VERSION}
```
Observed result:
Error message from CloudFormation:

```
Resource handler returned message: "Lambda does not have permission to access the ECR image. Check the ECR permissions. (Service: Lambda, Status Code: 403, Request ID: 3afd69aa-201d-4f73-a500-e739b9bee696) (SDK Attempt Count: 1)" (RequestToken: 35c52deb-ee89-080f-0a66-e94ef9fb4f8e, HandlerErrorCode: AccessDenied)
```
When I inspect the ECR repository in question, I find that any preexisting repository policy has been removed.
I can usually resolve this by rerunning the sam deploy command.
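Until the underlying race is fixed, the rerun workaround can be automated. A minimal retry sketch (the retry count, delay, and command list are illustrative, not part of the report):

```python
import subprocess
import time

def deploy_with_retry(cmd, attempts=3, delay=10):
    """Run a flaky CLI command, retrying on a nonzero exit code.

    Returns the attempt number that succeeded, or raises after
    exhausting all attempts.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(f"command failed after {attempts} attempts")

# Example (arguments abbreviated):
# deploy_with_retry(["sam", "deploy", "--stack-name", "my-stack"])
```

This only papers over the intermittent 403; a failed attempt still leaves the stack in whatever state CloudFormation rolled it back to.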
Expected result:
Lambdas deploy without errors on the first try.
Additional environment details (Ex: Windows, Mac, Amazon Linux etc):
- OS: Linux
- sam --version: 1.142.1
- AWS region: us-east-1
@jack-e-tabaska This looks like a classic race condition between CloudFormation creating the Lambda functions and the ECR repository permissions being set up. The fact that the repository policy is completely removed is particularly telling: it suggests SAM is overwriting, or racing with, the permission setup during deployment.
When deploying multiple Docker image-based Lambda functions simultaneously (you have seven), there may be a timing issue: SAM or CloudFormation grants the Lambda service permission to pull from ECR, but something in the deployment process either removes those permissions prematurely or does not wait for them to propagate before creating the functions. The intermittent nature of the failure, where rerunning fixes it, strongly points to a timing/race condition rather than a configuration issue.
I suspect this is happening in the SAM CLI's image-repository handling, possibly in how it manages ECR repository policies when multiple Lambda functions reference the same or different repositories. When SAM sets up the policies for several repositories, one operation may overwrite another, or the permissions may not be applied atomically across all the image repositories you specify.
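For reference, the repository policy statement Lambda needs in order to pull an image grants `ecr:BatchGetImage` and `ecr:GetDownloadUrlForLayer` to the `lambda.amazonaws.com` service principal. A sketch of that statement plus a check you could run against a live policy after a failed deploy (the `Sid` and helper name are illustrative, and the exact statement SAM writes may differ):

```python
import json

# Illustrative cross-service statement; the Sid is an assumption.
LAMBDA_PULL_STATEMENT = {
    "Sid": "LambdaECRImageRetrievalPolicy",
    "Effect": "Allow",
    "Principal": {"Service": "lambda.amazonaws.com"},
    "Action": ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"],
}

def lambda_can_pull(policy_text: str) -> bool:
    """Check whether a repository policy document still contains an
    Allow statement for the Lambda service principal."""
    policy = json.loads(policy_text)
    for stmt in policy.get("Statement", []):
        service = stmt.get("Principal", {}).get("Service", "")
        services = [service] if isinstance(service, str) else service
        if stmt.get("Effect") == "Allow" and "lambda.amazonaws.com" in services:
            return True
    return False
```

You could feed this the output of `aws ecr get-repository-policy --repository-name <repo>` (the policy text is in the `policyText` field) to confirm whether the grant was stripped on a failed run.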
This appears to be a bug in SAM's deployment orchestration rather than a problem with your configuration: your --image-repositories flags correctly map each function to its repository. Could you share your template.yaml so we can see how these Lambda functions are defined? In particular, I'm curious whether they have explicit IAM roles or use SAM-generated roles, as that may affect how the ECR permissions are managed during deployment.
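To be concrete about what I'm asking for: an Image-packaged function in a SAM template typically looks like the sketch below (the logical ID, context path, and optional explicit Role are assumptions, not taken from your template):

```yaml
Resources:
  BigDumperLambda:
    Type: AWS::Serverless::Function
    Properties:
      PackageType: Image
      # Role: !GetAtt ExplicitLambdaRole.Arn   # present only if you define your own role
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: ./big_dumper
```

Seeing whether your functions include an explicit `Role` (and whether several functions share one) would help narrow down where the policy race occurs.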