one-observability-demo
C9DiskResize keeps failing
Failed to create resource. An error occurred (Unavailable) when calling the ModifyVolume operation (reached max retries: 4): The service is unavailable. Please try again shortly.

Did this error get resolved at all?
Still a recurring problem; the Lambda times out after reaching the maximum retries. This will be different for each account.

What region are you deploying the Cloud9 stack in? Can you check the CloudWatch logs for that Lambda function and post them?
eu-west-1. Here is the error from the log of the C9DiskResizeLambda function:
{
  "timestamp": "2021-04-26 19:54:45,192",
  "level": "DEBUG",
  "location": "crhelper.utils._send_response:19",
  "RequestType": "Create",
  "StackId": "arn:aws:cloudformation:eu-west-1:443682937418:stack/C9-Observability-Workshop/deae7fa0-a6c8-11eb-a1d8-0ad741ae72c5",
  "RequestId": "c6b84ba4-5c78-45f5-9b42-92c27540ab77",
  "LogicalResourceId": "C9DiskResize",
  "aws_request_id": "d373d7c2-aa10-4434-8e8a-d2b49d0e741e",
  "message": {
    "Status": "FAILED",
    "PhysicalResourceId": "C9-Observability-Workshop_C9DiskResize_NPBB7K5V",
    "StackId": "arn:aws:cloudformation:eu-west-1:443682937418:stack/C9-Observability-Workshop/deae7fa0-a6c8-11eb-a1d8-0ad741ae72c5",
    "RequestId": "c6b84ba4-5c78-45f5-9b42-92c27540ab77",
    "LogicalResourceId": "C9DiskResize",
    "Reason": "An error occurred (Unavailable) when calling the ModifyVolume operation (reached max retries: 4): The service is unavailable. Please try again shortly.",
    "Data": {}
  }
}
Also, the manual option for deploying the lab (instead of Cloud9) isn't working correctly. It seems there are some prerequisites, like adding permissions to S3, and the envsetup.sh script is failing because the following commands are not installed:
- pip
- npm
- git
Hello, it looks like the C9 instance is not ready for the automation to execute.
Can you manually create a Cloud9 instance and attach the instance role to it via the AWS Console?
Regarding your second comment, the script is designed to run in Cloud9, where all those applications (pip, npm, git) are already installed. Are you running the script from your local machine?
I tried your suggestion and manually created Cloud9. Everything seemed to be working until I ran the last command in the Deploy the stack section; I'm getting an error when running 'cdk deploy Applications --require-approval never':
Received response status [FAILED] from custom resource. Message returned: Error: b'serviceaccount/petsite-sa created\nservice/service-petsite created\ndeployment.apps/petsite-deployment created\nError from server (InternalError): error when creating "/tmp/manifest.yaml": Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"\n' Logs: /aws/lambda/Applications-ApplicationsMyCluster-Handler886CB40B-P7J44HT23PXJ at invokeUserFunction (/var/task/framework.js:95:19) at process._tickCallback (internal/process/next_tick.js:68:7) (RequestId: 65c72463-1979-428a-aa9f-4dd4c328fb5e)


I have added the logs of /aws/lambda/Applications-ApplicationsMyCluster-Handler886CB40B-YWFJYW1ENTN3:
[ERROR] Exception: b'Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": serviceaccounts "petsite-sa" already exists\nError from server (Invalid): error when creating "/tmp/manifest.yaml": Service "service-petsite" is invalid: spec.ports[0].nodePort: Invalid value: 30300: provided port is already allocated\nError from server (AlreadyExists): error when creating "/tmp/manifest.yaml": deployments.apps "petsite-deployment" already exists\nError from server (InternalError): error when creating "/tmp/manifest.yaml": Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"\n'
Traceback (most recent call last):
  File "/var/task/index.py", line 14, in handler
    return apply_handler(event, context)
  File "/var/task/apply/__init__.py", line 60, in apply_handler
    kubectl('create', manifest_file, *kubectl_opts)
  File "/var/task/apply/__init__.py", line 87, in kubectl
    raise Exception(output)
Can you check the pods running in your cluster? Looks like the Helm chart for the AWS Load Balancer controller was not deployed properly (webhook is not available).
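For reference, something like this from Cloud9 should show whether the controller pods and the webhook service are healthy (the service name is taken from the error message above; your pod names will differ):
kubectl get pods -n kube-system
kubectl get endpoints aws-load-balancer-webhook-service -n kube-system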
Seems to be running fine on the ECS side:
Clusters: (screenshot)
Task Definitions: (screenshot)
But I don't see any pods running in EKS: (screenshot)
Am I missing something?
By default, even if the role you're using is an Admin of the account, you won't have enough permissions in the Kubernetes RBAC to see that dashboard (hence the message).
We added some instructions on adding your role to the RBAC in order to give you access to the EKS Console here.
You should, however, be able to list the pods using kubectl from the Cloud9 environment. Can you please do that and check whether the AWS Load Balancer Controller is running?
How do I find my value for CONSOLE_ROLE_ARN=<Enter your Role ARN>?
Regarding the EC2 load balancers, they are in an active state: (screenshot)
That is the ARN of the role you use to connect to the AWS Console.
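If you're unsure which principal you're using, a quick way to check from Cloud9 or CloudShell (a sketch; the exact output depends on how you log in):
aws sts get-caller-identity --query Arn --output text
If that returns an assumed-role ARN like arn:aws:sts::<account-id>:assumed-role/<RoleName>/<session>, the role ARN to use is arn:aws:iam::<account-id>:role/<RoleName>.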
Load balancers are created by the CDK Services stack, but inside your EKS cluster there is a component deployed (the AWS Load Balancer Controller) that is failing, according to the log message you sent.
How can I fix that (AWS Load Balancer)?
Furthermore, what is the difference between envsetup.sh and envsetup_ee.sh, and which one do I need to run when using Cloud9 manually?
We need to see the reason why it is failing in order to help you with that.
For the bash script, just follow the instructions here:
https://observability.workshop.aws/en/installation/not_using_ee/_deploy_app.html#install-tools-and-clone-the-repository
The second script (_ee) is used for Event Engine.
Where can I see why it is failing?
You'll see the logs in the EKS cluster using the kubectl tool.
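For example (a sketch; substitute the pod and deployment names from your own cluster):
kubectl describe pod <pod-name> -n kube-system
kubectl logs -n kube-system deployment/<aws-load-balancer-controller-deployment>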
Let's try a different approach: can you tear down that environment completely and start from scratch? Please just create the C9 environment first from CloudShell, as explained here, copying and pasting the whole list of commands:
curl -O https://raw.githubusercontent.com/aws-samples/one-observability-demo/main/cloud9-cfn.yaml
aws cloudformation create-stack --stack-name C9-Observability-Workshop --template-body file://cloud9-cfn.yaml --capabilities CAPABILITY_NAMED_IAM
aws cloudformation wait stack-create-complete --stack-name C9-Observability-Workshop
echo -e "Cloud9 Instance is Ready!!\n\n"
Are there any SCPs applied to your account, or is this a personal account? Does your role have Admin access to the environment? (Looking for reasons why the C9 launch / resize would have failed in the first place.)
Found the problem: the pod running the AWS Load Balancer controller is pulling its image from an ECR registry in the us-west-2 region, which is not accessible for us due to our organization policy (eu- regions only).
Running kubectl describe pods against the other pods, I can see that all the other pods' images are being pulled from ECR in an eu- region.
How come the AWS Load Balancer image is being pulled from a different region?
Also, is it possible to edit the YAML file to specify an ECR registry in an eu- region for the AWS Load Balancer?
We're installing the AWS Load Balancer Controller in CDK using the project's Helm chart here.
The default image is configured here.
The current policy in your organization is preventing cross-region pulls, so I'd suggest changing the Helm chart default in your local CDK file (see the link in the first paragraph) by setting the value image.repository to an ECR image path that is allowed inside your organization.
The image is not available in eu-west-1, so you'll probably need to pull it from us-west-2 and push it into your own ECR.
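A minimal sketch of that pull/push, assuming Docker is available and a repository named aws-load-balancer-controller already exists in your eu-west-1 account (replace <account-id> with yours):
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 602401143452.dkr.ecr.us-west-2.amazonaws.com
docker pull 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3
docker tag 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3 <account-id>.dkr.ecr.eu-west-1.amazonaws.com/aws-load-balancer-controller:v2.1.3
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.eu-west-1.amazonaws.com
docker push <account-id>.dkr.ecr.eu-west-1.amazonaws.com/aws-load-balancer-controller:v2.1.3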
@rafaelpereyra When trying to change the image key specified in the aws-load-balancer deployment,
From:
602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3
To:
602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3
By using the following command :
kubectl edit deployment servicesawsloadbalancercontroller2049c530-aws-load-balancer-con -n kube-system
I'm getting the following error from kubectl: Edit cancelled, no changes made
Hello,
Looks like an issue with the editor you're using, maybe not saving the changes. Please use this instead:
kubectl set image deployment/servicesawsloadbalancercontrollerXXXXXX-aws-load-balancer-con -n kube-system aws-load-balancer-controller=602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.2.1
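After that, you can verify the new image rolled out (a sketch; use your actual deployment name):
kubectl rollout status deployment/servicesawsloadbalancercontrollerXXXXXX-aws-load-balancer-con -n kube-system
kubectl get pods -n kube-system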
Related to the original issue with C9DiskResize:
I stumbled upon this today.
The resolution, in my case, was to temporarily set default-ebs-encryption to false in the EC2 console.
(This was set to true on my account)
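For anyone hitting the same thing, the setting can also be checked and toggled from the CLI (standard EC2 API calls; re-enable it afterwards if your account requires encryption):
aws ec2 get-ebs-encryption-by-default --region eu-west-1
aws ec2 disable-ebs-encryption-by-default --region eu-west-1
aws ec2 enable-ebs-encryption-by-default --region eu-west-1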