one-observability-demo C9DiskResize keeps failing


Failed to create resource. An error occurred (Unavailable) when calling the ModifyVolume operation (reached max retries: 4): The service is unavailable. Please try again shortly.

Feb 08 '21 23:02 aws-ps-hobson

Did this error get resolved at all?

Apr 06 '21 01:04 awsimaya

Still a recurring problem: the Lambda times out after reaching the maximum retries, as this will be different for each account.

Apr 27 '21 05:04 edwio

In which region are you deploying the Cloud9 stack? Can you check the CloudWatch logs for that Lambda function and post them?

Apr 27 '21 12:04 rafaelpereyra

eu-west-1. Here is the error from the log of the C9DiskResizeLambda function:

{
    "timestamp": "2021-04-26 19:54:45,192",
    "level": "DEBUG",
    "location": "crhelper.utils._send_response:19",
    "RequestType": "Create",
    "StackId": "arn:aws:cloudformation:eu-west-1:443682937418:stack/C9-Observability-Workshop/deae7fa0-a6c8-11eb-a1d8-0ad741ae72c5",
    "RequestId": "c6b84ba4-5c78-45f5-9b42-92c27540ab77",
    "LogicalResourceId": "C9DiskResize",
    "aws_request_id": "d373d7c2-aa10-4434-8e8a-d2b49d0e741e",
    "message": {
        "Status": "FAILED",
        "PhysicalResourceId": "C9-Observability-Workshop_C9DiskResize_NPBB7K5V",
        "StackId": "arn:aws:cloudformation:eu-west-1:443682937418:stack/C9-Observability-Workshop/deae7fa0-a6c8-11eb-a1d8-0ad741ae72c5",
        "RequestId": "c6b84ba4-5c78-45f5-9b42-92c27540ab77",
        "LogicalResourceId": "C9DiskResize",
        "Reason": "An error occurred (Unavailable) when calling the ModifyVolume operation (reached max retries: 4): The service is unavailable. Please try again shortly.",
        "Data": {}
    }
}

Also, the manual option for deploying the lab without Cloud9 isn't working correctly. There seem to be some prerequisites, such as adding permissions to S3, and the envsetup.sh script fails because the following commands are not installed:

pip
npm
git

May 14 '21 09:05 edwio

Hello, looks like the C9 instance is not ready for the automation to execute.

Can you manually create a Cloud9 Instance and attach the Instance role to it via the AWS Console?

Regarding your second comment, the script is designed to run in Cloud9, where all those applications are already installed (pip, npm, git). Are you running the script from your local machine?

May 14 '21 13:05 rafaelpereyra

I tried your suggestion and manually created the Cloud9 instance. Everything seemed to work until I ran the last command in the Deploy the stack section. I'm getting an error when running 'cdk deploy Applications --require-approval never':

Received response status [FAILED] from custom resource. Message returned: Error: b'serviceaccount/petsite-sa created\nservice/service-petsite created\ndeployment.apps/petsite-deployment created\nError from server (InternalError): error when creating "/tmp/manifest.yaml": Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"\n' Logs: /aws/lambda/Applications-ApplicationsMyCluster-Handler886CB40B-P7J44HT23PXJ at invokeUserFunction (/var/task/framework.js:95:19) at process._tickCallback (internal/process/next_tick.js:68:7) (RequestId: 65c72463-1979-428a-aa9f-4dd4c328fb5e)

I have added the logs of /aws/lambda/Applications-ApplicationsMyCluster-Handler886CB40B-YWFJYW1ENTN3:

[ERROR] Exception: b'Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": serviceaccounts "petsite-sa" already exists\nError from server (Invalid): error when creating "/tmp/manifest.yaml": Service "service-petsite" is invalid: spec.ports[0].nodePort: Invalid value: 30300: provided port is already allocated\nError from server (AlreadyExists): error when creating "/tmp/manifest.yaml": deployments.apps "petsite-deployment" already exists\nError from server (InternalError): error when creating "/tmp/manifest.yaml": Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"\n'
Traceback (most recent call last):
  File "/var/task/index.py", line 14, in handler
    return apply_handler(event, context)
  File "/var/task/apply/__init__.py", line 60, in apply_handler
    kubectl('create', manifest_file, *kubectl_opts)
  File "/var/task/apply/__init__.py", line 87, in kubectl
    raise Exception(output)

May 14 '21 14:05 edwio

Can you check the pods running in your cluster? Looks like the Helm chart for the AWS Load Balancer controller was not deployed properly (webhook is not available).

May 14 '21 16:05 rafaelpereyra

Seems to be running fine on the ECS side:

Clusters:

Task Definitions:

But I don't see any pods running in EKS:

Am I missing something?

May 15 '21 18:05 edwio

By default, even if the role you're using is Admin of the account, you won't have enough permissions in the Kubernetes RBAC to see that dashboard (hence the message).

We added some instructions for adding your role to the RBAC so that you get access to the EKS console here.

You should however be allowed to list the pods using kubectl from the Cloud9 environment. Can you do that please and check if the AWS Load balancer is running?
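A minimal sketch of that check from the Cloud9 terminal, assuming kubectl already points at the workshop cluster and the controller lives in the kube-system namespace (the namespace used later in this thread). The helper name `not_running` is made up for illustration; it only filters the output of `kubectl get pods`.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the header line plus every pod whose STATUS column is not "Running".
not_running() {
  awk 'NR == 1 || $3 != "Running"'
}

# Usage (assumes kubectl is configured for the workshop cluster):
#   kubectl get pods -n kube-system | not_running
```

A pod stuck in a state such as ImagePullBackOff or Pending would show up here; `kubectl describe pod <name> -n kube-system` on that pod then lists the events behind it.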

May 17 '21 13:05 rafaelpereyra

How do I find my value for CONSOLE_ROLE_ARN=<Enter your Role ARN>?

Regarding the EC2 load balancers, they are in the active state:

May 18 '21 06:05 edwio

That is the ARN of the role you use to connect to the AWS Console.
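One way to look it up, as a sketch: `aws sts get-caller-identity` reports the identity behind the current credentials, and when run from CloudShell those are the credentials of your console session. For an assumed role it returns an STS ARN with a session name, so the hypothetical helper below (`role_arn_from_caller` is not part of the workshop) rewrites it into the IAM role form. It assumes the role has no IAM path; a role created with a path needs that path added back by hand.

```shell
# Hypothetical helper: turn an STS assumed-role ARN into the matching IAM role ARN.
# Any other ARN (for example an IAM user ARN) is printed unchanged.
role_arn_from_caller() {
  printf '%s\n' "$1" |
    sed -E 's#^arn:(aws[a-z-]*):sts::([0-9]+):assumed-role/([^/]+)/.*$#arn:\1:iam::\2:role/\3#'
}

# Usage (run where the AWS CLI uses your console role, e.g. CloudShell):
#   CONSOLE_ROLE_ARN="$(role_arn_from_caller "$(aws sts get-caller-identity --query Arn --output text)")"
#   echo "$CONSOLE_ROLE_ARN"
```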

Load balancers are created by the CDK Services stack, but inside your EKS cluster there is a component deployed (AWS Load Balancer) that is failing according to the log message you sent.

May 18 '21 13:05 rafaelpereyra

How can I fix that (AWS Load Balancer)?

Furthermore, what is the difference between envsetup.sh and envsetup_ee.sh, and which one do I need to run when using C9 manually?

May 18 '21 18:05 edwio

We need to see why it is failing to help you with that.

For the bash script, just follow the instructions here:

https://observability.workshop.aws/en/installation/not_using_ee/_deploy_app.html#install-tools-and-clone-the-repository

The second script (_ee) is used for Event Engine.

May 18 '21 19:05 rafaelpereyra

Where can I see why it is failing?

May 19 '21 07:05 edwio

You'll see the logs in the EKS cluster using the kubectl tool.

Let's try a different approach: can you tear down that environment completely and start from scratch? Please just create the C9 environment first from CloudShell as explained here. Please copy and paste the whole list of commands:

curl -O https://raw.githubusercontent.com/aws-samples/one-observability-demo/main/cloud9-cfn.yaml

aws cloudformation create-stack --stack-name C9-Observability-Workshop --template-body file://cloud9-cfn.yaml --capabilities CAPABILITY_NAMED_IAM

aws cloudformation wait stack-create-complete --stack-name C9-Observability-Workshop

echo -e "Cloud9 Instance is Ready!!\n\n"

Are there any SCPs applied to your account, or is this a personal account? Does your role have Admin access to the environment? (I'm looking at reasons why the C9 launch / resize would have failed in the first place.)

May 19 '21 10:05 rafaelpereyra

Found the problem: the pod running the AWS Load Balancer is pulling its image from an ECR in the us-west-2 region, which is not accessible to us due to our organization policy (eu- only).

Running kubectl describe pods against the other pods, I can see that all of their images are being pulled from ECR in the eu- region.
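The same comparison can be sketched without reading each `kubectl describe` output by hand. The helper below is hypothetical (`ecr_images_outside` is a made-up name): it reads tab-separated "pod, image" lines and prints those whose ECR image sits in a region that does not start with the given prefix. It only looks at one image per line, so the usage line lists just the first container of each pod.

```shell
# Hypothetical helper: read "pod<TAB>image" lines on stdin and print those whose
# image is hosted in ECR in a region that does not start with the given prefix.
ecr_images_outside() {
  awk -F'\t' -v prefix="$1" '
    $2 ~ /\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com\// {
      region = $2
      sub(/^.*\.dkr\.ecr\./, "", region)        # drop everything before the region
      sub(/\.amazonaws\.com\/.*$/, "", region)  # drop everything after the region
      if (index(region, prefix) != 1) print
    }'
}

# Usage (first container of each pod only; assumes kubectl is configured):
#   kubectl get pods -n kube-system \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' \
#     | ecr_images_outside eu-
```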

How come the AWS Load Balancer image is being pulled from a different region?

Also, is it possible to edit the YAML file and specify an ECR in the eu- region for the AWS Load Balancer?

May 19 '21 17:05 edwio

We're installing AWS Load Balancer in CDK using the project Helm chart here.

The default image is configured here.

Your organization's current policy prevents pulling images across regions, so I suggest changing the Helm chart default value in your local CDK file (see the link in the first paragraph) to point image.repository at an ECR image path that is allowed inside your organization.

The image is not available in eu-west-1, so you will probably need to pull it from us-west-2 and push it into your own ECR.
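A rough sketch of that mirroring step, under several assumptions: it runs on a machine that is allowed to reach us-west-2, Docker is already logged in to both registries, and the destination repository already exists in your ECR. The helper `mirror_target` and the `<account-id>` placeholder are illustrative only; the helper just computes the destination image reference by keeping the repository path and tag.

```shell
# Hypothetical helper: given a source image and a destination registry host,
# print the destination image reference (same repository path and tag).
mirror_target() {
  printf '%s/%s\n' "$2" "${1#*/}"
}

# Usage (placeholders in <...> must be replaced; requires docker and registry logins):
#   SRC=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3
#   DST="$(mirror_target "$SRC" "<account-id>.dkr.ecr.eu-west-1.amazonaws.com")"
#   docker pull "$SRC" && docker tag "$SRC" "$DST" && docker push "$DST"
```

The repository part of the destination (everything before the tag) is then the value to use for image.repository.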

May 19 '21 18:05 rafaelpereyra

@rafaelpereyra when trying to edit the image key, specified in the deployment of the aws-load-balancer,

From:

602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3

To:

602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3

By using the following command :

kubectl edit deployment servicesawsloadbalancercontroller2049c530-aws-load-balancer-con -n kube-system

I'm getting the following error: kubectl Edit cancelled, no changes made

Jun 28 '21 10:06 edwio

Hello,

Looks like an issue with the editor you're using, maybe not saving the changes. Please use this instead:

kubectl set image deployment/servicesawsloadbalancercontrollerXXXXXX-aws-load-balancer-con -n kube-system aws-load-balancer-controller=602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.2.1

Jul 01 '21 14:07 rafaelpereyra

Related to the original issue with C9DiskResize:

Stumbled upon this today. The resolution, in my case, was to temporarily set default-ebs-encryption to false in the EC2 console.

(This was set to true on my account)
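For anyone who prefers the CLI to the console, a sketch of the same toggle. The AWS CLI has `aws ec2 get-ebs-encryption-by-default`, `disable-ebs-encryption-by-default` and `enable-ebs-encryption-by-default`; the setting is per region, so the region is passed explicitly. The `run` and `set_default_ebs_encryption` wrappers and the `DRY_RUN` variable are hypothetical conveniences, not part of the workshop.

```shell
# Hypothetical wrapper: print the command when DRY_RUN is set, otherwise run it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Turn default EBS encryption off or on for a single region.
set_default_ebs_encryption() {
  case "$1" in
    off) run aws ec2 disable-ebs-encryption-by-default --region "$2" ;;
    on)  run aws ec2 enable-ebs-encryption-by-default --region "$2" ;;
    *)   echo "usage: set_default_ebs_encryption on|off REGION" >&2; return 2 ;;
  esac
}

# Usage (region is an example; use the one you deploy the Cloud9 stack to):
#   aws ec2 get-ebs-encryption-by-default --region eu-west-1   # check current value
#   set_default_ebs_encryption off eu-west-1                   # before creating the stack
#   set_default_ebs_encryption on eu-west-1                    # restore afterwards
```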

Oct 18 '21 10:10 engrun
