k8s-worker-pod-autoscaler

Unable to fetch queue messages

Open heretogo opened this issue 2 years ago • 8 comments

Hello. I installed the WPA using the script in hack/install.sh.

I am encountering the following error, which I believe is permissions- or namespace-related. I am running a v1.23 cluster in Amazon EKS.

E1103 15:50:14.855014       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/myAccount/myQueueName

The WPA scaler runs in the kube-system namespace, and the WPA and example deployment run in a test namespace called eks-sample-app. The WPA queueURI was configured manually using kubectl edit.

$k get pods -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
workerpodautoscaler-8667d55684-9zs6l   1/1     Running   0          72m

$k get wpa -n eks-sample-app
NAME          AGE
example-wpa   85m

$k get deployment -n eks-sample-app
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
example-deployment            1/1     1            1           7m34s

I have attached the following policy to the cluster service role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "WPA",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "sqs:ReceiveMessage",
                "sqs:GetQueueAttributes"
            ],
            "Resource": "*"
        }
    ]
}

Any idea how to proceed in debugging this? I have looked through the documentation in the repo's README.md.

An unrelated note: what is the context for the WPA Controller section of the docs? In which context would workerpodautoscaler run be invoked? Is this a standalone binary?

heretogo avatar Nov 03 '22 16:11 heretogo

Possible to share the complete log?

alok87 avatar Nov 04 '22 06:11 alok87

Hi @alok87 : please see the example included in my issue. The WPA starts spitting out Unable to fetch no of messages messages as soon as the container starts. There are no other kinds of log messages.

Note that I have anonymized the account number and queue name.

E1104 12:42:26.926463       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926476       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926498       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926513       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926527       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926540       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926552       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue

heretogo avatar Nov 04 '22 12:11 heretogo

Does the queue exist in SQS? Could you try an SQS client with the same credentials and see whether data comes back?

I just want to rule out a configuration issue first.
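
A minimal sketch of such a check, assuming Python with the third-party boto3 package and the same credentials available in the environment. The helper names region_from_queue_url and fetch_queue_depth are hypothetical and are not part of WPA; the sketch only queries queue attributes, the same kind of read the sqs:GetQueueAttributes permission covers.

```python
from urllib.parse import urlparse


def region_from_queue_url(queue_url: str) -> str:
    """Extract the AWS region from a standard SQS queue URL.

    Assumes the usual host form sqs.<region>.amazonaws.com.
    """
    host = urlparse(queue_url).hostname or ""
    parts = host.split(".")
    if len(parts) < 4 or parts[0] != "sqs":
        raise ValueError(f"not a standard SQS queue URL: {queue_url!r}")
    return parts[1]


def fetch_queue_depth(queue_url: str) -> dict:
    """Return the approximate visible and in-flight message counts.

    Requires boto3 and AWS credentials; raises a botocore ClientError
    if the credentials lack permission or the queue does not exist.
    """
    import boto3  # third-party; imported lazily so the helper above works without it

    client = boto3.client("sqs", region_name=region_from_queue_url(queue_url))
    resp = client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    # SQS returns attribute values as strings; convert them to integers.
    return {name: int(value) for name, value in resp["Attributes"].items()}
```

Calling fetch_queue_depth with the queue URL from the WPA spec should either return the counts or fail with an error that names the missing permission or the missing queue.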

alok87 avatar Nov 04 '22 12:11 alok87

I generated temporary credentials manually using AssumeRole. I believe it is working now.

Previously my node's role permissions included the following as per this policy:

"cloudwatch:GetMetricData"
"sqs:GetQueueAttributes"
"sqs:ReceiveMessage"

It was resolved by granting all read permissions on SQS:

"cloudwatch:GetMetricData"
"sqs:GetQueueAttributes"
"sqs:GetQueueUrl"
"sqs:ListDeadLetterSourceQueues"
"sqs:ListQueueTags"
"sqs:ListQueues"
"sqs:ReceiveMessage"

My restarted WPA no longer logs any errors.
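
For reference, a small sketch that renders the widened permission list as a full policy document in the same shape as the original one. It assumes the "WPA" Sid and the "*" resource carry over unchanged; build_policy is a hypothetical helper, and this thread does not establish which of the added actions was actually required.

```python
import json

# Actions from the original policy in this issue.
ORIGINAL_ACTIONS = [
    "cloudwatch:GetMetricData",
    "sqs:GetQueueAttributes",
    "sqs:ReceiveMessage",
]

# Read actions added afterwards; which one fixed the error is not known.
ADDED_READ_ACTIONS = [
    "sqs:GetQueueUrl",
    "sqs:ListDeadLetterSourceQueues",
    "sqs:ListQueueTags",
    "sqs:ListQueues",
]


def build_policy(actions, sid="WPA"):
    """Build an IAM policy document allowing the given actions on all resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": sid,
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": "*",  # assumption: copied from the original policy
            }
        ],
    }


policy = build_policy(ORIGINAL_ACTIONS + ADDED_READ_ACTIONS)
print(json.dumps(policy, indent=4))
```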

heretogo avatar Nov 14 '22 20:11 heretogo

Can we close this?

alok87 avatar Nov 15 '22 07:11 alok87

Do you think we should update something about the policy in the docs here? https://github.com/practo/k8s-worker-pod-autoscaler#install

alok87 avatar Nov 15 '22 07:11 alok87

I feel like there may be something else missing.

Even though I get no permissions errors, I am unable to trigger a scaling operation on the deployment. Any ideas?

I have 10000+ messages in the queue and only one deployment pod running.

k get pods
NAME                                 READY   STATUS    RESTARTS   AGE
example-deployment-795d868d4-8nzfv   1/1     Running   0          7m19s

Does the WPA require some kind of write or tag attributes?

I can submit a PR for the documentation once I confirm this is working.

heretogo avatar Nov 15 '22 20:11 heretogo

WPA supports verbose logging; maybe try running it with -v=4.

  • Also share the WPA object's YAML:

k get wpa -o yaml <wpa_object>

  • Check whether the deployment's replica count changes with the queue length.
  • Check whether the queue length in AWS actually shows the 10000+ messages. A screenshot of the SQS metrics posted here would help.

alok87 avatar Nov 17 '22 06:11 alok87