k8s-worker-pod-autoscaler
Unable to fetch queue messages
Hello. I installed the WPA using the script in hack/install.sh.
I am encountering the following error, which I believe is permissions- or namespace-related. I am running a v1.23 cluster in Amazon EKS.
E1103 15:50:14.855014 1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/myAccount/myQueueName
The WPA scaler runs in the kube-system namespace, and the WPA resource and example deployment run in a test namespace called eks-sample-app.
The WPA queueURI was configured manually using kubectl edit.
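As a quick sanity check on a hand-edited queueURI, the URL can be split into its region, account, and queue name to confirm each part is what you expect. This is only a sketch: `parse_sqs_queue_url` is a hypothetical helper, it assumes the `https://sqs.<region>.amazonaws.com/<account>/<queue>` form seen in the error above, and it is not a description of how WPA itself parses the URI.

```python
from urllib.parse import urlparse


def parse_sqs_queue_url(queue_url):
    """Split an SQS queue URL into (region, account, queue_name).

    Hypothetical helper; assumes the form
    https://sqs.<region>.amazonaws.com/<account>/<queue>.
    """
    parsed = urlparse(queue_url)
    host_parts = (parsed.hostname or "").split(".")
    # Drop empty segments caused by leading or trailing slashes.
    path_parts = [p for p in parsed.path.split("/") if p]
    if len(host_parts) < 4 or host_parts[0] != "sqs" or len(path_parts) != 2:
        raise ValueError(f"not a recognised SQS queue URL: {queue_url!r}")
    region = host_parts[1]
    account, queue_name = path_parts
    return region, account, queue_name


if __name__ == "__main__":
    print(parse_sqs_queue_url(
        "https://sqs.us-east-1.amazonaws.com/myAccount/myQueueName"))
```

A `ValueError` here would suggest a typo introduced during the manual edit, such as a missing account segment.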
$ k get pods -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
workerpodautoscaler-8667d55684-9zs6l   1/1     Running   0          72m
$ k get wpa -n eks-sample-app
NAME          AGE
example-wpa   85m
$ k get deployment -n eks-sample-app
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
example-deployment   1/1     1            1           7m34s
I have attached the following policy to the cluster service role.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "WPA",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "sqs:ReceiveMessage",
                "sqs:GetQueueAttributes"
            ],
            "Resource": "*"
        }
    ]
}
Any idea how to proceed in debugging this? I have looked through the documentation in the repo's README.md.
An unrelated note: what is the context for the WPA Controller section of the docs? In which context would workerpodautoscaler run be invoked? Is this a standalone binary?
Possible to share the complete log?
Hi @alok87: please see the example included in my issue. The WPA starts spitting out "Unable to fetch no of messages" errors as soon as the container starts. There are no other kinds of log messages.
Note that I have anonymized the account number and queue name.
E1104 12:42:26.926463 1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
E1104 12:42:26.926476 1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
E1104 12:42:26.926498 1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
E1104 12:42:26.926513 1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
E1104 12:42:26.926527 1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
E1104 12:42:26.926540 1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
E1104 12:42:26.926552 1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
Does the queue exist in SQS? Could you try using an SQS client with the same credentials and see whether data comes back?
I just want to rule out a configuration issue first.
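One way to run that check is a small script that asks SQS for the queue's approximate message count using the same credentials. The sketch below assumes the third-party boto3 package; `approximate_queue_length` is a hypothetical helper name, and the client is passed in so the function can be exercised without network access. It relies on the `get_queue_attributes` call, which needs the `sqs:GetQueueAttributes` permission.

```python
def approximate_queue_length(sqs_client, queue_url):
    """Return the approximate number of visible messages in the queue.

    sqs_client is expected to behave like boto3.client("sqs").
    """
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


# Example usage (requires boto3 and AWS credentials in the environment):
#   import boto3
#   client = boto3.client("sqs", region_name="us-east-1")
#   print(approximate_queue_length(client, "<queue URL>"))
```

If the call raises an access-denied error, that points to the credentials or policy. If it returns a count, the queue and permissions are at least good enough for this one read.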
I generated temporary credentials manually using AssumeRole. I believe it is working now.
Previously my node's role permissions included the following as per this policy:
"cloudwatch:GetMetricData"
"sqs:GetQueueAttributes"
"sqs:ReceiveMessage"
It was resolved by granting all read permissions on SQS:
"cloudwatch:GetMetricData"
"sqs:GetQueueAttributes"
"sqs:GetQueueUrl"
"sqs:ListDeadLetterSourceQueues"
"sqs:ListQueueTags"
"sqs:ListQueues"
"sqs:ReceiveMessage"
My restarted WPA no longer logs any errors.
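For anyone wanting to reproduce this, the expanded action list can be rendered as a policy document in the same shape as the one originally attached. This is a sketch only: `build_wpa_policy` is a hypothetical helper, the `Sid` and `Resource` values are copied from the original policy, and this thread does not establish which of the added actions was actually required.

```python
import json

# Actions reported in this thread as resolving the error.
WPA_ACTIONS = [
    "cloudwatch:GetMetricData",
    "sqs:GetQueueAttributes",
    "sqs:GetQueueUrl",
    "sqs:ListDeadLetterSourceQueues",
    "sqs:ListQueueTags",
    "sqs:ListQueues",
    "sqs:ReceiveMessage",
]


def build_wpa_policy(actions=WPA_ACTIONS, resource="*"):
    """Build an IAM policy document (as a dict) allowing the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WPA",
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_wpa_policy(), indent=2))
```

Passing a specific queue ARN as `resource` instead of `"*"` would narrow the grant, though some list-style actions may not accept a queue-level resource.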
Can we close this?
Do you think we should update anything about the policy in the docs? See https://github.com/practo/k8s-worker-pod-autoscaler#install
I feel like there may be something else missing.
Even though I get no permissions errors, I am unable to trigger a scaling operation on the deployment. Any ideas?
I have 10000+ messages in the queue and only one deployment pod running.
k get pods
NAME                                 READY   STATUS    RESTARTS   AGE
example-deployment-795d868d4-8nzfv   1/1     Running   0          7m19s
Does the WPA require some kind of write or tagging permissions?
I can submit a PR for the documentation once I confirm this is working.
WPA supports log verbosity; maybe try running it with -v=4.
- Also share the output of the WPA YAML:
k get wpa -o yaml <wpa_object>
- Check whether the deployment replicas change with the queue length.
- Check whether the queue length in AWS shows the 10000+ messages. A screenshot of the SQS metrics posted here would help.