aws-otel-community
AccessDeniedException when using ADOT with an EKS cluster
I'm receiving an error with a basic setup of ADOT, so I'm probably missing something. I just created a new EKS cluster and added ADOT as an add-on. The next step was to add a ClusterConfig like this:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: develop
  region: us-east-1
iam:
  withOIDC: true
  serviceAccounts:
  - metadata:
      name: adot-collector
      namespace: testnamespace
    attachPolicyARNs:
    - "arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess"
    - "arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess"
    - "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
After that, I created the following OpenTelemetryCollector using sidecar mode:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: develop-collector-xray
spec:
  mode: sidecar
  resources:
    requests:
      cpu: "1"
    limits:
      cpu: "1"
  serviceAccount: adot-collector
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
    exporters:
      logging:
        loglevel: debug
      awsxray:
        region: 'us-east-1'
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [awsxray]
      telemetry:
        logs:
          level: debug
I added the annotation
sidecar.opentelemetry.io/inject: "true"
to my pod definition. I started the application using the java agent and passing the required env variables
ENV OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
ENV OTEL_RESOURCE_ATTRIBUTES=service.namespace=test-be,service.name=test-be
ENV AWS_REGION=us-east-1
ENV OTEL_METRICS_EXPORTER=otlp
CMD java -javaagent:/app/bin/aws-opentelemetry-agent.jar -jar /app/bin/registry.jar
Once started, I can see the injected sidecar container, but tracing doesn't work, and in the logs I can see the following error:
2023-01-26T12:05:36.665Z debug [email protected]/awsxray.go:70 response error {"kind": "exporter", "data_type": "traces", "name": "awsxray", "error": "AccessDeniedException: \n\tstatus code: 403, request id: c3c8ff28-18c5-4c2c-a5b5-e48b93b020c4"}
2023-01-26T12:05:36.665Z debug [email protected]/awsxray.go:74 response: {
} {"kind": "exporter", "data_type": "traces", "name": "awsxray"}
I'm probably missing some authorization somewhere, but I have no idea where, because I followed the official guide:
https://docs.aws.amazon.com/eks/latest/userguide/opentelemetry.html
Any ideas?
Thanks
Looks like IRSA is not working as expected. You want to make sure that the service account adot-collector that you created via eksctl is in the same namespace as the ADOT collector.
Yes I already checked, it's in the same namespace.
The only additional thing (I don't know if it could be a problem) is that I have a constraint in my account: I must add a permission boundary to every IAM role. Of course I added it to the ClusterConfig, otherwise I could not create the role (I left it out of the example above). I don't know whether this constraint can block the standard flow elsewhere, but the boundary is present on the service account's role.
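For context on why a permission boundary can matter here: for an IAM role, an action is effectively allowed only when both the attached identity policies and the boundary allow it. The snippet below is a deliberately simplified sketch of that intersection rule. It ignores explicit denies, resources, conditions, and organization-level policies, and the function names and the example boundary are hypothetical, not taken from this account. Whether the boundary is actually the cause in this thread is not established.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def allows(allowed_actions: Iterable[str], action: str) -> bool:
    """True if any allowed pattern (with * or ? wildcards) matches the action.

    IAM action names are matched case-insensitively, so both sides are lowercased.
    """
    return any(fnmatchcase(action.lower(), p.lower()) for p in allowed_actions)


def effectively_allowed(identity_allow: Iterable[str],
                        boundary_allow: Iterable[str],
                        action: str) -> bool:
    """Simplified model: the action must be allowed by BOTH the identity
    policy and the permission boundary (no explicit-deny handling)."""
    identity_allow = list(identity_allow)
    boundary_allow = list(boundary_allow)
    return allows(identity_allow, action) and allows(boundary_allow, action)


# Illustrative values only: an identity policy that allows X-Ray writes,
# and a hypothetical boundary that does not mention X-Ray at all.
identity = ["xray:PutTraceSegments", "xray:PutTelemetryRecords"]
hypothetical_boundary = ["s3:*", "logs:*"]

print(effectively_allowed(identity, hypothetical_boundary, "xray:PutTraceSegments"))
```

Under this model, a boundary that omits the X-Ray actions would produce an access denial even though AWSXrayWriteOnlyAccess is attached, so it may be worth confirming that the boundary in use covers the actions the exporter needs.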
Oh the wonderful world of permission boundaries. Not sure if we have the complete picture, knowing this now. Two options: if you have Enterprise support, please cut us a ticket via your TAM or SA. If not, I'd work from the left, that is, check: serviceaccount -> pod -> IAM role, or try out a different mode (deployment).
Please note that we offer support via GitHub on a best effort basis, so could take some time (hence, suggesting the support route).
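A sketch of what such a left-to-right check could look like, assuming the service account and pod manifests have been exported as JSON (for example with `kubectl get serviceaccount adot-collector -n testnamespace -o json` and `kubectl get pod <pod-name> -n testnamespace -o json`). The function name `check_irsa_chain` is hypothetical. It relies on two things IRSA is known to do: the service account carries an `eks.amazonaws.com/role-arn` annotation, and containers in pods that use that service account get `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` environment variables injected. Note that all containers in a pod share the pod's service account, so in sidecar mode it may be worth checking which service account the annotated application pod actually runs as.

```python
from typing import List

ROLE_ANNOTATION = "eks.amazonaws.com/role-arn"
IRSA_ENV_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def check_irsa_chain(service_account: dict, pod: dict) -> List[str]:
    """Hypothetical helper: return a list of problems found along the
    serviceaccount -> pod -> IAM role chain. An empty list means the
    manifests look consistent (it does not prove the role's permissions)."""
    problems = []

    sa_meta = service_account.get("metadata", {})
    sa_name = sa_meta.get("name")
    sa_namespace = sa_meta.get("namespace")
    role_arn = (sa_meta.get("annotations") or {}).get(ROLE_ANNOTATION)
    if not role_arn:
        problems.append(f"service account {sa_name} has no {ROLE_ANNOTATION} annotation")

    pod_meta = pod.get("metadata", {})
    pod_spec = pod.get("spec", {})
    if pod_meta.get("namespace") != sa_namespace:
        problems.append(
            f"pod is in namespace {pod_meta.get('namespace')}, "
            f"service account is in {sa_namespace}"
        )

    # Every container in the pod, including an injected sidecar,
    # runs under this one service account.
    pod_sa = pod_spec.get("serviceAccountName", "default")
    if pod_sa != sa_name:
        problems.append(f"pod runs as service account {pod_sa}, not {sa_name}")

    for container in pod_spec.get("containers", []):
        env = {e["name"]: e.get("value") for e in container.get("env") or []}
        missing = [name for name in IRSA_ENV_VARS if name not in env]
        if missing:
            problems.append(f"container {container.get('name')} is missing {', '.join(missing)}")
        elif role_arn and env["AWS_ROLE_ARN"] != role_arn:
            problems.append(
                f"container {container.get('name')} has AWS_ROLE_ARN "
                f"{env['AWS_ROLE_ARN']}, expected {role_arn}"
            )

    return problems
```

Loading the two exported files with `json.load` and passing them to the function gives a quick list of mismatches to look at before digging into the IAM role itself.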
> Oh the wonderful world of permission boundaries.
Yes I know :(
> Not sure if we have the complete picture, knowing this now. Two options: if you have Enterprise support, please cut us a ticket via your TAM or SA.
Unfortunately, on this account we have a Basic plan for the moment.
> If not, I'd work from left, that is, check: serviceaccount -> pod -> IAM role or try out a different mode (deployment).
What do you mean by check? Anyway, I have now tried deployment mode and it works. For development purposes that's OK, but we would like to use sidecar mode. Is it possible that something is missing in the pod configuration?
> Please note that we offer support via GitHub on a best effort basis, so could take some time (hence, suggesting the support route).
I know of course ;)
> Anyway, I have now tried deployment mode and it works. For development purposes that's OK, but we would like to use sidecar mode. Is it possible that something is missing in the pod configuration?
Interesting. Let me look into this (note that the add-on is using upstream OpenTelemetry operator) and get back to you.
Would you mind expanding on why you prefer sidecar over deployment or other non-sidecar modes?
> Interesting. Let me look into this (note that the add-on is using upstream OpenTelemetry operator) and get back to you.
Great.
> Would you mind expanding on why you prefer sidecar over deployment or other non-sidecar modes?
It's a consideration based on a previous environment with Jaeger, where we switched from a single collector (it sometimes had problems, but I really don't remember the specific cause) to a sidecar container. Of course we can evaluate a different mode if it works :)
Thanks for the context @fpaparoni, and yes, I would recommend evaluating other modes. Depending on your workload (number of pods), using sidecar mode can be a rather resource-intensive option.
Hi @fpaparoni, can you confirm that the Collector and the Pod you are annotating are in the same namespace? That may be a reason why the sidecar mode doesn't seem to be working.
Yes, in both modes the Collector, Pod and Service Account are in the same namespace. Deployment mode now works; if I switch to sidecar I receive an AccessDeniedException.
Hey @fpaparoni, any updates here? Were you ever able to get sidecar deployment of the Collector working? If not, I'd like to dive a bit deeper into why this issue might be happening.
We are using the deployment mode without problems and never switched back to sidecar. If it can be useful I can make some specific tests
> We are using the deployment mode without problems and never switched back to sidecar. If it can be useful I can make some specific tests
I see - I've been trying to replicate your issue with no luck, but I haven't involved permission boundaries at all so that might be where the issue lies.
Also, when you say you can make specific tests, what are you referring to? What tests do you think would be useful to create?
I was thinking about looking at specific logs, if that would be useful. Anyway, we are now using deployment mode without problems and we won't go back.