aws-otel-community

AccessDeniedException when using ADOT with an EKS cluster

Open: fpaparoni opened this issue 2 years ago • 13 comments

I'm getting an error with a basic ADOT setup, so I'm probably missing something. I created a new EKS cluster and added ADOT as an add-on. The next step was to add a ClusterConfig like this:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: develop
  region: us-east-1

iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: adot-collector
        namespace: testnamespace
      attachPolicyARNs:
      - "arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess"
      - "arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess"
      - "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

After that, I created the following OpenTelemetryCollector in sidecar mode:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: develop-collector-xray
spec:
  mode: sidecar 
  resources:
    requests:
      cpu: "1"
    limits:
      cpu: "1"
  serviceAccount: adot-collector
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
            
    processors:
      batch:

    exporters:
      logging:
        loglevel: debug
      awsxray:
        region: 'us-east-1'

    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [awsxray]
      telemetry:
        logs:
          level: debug
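Side note on the config above: per the Collector documentation, configuring a component does not enable it; only components listed in a `service.pipelines` entry run. Here the `batch` processor and the `logging` exporter are declared but not used by the `traces` pipeline. A minimal Python sketch of that check, assuming the config has already been parsed from YAML into a plain dict (`unused_components` is a hypothetical helper, not part of any library):

```python
# Collector config from this issue as a plain dict (assumed already parsed from YAML).
config = {
    "receivers": {"otlp": {}},
    "processors": {"batch": None},  # "batch:" with no body parses to None
    "exporters": {"logging": {"loglevel": "debug"}, "awsxray": {"region": "us-east-1"}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "exporters": ["awsxray"]},
        },
    },
}


def unused_components(cfg):
    """Return components declared at top level but referenced by no pipeline.

    Hypothetical helper: the collector only enables components that some
    pipeline under service.pipelines lists.
    """
    used = {"receivers": set(), "processors": set(), "exporters": set()}
    for pipeline in (cfg.get("service", {}).get("pipelines") or {}).values():
        for kind in used:
            used[kind].update(pipeline.get(kind, []))
    unused = {}
    for kind, names in used.items():
        declared = set(cfg.get(kind) or {})  # keys of the component section
        leftover = sorted(declared - names)
        if leftover:
            unused[kind] = leftover
    return unused


print(unused_components(config))  # {'processors': ['batch'], 'exporters': ['logging']}
```

Listing `batch` under `processors` in the `traces` pipeline, and `logging` under its `exporters`, would enable both; neither change affects the X-Ray credentials.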

I added the annotation

sidecar.opentelemetry.io/inject: "true"

to my pod definition. I started the application with the Java agent, passing the required environment variables:

ENV OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
ENV OTEL_RESOURCE_ATTRIBUTES=service.namespace=test-be,service.name=test-be
ENV AWS_REGION=us-east-1
ENV OTEL_METRICS_EXPORTER=otlp
CMD java -javaagent:/app/bin/aws-opentelemetry-agent.jar -jar /app/bin/registry.jar
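`OTEL_RESOURCE_ATTRIBUTES` is a comma-separated list of `key=value` pairs, so the value above yields two resource attributes. A simplified Python sketch of that parsing (`parse_resource_attributes` is a hypothetical name; unlike the OpenTelemetry SDKs it does no key validation and silently drops malformed entries):

```python
from urllib.parse import unquote


def parse_resource_attributes(raw):
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string into a dict.

    Simplified sketch (hypothetical helper): splits on ',' and then on the
    first '=', trims whitespace, percent-decodes values, and skips entries
    without '=' or with an empty key.
    """
    attrs = {}
    for entry in raw.split(","):
        key, sep, value = entry.partition("=")
        key = key.strip()
        if not sep or not key:
            continue  # malformed entry: ignore it
        attrs[key] = unquote(value.strip())
    return attrs


print(parse_resource_attributes("service.namespace=test-be,service.name=test-be"))
# {'service.namespace': 'test-be', 'service.name': 'test-be'}
```

If `OTEL_SERVICE_NAME` is also set, the OpenTelemetry specification gives it precedence over a `service.name` supplied here.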

Once the pod starts, I can see the injected sidecar container, but tracing doesn't work, and its logs show the following error:

2023-01-26T12:05:36.665Z	debug	[email protected]/awsxray.go:70	response error	{"kind": "exporter", "data_type": "traces", "name": "awsxray", "error": "AccessDeniedException: \n\tstatus code: 403, request id: c3c8ff28-18c5-4c2c-a5b5-e48b93b020c4"}
2023-01-26T12:05:36.665Z	debug	[email protected]/awsxray.go:74	response: {

}	{"kind": "exporter", "data_type": "traces", "name": "awsxray"}

I'm probably missing some authorization somewhere, but I have no idea where, because I followed the official guide:

https://docs.aws.amazon.com/eks/latest/userguide/opentelemetry.html

Any ideas?

Thanks

fpaparoni, Jan 26 '23 12:01

It looks like IRSA is not working as expected. Make sure that the adot-collector service account you created via eksctl is in the same namespace as the ADOT collector.
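With IRSA, eksctl creates an IAM role and annotates the Kubernetes service account with `eks.amazonaws.com/role-arn`, and a pod can only use a service account from its own namespace. A minimal Python sketch of these checks, run against the service account and collector manifests loaded as dicts (the helper `irsa_problems`, the sample role ARN, and the "default" fallback for manifests without a namespace are assumptions for illustration):

```python
ROLE_ARN_ANNOTATION = "eks.amazonaws.com/role-arn"


def irsa_problems(service_account, collector):
    """Return human-readable IRSA setup problems (empty list if none found).

    Hypothetical helper; both arguments are Kubernetes manifests as dicts.
    Manifests without metadata.namespace are treated as "default" here,
    which is an assumption (kubectl -n can place them elsewhere).
    """
    problems = []
    sa_meta = service_account.get("metadata", {})
    annotations = sa_meta.get("annotations") or {}
    if not annotations.get(ROLE_ARN_ANNOTATION):
        problems.append(f"service account has no {ROLE_ARN_ANNOTATION} annotation")
    # ServiceAccounts are namespaced: a pod can only use one from its own namespace.
    sa_ns = sa_meta.get("namespace", "default")
    col_ns = collector.get("metadata", {}).get("namespace", "default")
    if sa_ns != col_ns:
        problems.append(f"service account is in {sa_ns!r} but collector is in {col_ns!r}")
    if collector.get("spec", {}).get("serviceAccount") != sa_meta.get("name"):
        problems.append("collector spec.serviceAccount does not name this service account")
    return problems


sa = {
    "metadata": {
        "name": "adot-collector",
        "namespace": "testnamespace",
        # Hypothetical ARN; eksctl fills in the real role ARN.
        "annotations": {ROLE_ARN_ANNOTATION: "arn:aws:iam::111122223333:role/example"},
    }
}
collector = {
    "metadata": {"name": "develop-collector-xray", "namespace": "testnamespace"},
    "spec": {"mode": "sidecar", "serviceAccount": "adot-collector"},
}
print(irsa_problems(sa, collector))  # []
```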

mhausenblas, Jan 26 '23 12:01

Yes, I already checked; it's in the same namespace.

The only other thing (I don't know whether it could be a problem) is that my account has a constraint requiring a permissions boundary on every IAM role. Of course I added it to the ClusterConfig, since otherwise the role can't be created (I left it out of the example above). I don't know whether this constraint can block the standard flow somewhere else, but the boundary is present on the service account's role.
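For context on how a boundary interacts with the attached policies: when a permissions boundary is set on a role, an action is allowed only if both the role's identity-based policies and the boundary allow it. A deliberately simplified Python model of that intersection, ignoring deny statements, resources, and conditions; the boundary's action list is hypothetical, and the identity list is a subset of what `AWSXrayWriteOnlyAccess` grants:

```python
from fnmatch import fnmatchcase


def allows(patterns, action):
    """True if any IAM-style action pattern ('*' wildcards) matches the action."""
    # IAM action names are case-insensitive, so compare lowercased.
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def effectively_allowed(identity_actions, boundary_actions, action):
    """Simplified model: allowed only if BOTH the identity policy and the
    permissions boundary allow the action. Deny statements, resources, and
    conditions are not modeled."""
    return allows(identity_actions, action) and allows(boundary_actions, action)


# Subset of actions granted by the attached AWSXrayWriteOnlyAccess policy.
identity = ["xray:PutTraceSegments", "xray:PutTelemetryRecords"]
# Hypothetical boundary that does not mention X-Ray at all.
boundary = ["cloudwatch:*", "logs:*", "aps:RemoteWrite"]

print(effectively_allowed(identity, boundary, "xray:PutTraceSegments"))  # False
print(effectively_allowed(identity, boundary + ["xray:*"], "xray:PutTraceSegments"))  # True
```

Since the same role works in deployment mode later in this thread, the boundary alone may not explain the sidecar failure; the model only shows what a boundary can and cannot block.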

fpaparoni, Jan 26 '23 13:01

Oh, the wonderful world of permission boundaries. Not sure we have the complete picture, knowing this now. Two options: if you have Enterprise Support, please cut us a ticket via your TAM or SA. If not, I'd work from the left, that is, check service account -> pod -> IAM role, or try out a different mode (deployment).

Please note that we offer support via GitHub on a best-effort basis, so it could take some time (hence the suggestion to go the support route).
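One way to do the pod step of that check: the EKS Pod Identity Webhook injects `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` into the containers of pods whose service account carries the IRSA annotation, and the AWS SDKs use them to assume the role. If they are missing, the SDK falls back to other credential sources such as the node's instance role, which could produce an AccessDeniedException. A minimal Python sketch, assuming you feed it a container's environment (for example, the variables listed by `kubectl describe pod`); the function names are hypothetical:

```python
import os

IRSA_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def irsa_env_report(env):
    """Map each IRSA variable to its value, or None if it is missing."""
    return {name: env.get(name) for name in IRSA_VARS}


def uses_irsa(env, check_token_file=False):
    """True if both IRSA variables are set (and, optionally, the token file exists).

    Hypothetical helper. If either variable is missing, the AWS SDK falls back
    to other credential sources, such as the node's instance role.
    """
    report = irsa_env_report(env)
    if not all(report.values()):
        return False
    if check_token_file:
        return os.path.isfile(report["AWS_WEB_IDENTITY_TOKEN_FILE"])
    return True


# Example: environment copied from `kubectl describe pod` (hypothetical values).
sidecar_env = {"AWS_REGION": "us-east-1"}
print(uses_irsa(sidecar_env))  # False: the sidecar would not assume the IRSA role
print(uses_irsa(os.environ, check_token_file=True))
```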

mhausenblas, Jan 26 '23 13:01

Oh, the wonderful world of permission boundaries.

Yes, I know :(

Not sure we have the complete picture, knowing this now. Two options: if you have Enterprise Support, please cut us a ticket via your TAM or SA.

Unfortunately, this account is on a Basic plan for the moment.

If not, I'd work from the left, that is, check service account -> pod -> IAM role, or try out a different mode (deployment).

What do you mean by check? Anyway, I've now tried deployment mode and it works. For development purposes that's OK, but we would like to use sidecar mode. Is it possible that something is missing in the pod configuration?
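On the pod-configuration question: a pod has a single `serviceAccountName`, shared by all of its containers, so in sidecar mode the injected collector runs with the application pod's service account, whereas in deployment mode the collector pods use the account named in the collector's `serviceAccount` field. A pod that omits `serviceAccountName` runs as its namespace's `default` account. A minimal Python sketch that flags an annotated pod not running as the IRSA-enabled account (the helper name and the sample pod spec are hypothetical; namespace-level injection annotations are ignored):

```python
INJECT_ANNOTATION = "sidecar.opentelemetry.io/inject"


def sidecar_identity_problem(pod, irsa_service_account):
    """Return a warning if an annotated pod would not run as the IRSA account.

    Hypothetical helper. `pod` is a Pod (or Deployment pod template) as a dict.
    Returns None when there is nothing to report. Namespace-level injection
    annotations are not considered here.
    """
    annotations = pod.get("metadata", {}).get("annotations") or {}
    # The annotation may be "true" or a collector name; only "false"/absent disables it.
    if str(annotations.get(INJECT_ANNOTATION, "false")).lower() == "false":
        return None  # no sidecar injected, nothing to check
    # Pods that omit serviceAccountName run as the namespace's "default" account.
    actual = pod.get("spec", {}).get("serviceAccountName") or "default"
    if actual != irsa_service_account:
        return f"sidecar runs as {actual!r}, not {irsa_service_account!r}"
    return None


# Hypothetical application pod: annotated for injection, no serviceAccountName.
app_pod = {
    "metadata": {"annotations": {INJECT_ANNOTATION: "true"}},
    "spec": {"containers": [{"name": "app"}]},
}
print(sidecar_identity_problem(app_pod, "adot-collector"))
# sidecar runs as 'default', not 'adot-collector'
```

If it reports a mismatch, setting `serviceAccountName: adot-collector` in the application's pod spec is one way to give the sidecar the IRSA role.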

Please note that we offer support via GitHub on a best-effort basis, so it could take some time (hence the suggestion to go the support route).

Of course, I know ;)

fpaparoni, Jan 26 '23 14:01

Anyway, I've now tried deployment mode and it works. For development purposes that's OK, but we would like to use sidecar mode. Is it possible that something is missing in the pod configuration?

Interesting. Let me look into this (note that the add-on uses the upstream OpenTelemetry Operator) and get back to you.

Would you mind expanding on why you prefer sidecar over deployment or other non-sidecar modes?

mhausenblas, Jan 26 '23 14:01

Interesting. Let me look into this (note that the add-on uses the upstream OpenTelemetry Operator) and get back to you.

Great.

Would you mind expanding on why you prefer sidecar over deployment or other non-sidecar modes?

It's based on a previous environment with Jaeger, where we switched from a single collector (it sometimes had problems, but I really don't remember the specific cause) to a sidecar container. Of course, we can evaluate a different mode if it works :)

fpaparoni, Jan 26 '23 15:01

Thanks for the context, @fpaparoni. Yes, I would recommend evaluating other modes: depending on your workload (for example, the number of pods), sidecar mode can be a rather resource-intensive option.
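To put rough numbers on the resource point: with the `cpu: "1"` request from the collector config earlier in this issue, sidecar mode reserves one CPU per application pod, while deployment mode reserves it per collector replica. A back-of-the-envelope Python sketch (the function name and the pod and replica counts are hypothetical):

```python
def collector_cpu_request(mode, app_pods, replicas=1, cpu_per_collector=1.0):
    """Total CPU (cores) requested by collectors. Hypothetical helper.

    sidecar: one collector container per application pod.
    deployment: a fixed number of collector replicas, independent of app pods.
    """
    if mode == "sidecar":
        return app_pods * cpu_per_collector
    if mode == "deployment":
        return replicas * cpu_per_collector
    raise ValueError(f"unsupported mode: {mode}")


# Hypothetical workload: 40 application pods vs. a 2-replica collector deployment.
print(collector_cpu_request("sidecar", app_pods=40))  # 40.0
print(collector_cpu_request("deployment", app_pods=40, replicas=2))  # 2.0
```

A shared collector may need a larger request per replica than a sidecar does, so the real gap is likely smaller, but in sidecar mode the total still grows with the number of pods.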

mhausenblas, Jan 26 '23 15:01

Hi @fpaparoni, can you confirm that the Collector and the Pod you are annotating are in the same namespace? That may be a reason why the sidecar mode doesn't seem to be working.

erichsueh3, Jan 26 '23 20:01

Yes, in both modes the Collector, Pod, and Service Account are in the same namespace. Deployment mode works now; if I switch to sidecar, I get an AccessDeniedException.

fpaparoni, Jan 27 '23 08:01

Hey @fpaparoni, any updates here? Were you ever able to get sidecar deployment of the Collector working? If not, I'd like to dive a bit deeper into why this issue might be happening.

erichsueh3, Feb 22 '23 19:02

We're using deployment mode without problems and never switched back to sidecar. If it would be useful, I can run some specific tests.

fpaparoni, Feb 24 '23 09:02

We're using deployment mode without problems and never switched back to sidecar. If it would be useful, I can run some specific tests.

I see. I've been trying to reproduce your issue with no luck, but I haven't involved permission boundaries at all, so that might be where the issue lies.

Also, when you say you can run specific tests, what are you referring to? What tests do you think would be useful?

erichsueh3, Mar 04 '23 00:03

I was thinking of looking at specific logs, if that's useful. Anyway, we're now using deployment mode without problems and won't switch back.

fpaparoni, Mar 21 '23 20:03