ADOT EKS add-on documentation is missing important parts

Open • tgraupne opened this issue 10 months ago • 1 comment

Describe the bug
The EKS add-on documentation on the official AWS page links to this Getting Started Guide: https://aws-otel.github.io/docs/getting-started/adot-eks-add-on

When following this guide, no metrics are sent to CloudWatch and the adot-collector shows warnings.

Steps to reproduce
I followed the aforementioned guide:

  1. Created the EKS add-on with aws eks create-addon.
  2. Deployed the OpenTelemetryCollector custom resource.

What did you expect to see?
I expected the official EKS add-on to configure all components needed to send metrics and logs to CloudWatch.

What did you see instead?
No metrics were sent to CloudWatch, and the adot-collector showed warnings.

Additional context
After some hours of online research, I analysed the Kubernetes resources created by the adot-operator and found differences from the maintained Helm charts.

I noticed that the following resources were missing:

  1. Service Accounts
  2. Cluster Role
  3. Cluster Role Binding
  4. Environment variables
  5. Volumes

I also found that I had to use eksctl to create a Service Account / IAM Role combination, to which I attached the following policy: arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy.
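For anyone reproducing this, the Service Account / IAM Role step might look roughly like the eksctl config below. This is only a sketch, not the exact file I ran: the cluster name and region are placeholders I made up, and it assumes the cluster can use an IAM OIDC provider. The Service Account name and namespace match the manifest further down.

```yaml
# Sketch of an eksctl config for the IAM-backed Service Account.
# Assumption: "my-cluster" and "eu-west-1" are placeholders, replace with your own values.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: eu-west-1
iam:
  # IAM roles for service accounts need an OIDC provider associated with the cluster.
  withOIDC: true
  serviceAccounts:
    - metadata:
        # Must match spec.serviceAccount and the ClusterRoleBinding subject.
        name: adot-collector
        namespace: opentelemetry-operator-system
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
```

Assuming the file is saved as adot-iamserviceaccount.yaml (a hypothetical file name), it can be applied with eksctl create iamserviceaccount --config-file=adot-iamserviceaccount.yaml --approve.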

Eventually, I used the following manifest file:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: adot-collector-cluster-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints"]
    verbs: ["list", "watch", "get"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch", "get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create", "get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get", "update", "create"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "update"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get", "update", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adot-collector-cluster-role-binding
subjects:
  - kind: ServiceAccount
    name: adot-collector
    namespace: opentelemetry-operator-system
roleRef:
  kind: ClusterRole
  name: adot-collector-cluster-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: opentelemetry-operator-system
spec:
  mode: daemonset
  serviceAccount: adot-collector
  securityContext:
    runAsUser: 0
    runAsGroup: 0
  env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    - name: HOST_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: K8S_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
  volumes:
    - name: rootfs
      hostPath:
        path: /
    - name: dockersock
      hostPath:
        path: /var/run/docker.sock
    - name: varlibdocker
      hostPath:
        path: /var/lib/docker
    - name: containerdsock
      hostPath:
        path: /run/containerd/containerd.sock
    - name: sys
      hostPath:
        path: /sys
    - name: devdisk
      hostPath:
        path: /dev/disk/
  volumeMounts:
    - name: rootfs
      mountPath: /rootfs
      readOnly: true
    - name: dockersock
      mountPath: /var/run/docker.sock
      readOnly: true
    - name: containerdsock
      mountPath: /run/containerd/containerd.sock
    - name: varlibdocker
      mountPath: /var/lib/docker
      readOnly: true
    - name: sys
      mountPath: /sys
      readOnly: true
    - name: devdisk
      mountPath: /dev/disk
      readOnly: true
    
  config: |
    extensions:
      health_check:

    receivers:
      awscontainerinsightreceiver:

    processors:
      batch/metrics:
        timeout: 60s

    exporters:
      awsemf:
        namespace: ContainerInsights
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{NodeName}'
        log_retention: 30
        resource_to_telemetry_conversion:
          enabled: true
        dimension_rollup_option: NoDimensionRollup
        parse_json_encoded_attr_values: [Sources, kubernetes]
        metric_declarations:

          # node metrics
          - dimensions: [[NodeName, InstanceId, ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
              - node_cpu_usage_total
              - node_cpu_limit
              - node_memory_limit

          # pod metrics
          - dimensions: [[PodName, Namespace, ClusterName]]
            metric_name_selectors:
              - pod_status
              - pod_cpu_utilization
              - pod_memory_utilization
              - pod_network_rx_bytes
              - pod_network_tx_bytes
              - pod_cpu_reserved_capacity
              - pod_memory_reserved_capacity
              - pod_number_of_container_restarts
              - pod_cpu_utilization_over_pod_limit
              - pod_memory_utilization_over_pod_limit

          # cluster metrics
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - cluster_node_count
              - cluster_failed_node_count

          # node fs metrics
          - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
            metric_name_selectors:
              - node_filesystem_utilization

    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [batch/metrics]
          exporters: [awsemf]

      extensions: [health_check]

tgraupne • Apr 29 '24 08:04