[k8s] Pod metrics is gone when using containerd as runtime

Open pingleig opened this issue 4 years ago • 11 comments

This is exported from internal ticket

TL;DR

The latest image has been released. If you were using the temp image from this comment https://github.com/aws/amazon-cloudwatch-agent/issues/188#issuecomment-803764697 please update to the latest tag.

If the error message W! No pod metric collected, metrics count is still 7 is containerd socket mounted? https://github.com/aws/amazon-cloudwatch-agent/issues/188 leads you to this issue:

  • make sure you have updated the yaml to mount the containerd socket into the cloudwatch-agent pod
  • the path for containerd socket may not be in the standard location. e.g. bottlerocket uses /run/dockershim.sock instead of /run/containerd/containerd.sock
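
To find out which of the two paths a given node uses, something like the following could be run on the host. This is only a sketch: the candidate list contains just the two paths mentioned in this issue, and pickSocket and isSocket are hypothetical helper names, not part of the agent.

```go
package main

import (
	"fmt"
	"os"
)

// candidateSockets lists the host paths mentioned in this issue.
// The order is an assumption made for illustration.
var candidateSockets = []string{
	"/run/containerd/containerd.sock", // standard location
	"/run/dockershim.sock",            // used by bottlerocket
}

// pickSocket returns the first candidate for which exists reports true.
// The existence check is passed in so the selection logic can be tested
// without touching the filesystem.
func pickSocket(candidates []string, exists func(string) bool) (string, bool) {
	for _, p := range candidates {
		if exists(p) {
			return p, true
		}
	}
	return "", false
}

// isSocket reports whether path exists and is a unix socket.
func isSocket(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode()&os.ModeSocket != 0
}

func main() {
	if p, ok := pickSocket(candidateSockets, isSocket); ok {
		fmt.Println("containerd socket found at", p)
	} else {
		fmt.Println("no containerd socket found in the candidate paths")
	}
}
```

Whichever path is reported is the one to use as the hostPath in the daemonset volumes.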

Background

We relied on the pause container having the name POD to detect a pod. That holds for docker but not for containerd https://github.com/containerd/cri/issues/922#issuecomment-423729537

Users will not see pod metrics in the Container Insights dashboard, and they will find the following log message, which was introduced in #171

https://github.com/aws/amazon-cloudwatch-agent/blob/fbdd619269be7a00172e06992a8d40b22be1a6d7/plugins/inputs/cadvisor/container_info_processor.go#L72-L72

The root cause is that we expect containerName == 'POD' to mark a path as a pod

https://github.com/aws/amazon-cloudwatch-agent/blob/fbdd619269be7a00172e06992a8d40b22be1a6d7/plugins/inputs/cadvisor/container_info_processor.go#L119-L126
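
The failure mode can be illustrated with a small sketch. This is not the agent's code. The label keys io.kubernetes.container.name and io.cri-containerd.kind, and the idea of falling back to a sandbox label, are assumptions made here for illustration only. The function names are hypothetical.

```go
package main

import "fmt"

// Label keys assumed for illustration; they are not taken from the agent source.
const (
	kubeContainerNameLabel = "io.kubernetes.container.name"
	containerdKindLabel    = "io.cri-containerd.kind"
)

// isPodByName mirrors the kind of check described above: a container is
// treated as the pod only when its container name is "POD".
func isPodByName(labels map[string]string) bool {
	return labels[kubeContainerNameLabel] == "POD"
}

// isPodWithFallback additionally accepts a container that is marked as a
// sandbox, which is how this sketch assumes containerd labels the pause container.
func isPodWithFallback(labels map[string]string) bool {
	return isPodByName(labels) || labels[containerdKindLabel] == "sandbox"
}

func main() {
	dockerPause := map[string]string{kubeContainerNameLabel: "POD"}
	containerdPause := map[string]string{containerdKindLabel: "sandbox"}

	fmt.Println("docker, name check:        ", isPodByName(dockerPause))
	fmt.Println("containerd, name check:    ", isPodByName(containerdPause))
	fmt.Println("containerd, fallback check:", isPodWithFallback(containerdPause))
}
```

With the name-only check the containerd pause container is never recognized, so no path gets marked as a pod and no pod metrics are produced.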

Fix

  • code should be similar to https://github.com/DataDog/integrations-core/pull/2283
  • the manifest needs to mount the containerd sock into the cwagent daemonset container

Release

The fix will be included in the next release; the release date is not determined yet.

pingleig avatar Mar 21 '21 01:03 pingleig

Created a temp image based on #189 ~~public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1~~ (the latest official release now contains this fix). The daemonset yaml needs to be updated to mount /run/containerd/containerd.sock

NOTE: If you are using bottlerocket on eks, the socket path on the host is different due to https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b You need to (and only need to) replace the volumes part to pick the right sock on the host. (The full snippet is at the end of this comment.)

      volumes:
        # ...
        - name: containerdsock
          hostPath:
            # path: /run/containerd/containerd.sock
            # bottlerocket does not mount containerd sock at normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock

Default containerd path

When host (and kubelet) is using /run/containerd/containerd.sock

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1
          imagePullPolicy: Always
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.0"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent

Non default containerd path

NOTE: You only need to change the volumes. When mounting into the cloudwatch agent container, you should still use the default path as the mountPath.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      # aws eks update-kubeconfig --name eks-pod-metric-missing --region us-west-2
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1
          imagePullPolicy: Always
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.0"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            # path: /run/containerd/containerd.sock
            # bottle rocket does not mount containerd sock at normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent
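
After applying either manifest, one way to confirm that the socket really is available at the default in-container path is to try connecting to it from inside the agent container. A minimal sketch, assuming a 2 second timeout is acceptable; checkSocket is a hypothetical helper and not part of the agent.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// inContainerSock is the mountPath used in both manifests above.
const inContainerSock = "/run/containerd/containerd.sock"

// checkSocket tries to open a unix socket connection to path and closes it
// again. It returns an error if the socket is missing or refuses connections.
func checkSocket(path string) error {
	conn, err := net.DialTimeout("unix", path, 2*time.Second)
	if err != nil {
		return fmt.Errorf("dial %s: %w", path, err)
	}
	return conn.Close()
}

func main() {
	if err := checkSocket(inContainerSock); err != nil {
		fmt.Println("socket not reachable:", err)
		return
	}
	fmt.Println("socket reachable at", inContainerSock)
}
```

If this fails while the file exists on the host, the hostPath in the volumes section is most likely pointing at the wrong location.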

pingleig avatar Mar 22 '21 05:03 pingleig

Another known issue: because we are using cadvisor, pod level filesystem usage is ignored. The affected metrics are

    "container_filesystem_available",
    "container_filesystem_capacity",
    "container_filesystem_usage",
    "container_filesystem_utilization"

https://github.com/google/cadvisor/blob/291c215c5ddc5216659b5e793a98a0ba9f104afb/container/containerd/handler.go#L163-L167

func (h *containerdContainerHandler) GetSpec() (info.ContainerSpec, error) {
	// TODO: Since we dont collect disk usage stats for containerd, we set hasFilesystem
	// to false. Revisit when we support disk usage stats for containerd
	hasFilesystem := false
	spec, err := common.GetSpec(h.cgroupPaths, h.machineInfoFactory, h.needNet(), hasFilesystem)
	spec.Labels = h.labels
	spec.Envs = h.envs
	spec.Image = h.image

	return spec, err
}
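
If you want to check whether a given set of collected metrics is affected, a small helper like the one below could be used. It is a sketch only: the four names come from the list above, missingFilesystemMetrics is a hypothetical helper, and the example metric name in main is made up.

```go
package main

import "fmt"

// filesystemMetrics are the four metric names listed above.
var filesystemMetrics = []string{
	"container_filesystem_available",
	"container_filesystem_capacity",
	"container_filesystem_usage",
	"container_filesystem_utilization",
}

// missingFilesystemMetrics returns the filesystem metric names that are not
// present in the collected set.
func missingFilesystemMetrics(collected map[string]bool) []string {
	var missing []string
	for _, name := range filesystemMetrics {
		if !collected[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Hypothetical sample of what was collected on a containerd node.
	collected := map[string]bool{"example_cpu_metric": true}
	for _, name := range missingFilesystemMetrics(collected) {
		fmt.Println("missing:", name)
	}
}
```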

pingleig avatar Mar 22 '21 21:03 pingleig

NOTE: container file system usage is not provided after switching to containerd https://github.com/google/cadvisor/issues/2785

Created another issue to track the container filesystem metrics https://github.com/aws/amazon-cloudwatch-agent/issues/192

pingleig avatar Mar 23 '21 01:03 pingleig

Reopening this issue since we are still in the release process, and the official Container Insights public doc and sample manifest are not updated yet.

pingleig avatar Mar 31 '21 00:03 pingleig

Close since the release is out

  • https://hub.docker.com/r/amazon/cloudwatch-agent/tags?page=1&ordering=last_updated
  • https://gallery.ecr.aws/cloudwatch-agent/cloudwatch-agent

pingleig avatar Apr 23 '21 17:04 pingleig

This needs to be fixed within the official helm charts for EKS https://github.com/aws/eks-charts/blob/master/stable/aws-cloudwatch-metrics/templates/daemonset.yaml

fitchtech avatar Aug 20 '21 23:08 fitchtech

@pingleig I have tried applying the fix listed above exactly as is on EKS with the containerd runtime enabled. However, I'm still getting the same error messages:

2021-08-21T00:08:59Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-08-21T00:08:59Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2021-08-21T00:09:00Z W! No pod metric collected, metrics count is still 5 is containerd socket mounted? https://github.com/aws/amazon-cloudwatch-agent/issues/188
2021-08-21T00:09:05Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49605661750447750614958043896578931231172344896032866930
2021-08-21T00:09:05Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 105.761168ms before retrying.

Support for containerd runtime on EKS was added in July when EKS 1.21 was released. https://aws.amazon.com/blogs/containers/amazon-eks-1-21-released/

fitchtech avatar Aug 21 '21 00:08 fitchtech

@fitchtech The containerd socket on the host is at a different path (same as bottlerocket). This is the PR for the EKS AMI https://github.com/awslabs/amazon-eks-ami/pull/698/files and the config file https://github.com/awslabs/amazon-eks-ami/blob/8450297eb2ef87fe5cbbce52a86ddcdc8b2e6716/files/containerd-config.toml#L1-L6

[grpc]
address = "/run/dockershim.sock"

You can follow the non default path section in https://github.com/aws/amazon-cloudwatch-agent/issues/188#issuecomment-803764697

          hostPath:
            # path: /run/containerd/containerd.sock
            # bottle rocket does not mount containerd sock at normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock

cc @sethAmazon since both EKS EC2 and Bottlerocket are using /run/dockershim.sock, we may change this to be the default. I was testing with kops at that time, though, which uses /run/containerd/containerd.sock. I am not sure if it's possible to have one manifest that works for both in our example manifest, though it should be doable for helm.

pingleig avatar Aug 21 '21 05:08 pingleig

@pingleig that worked, thank you. One additional change I had to make was to enable hostNetwork, because the EC2 instances in my EKS 1.21 node group have the Instance MetaData Service (IMDS) restricted per the EKS security best practices. You have to set hostNetwork: true for the agent to be able to start up. Once I did, everything loaded in the ContainerInsights console.

With hostNetwork: false I get this

2021/08/21 07:23:59 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml 
2021-08-21T07:23:59Z I! Starting AmazonCloudWatchAgent 1.247349.0
2021-08-21T07:23:59Z I! Loaded inputs: k8sapiserver cadvisor
2021-08-21T07:23:59Z I! Loaded aggregators: 
2021-08-21T07:23:59Z I! Loaded processors: ec2tagger k8sdecorator
2021-08-21T07:23:59Z I! Loaded outputs: cloudwatchlogs
2021-08-21T07:23:59Z I! Tags enabled: 
2021-08-21T07:23:59Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-106-12-9.ec2.internal", Flush Interval:1s
2021-08-21T07:23:59Z I! [logagent] starting
2021-08-21T07:23:59Z I! [logagent] found plugin cloudwatchlogs is a log backend

With hostNetwork: true

2021/08/21 07:28:18 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml 
2021-08-21T07:28:18Z I! Starting AmazonCloudWatchAgent 1.247349.0
2021-08-21T07:28:18Z I! Loaded inputs: cadvisor k8sapiserver
2021-08-21T07:28:18Z I! Loaded aggregators: 
2021-08-21T07:28:18Z I! Loaded processors: ec2tagger k8sdecorator
2021-08-21T07:28:18Z I! Loaded outputs: cloudwatchlogs
2021-08-21T07:28:18Z I! Tags enabled: 
2021-08-21T07:28:18Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-106-12-9.ec2.internal", Flush Interval:1s
2021-08-21T07:28:18Z I! [logagent] starting
2021-08-21T07:28:18Z I! [logagent] found plugin cloudwatchlogs is a log backend
2021-08-21T07:28:18Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2021-08-21T07:28:18Z I! k8sapiserver Switch New Leader: ip-10-106-12-14.ec2.internal
2021-08-21T07:28:19Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-08-21T07:28:19Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2021-08-21T07:28:26Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 137.608142ms before retrying.
2021-08-21T07:33:34Z I! [processors.ec2tagger] ec2tagger: Refresh is no longer needed, stop refreshTicker.

ec2tagger doesn't like not being able to access the instance metadata service, and the containers will restart. Once I set hostNetwork to true, I started seeing metrics flow into ContainerInsights. This happened even though the DaemonSet is set to a service account that uses IAM Roles for Service Accounts (IRSA) with a policy that gives it ec2:DescribeVolumes & ec2:DescribeTags.
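
A quick way to see whether IMDS is reachable from the pod's network namespace is to request an IMDSv2 session token from inside the container. The sketch below assumes the standard IMDS endpoint and token header; newTokenRequest is a hypothetical helper, and the 2 second timeout and 60 second TTL are arbitrary choices.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// imdsTokenURL is the IMDSv2 session token endpoint.
const imdsTokenURL = "http://169.254.169.254/latest/api/token"

// newTokenRequest builds the PUT request that asks IMDSv2 for a session
// token valid for ttlSeconds.
func newTokenRequest(ttlSeconds int) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPut, imdsTokenURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", strconv.Itoa(ttlSeconds))
	return req, nil
}

func main() {
	req, err := newTokenRequest(60)
	if err != nil {
		fmt.Println("could not build request:", err)
		return
	}
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("IMDS not reachable from this network namespace:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("IMDS responded with status", resp.Status)
}
```

If the request times out with hostNetwork: false but succeeds with hostNetwork: true, the IMDS restriction on the node is the likely reason ec2tagger cannot initialize.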

Can an update be made that allows this to work without host network enabled on the daemonset?

fitchtech avatar Aug 21 '21 07:08 fitchtech

Also, the IAM policy document attached to the IRSA role needs to allow sts:AssumeRoleWithWebIdentity & sts:AssumeRole, resource restricted to the IRSA role ARN, or the assume role API call will throw access denied errors.

fitchtech avatar Aug 21 '21 07:08 fitchtech

The official EKS helm charts for CloudWatch Metrics should be updated to do this instead of applying manifests so that you can use helm templates to conditionally set those based on values provided.
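
Since Helm templates are Go text/template under the hood, the conditional could look roughly like the sketch below. This is not taken from the eks-charts repo. The SockPath value name and the renderVolume helper are hypothetical, and the default is assumed to be the standard containerd path.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// defaultSockPath is used when no value is provided (assumed default).
const defaultSockPath = "/run/containerd/containerd.sock"

// volumeTmpl renders only the containerdsock volume entry of the daemonset.
var volumeTmpl = template.Must(template.New("volume").Parse(
	"- name: containerdsock\n  hostPath:\n    path: {{ .SockPath }}\n"))

// renderVolume renders the volume entry for the given host socket path,
// falling back to defaultSockPath when sockPath is empty.
func renderVolume(sockPath string) (string, error) {
	if sockPath == "" {
		sockPath = defaultSockPath
	}
	var buf bytes.Buffer
	if err := volumeTmpl.Execute(&buf, struct{ SockPath string }{sockPath}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Render once with the default and once with the EKS / bottlerocket path.
	for _, p := range []string{"", "/run/dockershim.sock"} {
		out, err := renderVolume(p)
		if err != nil {
			fmt.Println("render error:", err)
			continue
		}
		fmt.Print(out)
	}
}
```

Only the hostPath changes between the two renders; the mountPath inside the container would stay at the default path in both cases.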

fitchtech avatar Aug 21 '21 08:08 fitchtech