
Docker CPU Usage Increases Over Time

stewartcampbell opened this issue 3 years ago · 13 comments

Summary

CPU usage increases over time using short-lived tasks connected to EFS.

Description

We have a slightly unusual set-up with roughly 10 short-lived tasks running every minute. These tasks mount 2 EFS volumes, do what they need to do, then shut down again. Not all tasks are equal, but none runs longer than 10 minutes, and most run well under 1 minute. There are plenty of times when no tasks are running at all.

We notice that CPU usage, as well as I/O reads/writes, creeps up over time. The graph below shows the increase on a t3.small instance since it launched yesterday.

[image: CPU usage graph on the t3.small instance since launch]

The number of tasks and frequency is not changing.

Running systemctl restart docker brings CPU usage back down, after which it starts to creep up again as before.

Expected Behavior

CPU / IO to stay steady.

Observed Behavior

CPU / IO slowly increases.

Environment Details

[root@ip-10-0-16-34 bin]# docker info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 35
  Running: 3
  Paused: 0
  Stopped: 32
 Images: 6
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: amazon-ecs-volume-plugin local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc version: 84113eef6fc27af1b01b3181f31bbaf708715301
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.256-197.484.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 1.908GiB
 Name: ip-10-0-16-34.eu-west-1.compute.internal
 ID: MV2W:RZIE:YOBU:UOUS:7KLN:737Q:L577:IZ4Q:B2IA:ZKWH:3HHJ:OBJ3
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
[root@ip-10-0-16-34 /]# cat /etc/ecs/ecs.config
ECS_CLUSTER=dev-xxxxxx-cli-1
ECS_CONTAINER_STOP_TIMEOUT=10s
ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
ECS_RESERVED_MEMORY=200
[root@ip-10-0-16-34 /]# curl http://localhost:51678/v1/metadata
{"Cluster":"dev-xxxxxx-cli-1","ContainerInstanceArn":"arn:aws:ecs:eu-west-1:xxxxxxxxxxxx:container-instance/dev-xxxxxx-cli-1/61f449ba47e1456da3723509768642d0","Version":"Amazon ECS Agent - v1.58.0 (80ad4ca4)"}
[root@ip-10-0-16-34 /]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        970M     0  970M   0% /dev
tmpfs           978M     0  978M   0% /dev/shm
tmpfs           978M  3.0M  975M   1% /run
tmpfs           978M     0  978M   0% /sys/fs/cgroup
/dev/nvme0n1p1  100G  5.2G   95G   6% /
overlay         100G  5.2G   95G   6% /var/lib/docker/overlay2/4efe78757cb957c42ffe401ba33f053e3f91748a2dbc8535421dce9572d8d2b2/merged
shm              64M     0   64M   0% /var/lib/docker/containers/e41525561b6bd9ff16a8409c86bf92b9521df636fe4992b279fafd872927af7c/mounts/shm
overlay         100G  5.2G   95G   6% /var/lib/docker/overlay2/d88d51191a43d506e4e1e395387c6ec3d5b3672e01ca29d95af5bfc2941b168e/merged
shm              64M     0   64M   0% /var/lib/docker/containers/c93fa0d857d8d1817d67eb3e8a48a7e5a5648ce0aec0ead8f0b74b034211dc06/mounts/shm
overlay         100G  5.2G   95G   6% /var/lib/docker/overlay2/57051246ac2e59aca51562ca20511448b61f0bc688f017ab63791179c25ae9f2/merged
shm              64M     0   64M   0% /var/lib/docker/containers/a94ccc4641109bd81cdb6b682f80517698fedf4457b61511d403b663e2133b89/mounts/shm
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-9a8ae5c0ffb0e09c9f01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-c8bb82de8cd0eda28001
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-a2cec5ff969afe8ec801
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-90dae2aad7efeda6c201
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-ec81ea8a9aaef0820700
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-d2edcbc8e0dad8ef6a00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-eecbf2a393dbd5d40500
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-c2d7ffe4ce9daff9b001
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-feb99ed1e1dbd786de01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-e0a5ccdf9efd80aaed01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-84e599c9d7ffd5e77200
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-d6bab1e6f99d8e86de01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-80b99494d5d6bcd1fc01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-949be4d18f83efadcd01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-94fbc6c8b785a1c10e00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-e282e189dfceafe2e301
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-ae839cf19dd2c3a33d00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-d8fed99b87a3a7d29801
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-acac94addfbad3f8e501
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-cce2ddb18ec193800a00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-e4e8d09adcd5e6fc7900
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-eeba9acdac83c0d01c00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-a0abe9f8a2ba95e73b00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-9ec6e2baeae9f28ac201
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-f0b1daac81c6f3d25700
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-ea8394f29e9f8fc66e00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-9aabf2e4849083860f00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-92bceba498e797b30700
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-9695809b82dd80846d00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-f286d4c7e19f80e7ea01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-86e5b2e8eed3eac68401
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-fa9a86ae9692b3c32300
tmpfs           196M     0  196M   0% /run/user/0
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-aaf0ccd6f78a95e05800
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-80f3a5cbecb6d6b4e701
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-a0f4e792c2fdc7e83900
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-bca2e483aff7c3a66100
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-e680b8a093c7fd8e0a00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-def795f1ddd2edcfde01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-aa8cb19cdbf1e6f9ae01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-88d5ceebdfc8a6d4bf01
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-d28cb1aaefedb4f9c701
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-86e2e3ebd4e9a4d43800
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-98ed8691e589fbf80d00
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-88abdafed6d3eabdc301
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-86ba98a4e3e3f1c52600
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-e4e3d3dbb387afef0600
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-80e0f4b493bafe9bb401
127.0.0.1:/     8.0E  139M  8.0E   1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-b6d38f9cd0b8eea92400

Supporting Log Snippets

Would rather not post logs publicly.

Thanks for any help.

stewartcampbell commented Feb 08 '22

Hi, thanks for submitting the issue. To help us debug, could you send the following information via email to [email protected]:

  • the task definition used
  • instance logs collected via the ECS log collector on the instance that was having the issue.

Thanks. In the meantime, I will try to repro the issue.

fenxiong commented Feb 08 '22

Thanks for getting back to me. I have sent the files.

For reference, the load is still increasing.

[image: updated CPU usage graph]

stewartcampbell commented Feb 09 '22

Thanks for the information. I was able to repro the issue by running 10 short-lived tasks per minute that use an EFS volume, and observed that CPU usage increases slowly but consistently over time even though the task launch rate did not change. I will be taking a deeper look into what is consuming the CPU.

fenxiong commented Feb 10 '22

Upon further investigation, it appears that the repro I mentioned in the previous comment was not valid. Specifically, I was using ps to measure the CPU percentage, which, according to man ps, does not reflect real-time CPU usage but instead the percentage averaged over the process lifetime:

    CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process.

So when a CPU-intensive task started, the lifetime-averaged CPU percentage that ps shows for dockerd keeps growing. When I switched to top, which shows CPU percentage in real time, I observed that CPU usage reaches a similar level whenever a task is running and drops to 0 when no task is running. I wonder if you might be seeing a similar effect, as the CPU usage reported by ps on your instance (collected in the logs you sent) is similar to what you have in the screenshot. So, to confirm, how exactly are you measuring Docker's CPU usage? @stewartcampbell
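For illustration (not from the original comment), the difference between the two measurements can be seen directly on an instance where dockerd is running, e.g.:

# Lifetime-averaged CPU%, as reported by ps (cumulative CPU time divided by process lifetime)
ps -o pid,etime,%cpu,comm -p "$(pidof dockerd)"

# Current CPU%, sampled over a 5-second interval, as reported by top in batch mode
top -b -d 5 -n 2 -p "$(pidof dockerd)" | grep dockerd | tail -1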

fenxiong commented Feb 14 '22

I'm using the New Relic Infrastructure agent. You can see the same increase using CloudWatch metrics, though only at the instance CPU level.

[image: CloudWatch CPU utilization graph]

If I restart the Docker daemon or the EC2 instance is replaced by the capacity provider, the average CPU load on the new instance will be low again.

I should also note that even after removing the NR agent, the CPU increase still occurs, so it isn't related to the NR agent.
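For anyone wanting to confirm the same instance-level trend from the CLI, it can be pulled from CloudWatch along these lines (the instance ID and time range below are placeholders):

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2022-02-08T00:00:00Z \
  --end-time 2022-02-15T00:00:00Z \
  --period 3600 \
  --statistics Average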

stewartcampbell commented Feb 15 '22

Hi! Following up on this issue.

@stewartcampbell, are you certain that what is consuming the CPU is the Docker daemon? How are you able to determine that it's dockerd rather than anything else on the instance?

If we're certain the issue is in dockerd, we can try profiling it to see if we can pinpoint the issue. In order to do so, I need you to do the following on one of the instances exhibiting the problem:

1. Enable debug mode for the Docker daemon:

  • If the file /etc/docker/daemon.json DOESN'T exist, create it with the debug flag using the command below. Otherwise, just edit the file and add "debug": true to the existing JSON.
echo '{"debug": true}' | sudo tee /etc/docker/daemon.json
  • Reload the Docker daemon configuration
sudo kill -SIGHUP $(pidof dockerd)
  • Verify Docker is still running
docker ps
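As an extra sanity check (not part of the original steps), debug mode should now be reflected in docker info and in the daemon logs:

# "Debug Mode: true" should appear under the Server section
docker info 2>/dev/null | grep -i 'debug mode'
# debug-level daemon log lines should start appearing in the journal
sudo journalctl -u docker --since '5 minutes ago' | grep -i 'level=debug' | head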

2. Bind dockerd's pprof endpoint to a local port

socat -d -d TCP-LISTEN:8181,fork,bind=127.0.0.1 UNIX:/var/run/docker.sock &
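As a quick check that the forward is working (an extra step, not in the original instructions), the Docker API ping endpoint should answer through the forwarded port:

curl http://localhost:8181/_ping
# expected to print: OK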

3. Generate profiles

  • Generate a CPU profile. The command collects CPU data in the background for 30 minutes, which will hopefully be enough to see what Docker is doing. Make sure you execute the command below while your instance is running your tasks normally for at least 30 minutes (i.e. 10 short-lived tasks per minute, as described in this issue's summary).
curl "http://localhost:8181/debug/pprof/profile?seconds=1800" --output dockerd-30m.profile &
  • Generate a goroutine profile

The goal of this command is to see whether Docker is leaking goroutines. Since this is only visible after the fact, please make sure you execute this command after your tasks have been running normally for at least 30 minutes.

curl "http://localhost:8181/debug/pprof/goroutine" --output dockerd.goroutine

4. Send us the data

Please send the dockerd-30m.profile and dockerd.goroutine files generated in step 3, as well as the full ECS log collector output, to [email protected] for analysis.
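For reference, profiles in this format can be inspected locally with Go's pprof tooling (this assumes a Go toolchain is available; it is not part of the data-collection steps above):

# Summarize the top CPU consumers in the 30-minute profile
go tool pprof -top dockerd-30m.profile

# Summarize goroutines grouped by stack
go tool pprof -top dockerd.goroutine

# Or explore interactively in a browser
go tool pprof -http=:8080 dockerd-30m.profile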

angelcar commented Feb 21 '22

Hi. I have sent the email as requested.

I can see the Docker daemon's CPU usage increasing using the New Relic Infrastructure agent. The instance I was working on was recycled earlier, so I don't have a long history for it, but you can see from the image below that the CPU usage of the dockerd process is slowly increasing. The images I posted earlier in this issue show it increasing over a longer period.

[image: New Relic graph showing dockerd CPU usage slowly increasing]

stewartcampbell commented Feb 22 '22

Hi, I was going over the profiles but nothing really stood out. The CPU profile shows Docker busy with routine work, and nothing seems related to EFS (see the image below for the CPU visualization).

Similarly, the goroutine profile looks normal; there are not very many active goroutines, but this could be because the profile was taken a bit early.

Any chance you can repeat the exercise on the same instance so that we can see the change over time? If we can't tie this to the EFS plugin, I think it would probably be a good idea to open an issue with Docker directly to see if they're aware of any CPU issues.

It would be great if we could work out a way to repro, since so far this only seems to be happening with your specific workload traffic patterns; we could not repro by starting 10 tasks per minute with EFS volumes attached.

[image: CPU profile visualization (screenshot, Feb 22 2022)]

angelcar commented Feb 23 '22

Thanks for getting back to me. I haven't forgotten about this. Will need to pick it back up next week.

stewartcampbell commented Feb 25 '22

I have added a repo at https://github.com/stewartcampbell/poc-ecs-cpu-issue which will reproduce the issue for you. It is standalone and will deploy a VPC, an ECS cluster, etc., as well as schedule the tasks. It is written in Terraform, and all you need to change is the AWS profile name in provider.tf.
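For anyone deploying the repro, it presumably follows the standard Terraform workflow once the profile name in provider.tf has been set:

# deploy the VPC, ECS cluster, and scheduled tasks
terraform init
terraform apply

# tear everything down when finished
terraform destroy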

I deployed it yesterday and let it run overnight. You can see the slow increase in CPU activity below.

[image: CPU usage graph after running overnight]

Hopefully this will help you prove that it is happening. It uses the latest PHP 7.4 CLI image from ECR Public but doesn't actually run PHP; it just runs sleep 10. Just make sure to run it for a few days, as the CPU increase is only gradual.

stewartcampbell commented Mar 08 '22

I need to tear down this test stack now; however, as can be seen below, the CPU usage is still rising. It does seem to be slowing down, though.

[image: CPU usage graph, still rising]

stewartcampbell commented Mar 14 '22

Hi @angelcar, did you ever get a chance to reproduce this?

stewartcampbell commented May 11 '22

@stewartcampbell, following up on this: I do see that you've got repro steps which do include 2 EFS mount points. As @angelcar mentioned:

Any chance you can repeat the exercise on the same instance so that we can see the change over time? If we can't tie this to the EFS plugin, I think it would probably be a good idea to open an issue with Docker directly to see if they're aware of any CPU issues.

We want to pinpoint whether this is specific to your workload, tied to the EFS plugin, or something else altogether. If you've been able to capture any clear profiling data that points us in a direction, that would be helpful. I'll also try Angel's profiling scenario on my end with the Terraform setup you've provided.

fierlion commented Aug 24 '22

Closing for inactivity. Please feel free to reopen if it still bothers you.

Realmonia commented Dec 14 '22