amazon-ecs-agent
Docker CPU Usage Increases Over Time
Summary
CPU usage increases over time when running short-lived tasks that mount EFS volumes.
Description
We have a slightly unusual setup: roughly 10 short-lived tasks run every minute. Each task mounts 2 EFS volumes, does what it needs to do, then shuts down again. Not all tasks are equal, but none runs longer than 10 minutes, and most finish well under 1 minute. There are long stretches when no tasks are running at all.
We notice that CPU usage, as well as I/O reads and writes, creeps up over time. The graph below shows the increase on a t3.small instance launched yesterday.
The number and frequency of tasks is not changing.
Running systemctl restart docker brings the CPU back down, after which it starts to creep up again as before.
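One way to watch this (a sketch, assuming the sysstat package is installed on the instance) is to sample dockerd's CPU usage at a fixed interval:
# Report dockerd's CPU usage once a minute; requires pidstat from sysstat
pidstat -u -p "$(pidof dockerd)" 60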
Expected Behavior
CPU / IO to stay steady.
Observed Behavior
CPU / IO slowly increases.
Environment Details
[root@ip-10-0-16-34 bin]# docker info
Client:
Context: default
Debug Mode: false
Server:
Containers: 35
Running: 3
Paused: 0
Stopped: 32
Images: 6
Server Version: 20.10.7
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: amazon-ecs-volume-plugin local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc version: 84113eef6fc27af1b01b3181f31bbaf708715301
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.256-197.484.amzn2.x86_64
Operating System: Amazon Linux 2
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.908GiB
Name: ip-10-0-16-34.eu-west-1.compute.internal
ID: MV2W:RZIE:YOBU:UOUS:7KLN:737Q:L577:IZ4Q:B2IA:ZKWH:3HHJ:OBJ3
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
[root@ip-10-0-16-34 /]# cat /etc/ecs/ecs.config
ECS_CLUSTER=dev-xxxxxx-cli-1
ECS_CONTAINER_STOP_TIMEOUT=10s
ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
ECS_RESERVED_MEMORY=200
[root@ip-10-0-16-34 /]# curl http://localhost:51678/v1/metadata
{"Cluster":"dev-xxxxxx-cli-1","ContainerInstanceArn":"arn:aws:ecs:eu-west-1:xxxxxxxxxxxx:container-instance/dev-xxxxxx-cli-1/61f449ba47e1456da3723509768642d0","Version":"Amazon ECS Agent - v1.58.0 (80ad4ca4)"}
[root@ip-10-0-16-34 /]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 970M 0 970M 0% /dev
tmpfs 978M 0 978M 0% /dev/shm
tmpfs 978M 3.0M 975M 1% /run
tmpfs 978M 0 978M 0% /sys/fs/cgroup
/dev/nvme0n1p1 100G 5.2G 95G 6% /
overlay 100G 5.2G 95G 6% /var/lib/docker/overlay2/4efe78757cb957c42ffe401ba33f053e3f91748a2dbc8535421dce9572d8d2b2/merged
shm 64M 0 64M 0% /var/lib/docker/containers/e41525561b6bd9ff16a8409c86bf92b9521df636fe4992b279fafd872927af7c/mounts/shm
overlay 100G 5.2G 95G 6% /var/lib/docker/overlay2/d88d51191a43d506e4e1e395387c6ec3d5b3672e01ca29d95af5bfc2941b168e/merged
shm 64M 0 64M 0% /var/lib/docker/containers/c93fa0d857d8d1817d67eb3e8a48a7e5a5648ce0aec0ead8f0b74b034211dc06/mounts/shm
overlay 100G 5.2G 95G 6% /var/lib/docker/overlay2/57051246ac2e59aca51562ca20511448b61f0bc688f017ab63791179c25ae9f2/merged
shm 64M 0 64M 0% /var/lib/docker/containers/a94ccc4641109bd81cdb6b682f80517698fedf4457b61511d403b663e2133b89/mounts/shm
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-9a8ae5c0ffb0e09c9f01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-c8bb82de8cd0eda28001
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-a2cec5ff969afe8ec801
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-90dae2aad7efeda6c201
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-ec81ea8a9aaef0820700
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-d2edcbc8e0dad8ef6a00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-eecbf2a393dbd5d40500
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-c2d7ffe4ce9daff9b001
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-feb99ed1e1dbd786de01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-e0a5ccdf9efd80aaed01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-84e599c9d7ffd5e77200
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-d6bab1e6f99d8e86de01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-80b99494d5d6bcd1fc01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-949be4d18f83efadcd01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-94fbc6c8b785a1c10e00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-e282e189dfceafe2e301
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-ae839cf19dd2c3a33d00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-d8fed99b87a3a7d29801
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-acac94addfbad3f8e501
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-cce2ddb18ec193800a00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-e4e8d09adcd5e6fc7900
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-eeba9acdac83c0d01c00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-a0abe9f8a2ba95e73b00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-9ec6e2baeae9f28ac201
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-f0b1daac81c6f3d25700
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-ea8394f29e9f8fc66e00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-9aabf2e4849083860f00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-92bceba498e797b30700
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-9695809b82dd80846d00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-f286d4c7e19f80e7ea01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-86e5b2e8eed3eac68401
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-fa9a86ae9692b3c32300
tmpfs 196M 0 196M 0% /run/user/0
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-aaf0ccd6f78a95e05800
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-80f3a5cbecb6d6b4e701
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-a0f4e792c2fdc7e83900
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-bca2e483aff7c3a66100
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-e680b8a093c7fd8e0a00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-def795f1ddd2edcfde01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-aa8cb19cdbf1e6f9ae01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-88d5ceebdfc8a6d4bf01
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-d28cb1aaefedb4f9c701
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-appdata-86e2e3ebd4e9a4d43800
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-98ed8691e589fbf80d00
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-88abdafed6d3eabdc301
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-xxxxxxdemoqauktest-ecs-xxxxxx-apache-cli-5-ftpdata-86ba98a4e3e3f1c52600
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-e4e3d3dbb387afef0600
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-ftpdata-80e0f4b493bafe9bb401
127.0.0.1:/ 8.0E 139M 8.0E 1% /var/lib/ecs/volumes/ecs-dev-stewart-xxxxxx-apache-cli-6-appdata-b6d38f9cd0b8eea92400
Supporting Log Snippets
Would rather not post logs publicly.
Thanks for any help.
Hi, thanks for submitting the issue. To help us debug, could you send the following information via email to [email protected]:
- the task definition used
- instance logs collected via the ECS logs collector on the instance that was having the issue (a sample invocation is sketched after this list)
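A sample invocation of the ECS logs collector on the affected instance (a sketch; this assumes the script is still published at this path in the aws/amazon-ecs-logs-collector repository):
# Download and run the ECS logs collector; it bundles agent and Docker logs for sharing
curl -fsSL -O https://raw.githubusercontent.com/aws/amazon-ecs-logs-collector/master/ecs-logs-collector.sh
sudo bash ecs-logs-collector.sh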
Thanks. In the meantime I will try to repro the issue.
Thanks for getting back to me. I have sent the files.
For reference, the load is still increasing.
Thanks for the information. I was able to repro the issue by running 10 short-lived tasks per minute that use EFS volumes, and observed that the CPU usage increases slowly but consistently over time even though the task launch rate did not change. I will be taking a deeper look into what is consuming the CPU.
Upon further investigation, it appears that the repro I mentioned in the previous comment was not valid. Specifically, I was using ps to measure the CPU percentage, which, according to man ps, does not reflect real-time CPU usage, but rather the percentage averaged over the process lifetime:
CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process.
So when a CPU-intensive task started, the averaged CPU percentage that ps reported for dockerd kept growing. When I switched to top, which shows the CPU percentage in real time, I observed that CPU usage reaches a similar level whenever a task is running and drops to 0 when no task is running. I wonder if you might be seeing the same effect, as the CPU usage output by ps on your instance (collected in the logs you sent) is similar to what you have in the screenshot (a minimal way to compare the two measurements is sketched below). So to confirm, how exactly are you measuring Docker's CPU usage? @stewartcampbell
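For reference, a minimal way to see the difference between the two measurements on an instance (assuming dockerd is running and the procps tools are installed):
# %CPU averaged over dockerd's entire lifetime (what ps reports)
ps -o pid,etime,%cpu -p "$(pidof dockerd)"
# Instantaneous %CPU for dockerd, sampled once in batch mode (what top reports)
top -b -n 1 -p "$(pidof dockerd)" | tail -n 2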
I'm using the New Relic infrastructure agent. You can see the same increase at the instance CPU level using CloudWatch metrics, though.
If I restart the Docker daemon, or the EC2 instance is replaced by the capacity provider, the average CPU load on the new instance is low again.
I should also note that the CPU increase still occurs even after removing the NR agent, so it isn't related to the NR agent.
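As an illustration, the same trend can be pulled from CloudWatch with the AWS CLI; the instance ID and time range below are placeholders:
# Hourly average CPUUtilization for the container instance over one day
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2021-11-01T00:00:00Z \
  --end-time 2021-11-02T00:00:00Z \
  --period 3600 \
  --statistics Average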
Hi! Following up on this issue.
@stewartcampbell, are you certain that what is consuming the CPU is the Docker daemon? How are you able to determine it's dockerd rather than anything else on the instance?
If we're certain the issue is in dockerd, we can try profiling it to pinpoint the cause. In order to do so, I need you to do the following on one of the instances showing the problem:
1. Enable debug mode for the Docker daemon:
- If the file /etc/docker/daemon.json DOESN'T exist, create it with the debug flag using the command below. Otherwise, just edit the file and add "debug": true to the existing JSON.
echo '{"debug": true}' | sudo tee /etc/docker/daemon.json
- Reload docker
sudo kill -SIGHUP $(pidof dockerd)
- Verify docker is running
docker ps
2. Bind dockerd's pprof endpoint to a local port
socat -d -d TCP-LISTEN:8181,fork,bind=127.0.0.1 UNIX:/var/run/docker.sock &
3. Generate profiles
- Generate a CPU profile. The command will collect CPU data in the background for 30 minutes. Hopefully this will be sufficient to see what docker is doing. Make sure that you execute the command below and that your instance is running your tasks normally for at least 30 minutes (i.e. 10 short-lived tasks per minute, as described in this issue's summary).
curl "http://localhost:8181/debug/pprof/profile?seconds=1800" --output dockerd-30m.profile &
- Generate goroutine profile
The goal of this command is to see if docker is leaking goroutines. Since we can only see this after the fact, please make sure you execute it only AFTER your tasks have been running normally for at least 30 minutes.
curl "http://localhost:8181/debug/pprof/goroutine" --output dockerd.goroutine
4. Send us the data
Please send the dockerd-30m.profile and dockerd.goroutine files generated in step 3, as well as the full ECS log suite, to [email protected] for analysis.
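For anyone following along, the collected profiles can be inspected locally with Go's pprof tooling; a minimal sketch, assuming a Go toolchain is available on the machine doing the analysis:
# Show the functions consuming the most CPU in the 30-minute profile
go tool pprof -top dockerd-30m.profile
# Or explore interactively in a web UI (flame graph, call graph) on port 8080
go tool pprof -http=:8080 dockerd-30m.profile
# The goroutine dump can be summarized the same way
go tool pprof -top dockerd.goroutine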
Hi. I have sent the email as requested.
I can see the Docker daemon's CPU usage increasing using the New Relic infrastructure agent. The instance I was working on was recycled earlier, so I don't have a long history for it, but you can see from the image below that the CPU usage for the docker process is slowly increasing. Images I posted earlier in this issue show it increasing over a longer period.
Hi, I was going over the profiles but nothing really stood out. The CPU profile shows docker busy with routine work, and nothing seems related to EFS (see the image below for the CPU visualization).
Similarly, the goroutine profile looks normal; there are not very many active goroutines, but this could be because the profile was taken a bit early.
Any chance you can repeat the exercise on the same instance so that we can see the change over time? If we can't tie this to the EFS plugin, I think it would probably be a good idea to open an issue with Docker directly to see if they're aware of any CPU issues.
It would be great if we could work out a way to repro, since so far this only seems to be happening with your specific workload traffic patterns; we could not repro by starting 10 tasks per minute with EFS volumes attached.
Thanks for getting back to me. I haven't forgotten about this. Will need to pick it back up next week.
I have added a repo at https://github.com/stewartcampbell/poc-ecs-cpu-issue which will reproduce the issue for you. It is standalone and will deploy a VPC, an ECS cluster, etc., as well as schedule tasks. It is written in Terraform, and all you need to change is the AWS profile name in provider.tf.
I deployed the same yesterday and let it run overnight. You can see the slow increase of CPU activity below.
Hopefully this will help you prove that it is happening. It uses the latest PHP 7.4 CLI image from ECR Public, but doesn't actually run PHP; it just runs sleep 10. Just make sure to run it for a few days, as the CPU increase is only gradual.
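For illustration only, a rough sketch of the kind of task definition such a repro registers; the family, container and volume names, image path, and file system IDs below are placeholders and are not taken from the linked repo:
# Hypothetical task definition: a php:7.4-cli container that just sleeps,
# with two EFS volumes mounted (IDs below are placeholders)
aws ecs register-task-definition --cli-input-json '{
  "family": "poc-ecs-cpu-issue",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [{
    "name": "sleeper",
    "image": "public.ecr.aws/docker/library/php:7.4-cli",
    "command": ["sleep", "10"],
    "memory": 128,
    "essential": true,
    "mountPoints": [
      {"sourceVolume": "appdata", "containerPath": "/mnt/appdata"},
      {"sourceVolume": "ftpdata", "containerPath": "/mnt/ftpdata"}
    ]
  }],
  "volumes": [
    {"name": "appdata", "efsVolumeConfiguration": {"fileSystemId": "fs-xxxxxxxx"}},
    {"name": "ftpdata", "efsVolumeConfiguration": {"fileSystemId": "fs-yyyyyyyy"}}
  ]
}'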
I need to tear down this test stack now; however, as can be seen below, the CPU usage is still rising. It does seem to be slowing down, though.
Hi @angelcar did you ever get a chance to reproduce this?
@stewartcampbell, following up on this -- I do see that you've got steps to repro, which do include 2 EFS mount points. As @angelcar mentioned:
Any chance you can repeat the exercise in the same instance so that we can see the change over time? If we can't tie this to efs plugin, I think it probably be a good idea to open an issue with docker directly to see if they're aware of any CPU issues.
We want to pinpoint whether this is specific to your workload, tied to the EFS plugin, or something else altogether. If you've been able to capture any clear profiling data to point us in a direction, that would be helpful. I'll also try out Angel's profiling scenario on my end with the Terraform setup you've provided.
Closing for inactivity. Please feel free to reopen if it still bothers you.