Incorrect memory utilization under cgroup2 on Amazon Linux 2022
Summary
The memory utilization metric reported to CloudWatch doesn't match the metric reported by the Docker CLI when running ECS containers on Amazon Linux 2022. This is a regression of #280.
Description
Amazon Linux 2022 uses cgroup2, and as a result the memory stats reported by Docker have changed from statsV1 to statsV2. Notably, the statsV2 memory stats no longer include the cache property.
Since #582 the memory utilization reported to CloudWatch has been calculated as (memory_stats.usage - memory_stats.stats.cache) / memory_stats.limit, which matched the Docker CLI at the time. However, with the cache property missing from the statsV2 memory stats, the subtracted term is effectively zero, so the reported value becomes memory_stats.usage / memory_stats.limit, which still includes the page cache and is no longer accurate.
The Docker CLI currently calculates the memory utilization as (memory_stats.usage - memory_stats.stats.total_inactive_file) / memory_stats.limit under cgroup1 and as (memory_stats.usage - memory_stats.stats.inactive_file) / memory_stats.limit under cgroup2.
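For illustration, here is a minimal Go sketch of that Docker CLI-style calculation (the struct and helper names are illustrative, not the actual Docker CLI source; the stat keys follow the Docker stats API JSON):

package main

import "fmt"

// MemoryStats holds the subset of the Docker stats API memory_stats used here.
type MemoryStats struct {
    Usage uint64
    Limit uint64
    Stats map[string]uint64 // per-cgroup counters, e.g. "cache", "inactive_file"
}

// usedBytes mirrors the behavior described above: subtract
// "total_inactive_file" under cgroup1 and "inactive_file" under cgroup2.
func usedBytes(m MemoryStats) uint64 {
    if v, ok := m.Stats["total_inactive_file"]; ok && v < m.Usage { // cgroup1 (statsV1)
        return m.Usage - v
    }
    if v, ok := m.Stats["inactive_file"]; ok && v < m.Usage { // cgroup2 (statsV2)
        return m.Usage - v
    }
    return m.Usage
}

func main() {
    // Values taken from the docker stats JSON attached below.
    m := MemoryStats{
        Usage: 3985747968,
        Limit: 7340032000,
        Stats: map[string]uint64{"inactive_file": 2446245888},
    }
    fmt.Printf("memory utilization: %.2f%%\n", float64(usedBytes(m))/float64(m.Limit)*100) // prints ~20.97%
}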
Expected Behavior
The CloudWatch memory utilization metric should match the metric reported by the Docker CLI.
Observed Behavior
A container with 20.97% memory utilization according to the Docker CLI is reported at 54.30% memory utilization in CloudWatch.
Environment Details
$ docker info
Client:
 Context: default
 Debug Mode: false

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cc61520f4cd876b86e77edfeb88fbcd536d1f9d
 runc version: f46b6ba2c9314cfc8caae24a32ec5fe9ef1059fe
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.43-20.123.amzn2022.x86_64
 Operating System: Amazon Linux 2022
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.472GiB
 Name: ip-10-2-128-116.us-west-2.compute.internal
 ID: 6PVI:LERT:ND5G:UZKQ:7GG2:TTOF:OKQM:QFJW:EDPN:PS6V:ON22:YAR6
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Supporting Log Snippets
Docker stats JSON:
{
  "memory_stats": {
    "usage": 3985747968,
    "stats": {
      "active_anon": 0,
      "active_file": 1489485824,
      "anon": 970752,
      "anon_thp": 0,
      "file": 3935731712,
      "file_dirty": 0,
      "file_mapped": 0,
      "file_writeback": 0,
      "inactive_anon": 970752,
      "inactive_file": 2446245888,
      "kernel_stack": 16384,
      "pgactivate": 367031,
      "pgdeactivate": 0,
      "pgfault": 8135005,
      "pglazyfree": 0,
      "pglazyfreed": 0,
      "pgmajfault": 18948,
      "pgrefill": 0,
      "pgscan": 0,
      "pgsteal": 0,
      "shmem": 0,
      "slab": 48933808,
      "slab_reclaimable": 48735888,
      "slab_unreclaimable": 197920,
      "sock": 0,
      "thp_collapse_alloc": 0,
      "thp_fault_alloc": 0,
      "unevictable": 0,
      "workingset_activate": 0,
      "workingset_nodereclaim": 0,
      "workingset_refault": 0
    },
    "limit": 7340032000
  }
}
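For reference, applying both calculations to the values above (with cache absent, the agent effectively subtracts nothing):
ECS agent:  3985747968 / 7340032000 ≈ 54.30%
Docker CLI: (3985747968 - 2446245888) / 7340032000 ≈ 20.97%
which matches the discrepancy described under Observed Behavior.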
Hi,
Thanks for reporting this! Looking into it.
I have instrumented the ecs-agent with logging to confirm that subtracting "inactive_file" instead of "cache" does in fact report a more accurate memory usage to CloudWatch, or at least one that more closely matches what the docker-cli reports (a rough sketch of the adjusted calculation is at the end of this comment).
See the memory usage from the docker cli, and the "FIXED" memory usage below, which uses inactive_file instead of cache:
on AL2022:
src/github.com/aws/amazon-ecs-agent % docker stats f6bf24ec4486 --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
f6bf24ec4486 ecs-tomcat-sample-4-main1-f4e2c581ecd7d49c9201 0.05% 51.85MiB / 256MiB 20.25% 1.14kB / 0B 0B / 2.03MB 30
src/github.com/aws/amazon-ecs-agent % tail -f /var/log/ecs/ecs-agent.log
level=info time=2022-08-29T23:00:50Z msg="DEBUG USAGE: 51.8906 mb FIXED: 51.8516" module=utils_unix.go
level=info time=2022-08-29T23:00:51Z msg="DEBUG USAGE: 51.8906 mb FIXED: 51.8516" module=utils_unix.go
FWIW we also seem to have the same issue on AL2 (with respect to docker-cli memory usage accuracy). @ltm can you confirm if you only see this issue on AL2022? Does your application maybe create some "cache" memory that offsets the difference on AL2?
on AL2:
level=info time=2022-08-29T23:28:09Z msg="DEBUG PRE: 52.605 FIXED: 52.566" module=utils_unix.go
level=info time=2022-08-29T23:28:10Z msg="DEBUG PRE: 52.605 FIXED: 52.566" module=utils_unix.go
^C
CONTAINER INST src/github.com/aws/amazon-ecs-agent % docker stats fe5022acf01f --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
fe5022acf01f ecs-tomcat-sample-4-main1-dceffee1c4abc3cedf01 0.05% 52.57MiB / 256MiB 20.54% 820B / 0B 0B / 0B 31
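For clarity, here is a rough Go sketch of the adjusted calculation behind the "FIXED" numbers above (illustrative names and example values, not the actual agent code): subtract "cache" when the statsV1 data provides it, otherwise fall back to "inactive_file" from the statsV2 data.

package main

import "fmt"

// fixedUsageMiB subtracts "cache" when present (cgroup1 / statsV1) and falls
// back to "inactive_file" when it is absent (cgroup2 / statsV2).
func fixedUsageMiB(usage uint64, stats map[string]uint64) float64 {
    sub, ok := stats["cache"]
    if !ok {
        sub = stats["inactive_file"]
    }
    if sub > usage {
        sub = 0
    }
    return float64(usage-sub) / 1024 / 1024
}

func main() {
    // Hypothetical byte counts for illustration, not taken from the logs above.
    stats := map[string]uint64{"inactive_file": 131072}
    fmt.Printf("FIXED: %.4f MiB\n", fixedUsageMiB(54525952, stats)) // prints FIXED: 51.8750 MiB
}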