amazon-ecs-agent

Incorrect memory utilization under cgroup2 on Amazon Linux 2022

Summary

The memory utilization metric reported to CloudWatch doesn't match the metric reported by the Docker CLI when running ECS containers on Amazon Linux 2022. This is a regression of #280.

Description

Amazon Linux 2022 now uses cgroup2, and as a result the memory stats reported by Docker have changed from statsV1 to statsV2. Notably, the statsV2 memory stats no longer include the cache property.

Since #582, the memory utilization reported to CloudWatch has been calculated as (memory_stats.usage - memory_stats.stats.cache) / memory_stats.limit, which matched the Docker CLI at the time. However, with the cache property missing from the statsV2 memory stats, this calculation is no longer accurate.

The Docker CLI currently calculates the memory utilization as (memory_stats.usage - memory_stats.stats.total_inactive_file) / memory_stats.limit under cgroup1 and as (memory_stats.usage - memory_stats.stats.inactive_file) / memory_stats.limit under cgroup2.
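For reference, a minimal Go sketch of that calculation (illustrative only, not the agent's or the CLI's actual code; the field names are those from the Docker stats API):

// memUsagePercent mirrors the Docker CLI approach: subtract the inactive
// file cache from usage, then divide by the limit.
package main

import "fmt"

func memUsagePercent(usage, limit uint64, stats map[string]uint64) float64 {
	// cgroup v1 exposes total_inactive_file; cgroup v2 exposes inactive_file.
	if v, ok := stats["total_inactive_file"]; ok && v < usage {
		usage -= v
	} else if v, ok := stats["inactive_file"]; ok && v < usage {
		usage -= v
	}
	if limit == 0 {
		return 0
	}
	return float64(usage) / float64(limit) * 100
}

func main() {
	// Example values taken from the stats JSON in the Supporting Log Snippets below.
	stats := map[string]uint64{"inactive_file": 2446245888}
	fmt.Printf("%.2f%%\n", memUsagePercent(3985747968, 7340032000, stats)) // ~20.97%
}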

Expected Behavior

The CloudWatch memory utilization metric should match the metric reported by the Docker CLI.

Observed Behavior

A container with 20.97% memory utilization according to the Docker CLI is reported as 54.30% memory utilization by CloudWatch.

Environment Details

$ docker info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cc61520f4cd876b86e77edfeb88fbcd536d1f9d
 runc version: f46b6ba2c9314cfc8caae24a32ec5fe9ef1059fe
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.43-20.123.amzn2022.x86_64
 Operating System: Amazon Linux 2022
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.472GiB
 Name: ip-10-2-128-116.us-west-2.compute.internal
 ID: 6PVI:LERT:ND5G:UZKQ:7GG2:TTOF:OKQM:QFJW:EDPN:PS6V:ON22:YAR6
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Supporting Log Snippets

Docker stats JSON:

{
  "memory_stats": {
    "usage": 3985747968,
    "stats": {
      "active_anon": 0,
      "active_file": 1489485824,
      "anon": 970752,
      "anon_thp": 0,
      "file": 3935731712,
      "file_dirty": 0,
      "file_mapped": 0,
      "file_writeback": 0,
      "inactive_anon": 970752,
      "inactive_file": 2446245888,
      "kernel_stack": 16384,
      "pgactivate": 367031,
      "pgdeactivate": 0,
      "pgfault": 8135005,
      "pglazyfree": 0,
      "pglazyfreed": 0,
      "pgmajfault": 18948,
      "pgrefill": 0,
      "pgscan": 0,
      "pgsteal": 0,
      "shmem": 0,
      "slab": 48933808,
      "slab_reclaimable": 48735888,
      "slab_unreclaimable": 197920,
      "sock": 0,
      "thp_collapse_alloc": 0,
      "thp_fault_alloc": 0,
      "unevictable": 0,
      "workingset_activate": 0,
      "workingset_nodereclaim": 0,
      "workingset_refault": 0
    },
    "limit": 7340032000
  }
}
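
Plugging these values into the two formulas reproduces the discrepancy above (with no cache field present, the agent effectively subtracts nothing):

agent / CloudWatch:   3985747968 / 7340032000                  ≈ 0.5430 → 54.30%
Docker CLI (cgroup2): (3985747968 - 2446245888) / 7340032000   ≈ 0.2097 → 20.97%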

ltm avatar Jul 29 '22 15:07 ltm

Hi,

Thanks for reporting this! Looking into it.

Realmonia avatar Aug 08 '22 22:08 Realmonia

I have instrumented the ecs-agent with logging to confirm that counting "inactive_file" instead of "cache" does in fact report a more accurate memory usage to CloudWatch, or at least one that more closely matches what the Docker CLI reports.

See the memory usage from the Docker CLI and the "FIXED" memory usage below, which uses inactive_file instead of cache:

on AL2022:

src/github.com/aws/amazon-ecs-agent % docker stats f6bf24ec4486 --no-stream
CONTAINER ID   NAME                                             CPU %     MEM USAGE / LIMIT   MEM %     NET I/O       BLOCK I/O     PIDS
f6bf24ec4486   ecs-tomcat-sample-4-main1-f4e2c581ecd7d49c9201   0.05%     51.85MiB / 256MiB   20.25%    1.14kB / 0B   0B / 2.03MB   30

src/github.com/aws/amazon-ecs-agent % tail -f /var/log/ecs/ecs-agent.log   
level=info time=2022-08-29T23:00:50Z msg="DEBUG USAGE: 51.8906 mb    FIXED: 51.8516" module=utils_unix.go
level=info time=2022-08-29T23:00:51Z msg="DEBUG USAGE: 51.8906 mb    FIXED: 51.8516" module=utils_unix.go

sparrc avatar Aug 29 '22 23:08 sparrc

FWIW we also seem to have the same issue on AL2 (with respect to docker-cli memory usage accuracy). @ltm can you confirm if you only see this issue on AL2022? Does your application maybe create some "cache" memory that offsets the difference on AL2?

on AL2:

level=info time=2022-08-29T23:28:09Z msg="DEBUG PRE: 52.605      FIXED: 52.566" module=utils_unix.go
level=info time=2022-08-29T23:28:10Z msg="DEBUG PRE: 52.605      FIXED: 52.566" module=utils_unix.go
^C
CONTAINER INST src/github.com/aws/amazon-ecs-agent % docker stats fe5022acf01f --no-stream
CONTAINER ID   NAME                                             CPU %     MEM USAGE / LIMIT   MEM %     NET I/O     BLOCK I/O   PIDS
fe5022acf01f   ecs-tomcat-sample-4-main1-dceffee1c4abc3cedf01   0.05%     52.57MiB / 256MiB   20.54%    820B / 0B   0B / 0B     31
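
(One quick way to check which memory fields Docker actually reports on a given host is to query the stats endpoint directly; the container ID below is a placeholder and jq is assumed to be installed:)

curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/<container-id>/stats?stream=false" | jq '.memory_stats.stats | keys'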

sparrc avatar Aug 29 '22 23:08 sparrc