Huge memory available diff reported by ECS agent

Open RobinFrcd opened this issue 3 years ago • 2 comments

Description

I'm running an ECS cluster on c6i.xlarge and inf1.xlarge instances.

I noticed there's often a huge difference between the memory reported by the ECS agent and by htop or free -m. For example, right now on the Inferentia instance, ECS shows Memory available: 3159, but free -m shows:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7679        5879         850         248         949        1424
Swap:             0           0           0

And htop: image

I'm running TorchServe, and I've set memoryReservation=3170, memory=3487, and these ulimits:

"ulimits": [
  {"name": "memlock", "softLimit": -1, "hardLimit": -1},
  {"name": "stack", "softLimit": 67108864, "hardLimit": 67108864}
]

Is there something in my config that could explain this gap between the different RAM usage reports?

I'm running ECS agent v1.61.3 and docker v20.10.13.

This is a real issue for me because sometimes the whole EC2 instance running the ECS agent becomes unresponsive and all services running on it crash. I suspect this happens because a service has a RAM usage spike and the services' memory limits are not taken into account correctly.

Thanks,

RobinFrcd avatar Aug 08 '22 17:08 RobinFrcd

Have you read this article in our ECS developer docs about how ECS manages memory? https://docs.aws.amazon.com/AmazonECS/latest/developerguide/memory-management.html

"The Registered memory value is what the container instance registered with Amazon ECS when it was first launched, and the Available memory value is what has not already been allocated to tasks."

So your instance might have more or less memory available than ECS reports, because ECS only counts the memory allocated to the tasks registered on each container instance.
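The accounting in that quote can be sketched roughly as follows. This is a minimal illustration, not the agent's or the scheduler's code: the function names `ecs_available_memory` and `implied_reserved` are hypothetical, and it assumes each placed task counts against the instance by its reserved memory in MiB. The figures are the ones quoted in this thread, assuming they all refer to the same instance.

```python
def ecs_available_memory(registered_mib, task_reservations_mib):
    """Memory (MiB) not yet reserved by placed tasks.

    Pure bookkeeping: nothing here looks at what the host is actually using.
    """
    return registered_mib - sum(task_reservations_mib)


def implied_reserved(registered_mib, available_mib):
    """Total reservation (MiB) implied by the registered and available values."""
    return registered_mib - available_mib


# Registered memory from this thread (7423) and the TorchServe
# memoryReservation (3170) as the only placed task.
print(ecs_available_memory(7423, [3170]))

# Reservation total implied by "Memory available: 3159".
print(implied_reserved(7423, 3159))
```

Under these assumptions the "available" number moves only when tasks are placed or stopped, which is why it need not track what free -m reports.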

fierlion avatar Aug 30 '22 16:08 fierlion

At startup, without any containers, the system uses 600M of the 7.5G total reported by htop (7423M according to the ECS agent).

When all the services are running, Docker reports 4.6G used by the containers, while htop reports 7G of total RAM usage. 600M (system) + 4.6G (Docker) = 5.2G, so I don't understand what is using the remaining 1.8G :thinking: Any idea?

There's no way to ask the agent to use free -b instead of Docker's ReadMemInfo(), right? So I guess my issue is with Docker and not ECS.
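The gap described above can be written out as a small calculation. This is only a sketch using the rounded figures from this comment, treated as megabytes; `unexplained_memory` is a hypothetical helper name, and the result is no more precise than those rounded inputs.

```python
def unexplained_memory(total_used_mb, system_baseline_mb, containers_mb):
    """Usage (MB) not accounted for by the idle system plus the containers."""
    return total_used_mb - (system_baseline_mb + containers_mb)


# htop total usage ~7G, idle system ~600M, Docker-reported containers ~4.6G.
print(unexplained_memory(7000, 600, 4600))  # 1800
```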

RobinFrcd avatar Aug 31 '22 08:08 RobinFrcd

I think that's the right understanding. The registered memory for ECS comes from ReadMemInfo. The agent currently does this in code here: https://github.com/aws/amazon-ecs-agent/blob/a768d2ac22c0870cfd6d1945c668aa2526797d20/agent/api/ecsclient/client.go#L310

If you want, you can try building a customized agent that uses free -b instead of ReadMemInfo and test with it. Since we haven't seen other reports of this issue, we will keep the current behavior for now. Please let us know how it goes!
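To illustrate the difference between a total read once at registration and a live estimate, here is a minimal sketch that parses /proc/meminfo-style text. It is not the agent's or Docker's implementation; `parse_meminfo` and `SAMPLE` are hypothetical names, and the sample values are simply the free -m figures from the first post converted to kB, not output captured from the instance.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


# Illustrative sample: the free -m numbers from this issue, times 1024.
SAMPLE = """\
MemTotal:        7863296 kB
MemFree:          870400 kB
MemAvailable:    1458176 kB
"""

info = parse_meminfo(SAMPLE)
print(info["MemTotal"] // 1024)      # static total, MiB
print(info["MemAvailable"] // 1024)  # kernel's live estimate, MiB
```

On a Linux host the same function could be fed the contents of /proc/meminfo. The total stays fixed while the available estimate changes with load, so a value derived from the total alone cannot reflect current usage.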

Realmonia avatar Dec 14 '22 00:12 Realmonia

I am closing this one for now. Please reopen it if there is anything more you need us to help investigate!

Realmonia avatar Dec 15 '22 18:12 Realmonia