OOM events not emitted
In the EC2 dmesg logs I see the following:
[Wed Feb 19 06:41:03 2025] Task in /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15 killed as a result of limit of /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15
[Wed Feb 19 06:41:03 2025] memory: usage 1048576kB, limit 1048576kB, failcnt 34667727
[Wed Feb 19 06:41:03 2025] memory+swap: usage 1048576kB, limit 2097152kB, failcnt 0
[Wed Feb 19 06:41:03 2025] kmem: usage 82888kB, limit 9007199254740988kB, failcnt 0
[Wed Feb 19 06:41:03 2025] Memory cgroup stats for /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15: cache:760KB rss:964552KB rss_huge:0KB shmem:4KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:8KB active_anon:964780KB inactive_file:244KB active_file:8KB unevictable:0KB
[Wed Feb 19 06:41:03 2025] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[Wed Feb 19 06:41:03 2025] [28573] 0 28573 28100 26296 66 3 0 0 celery
[Wed Feb 19 06:41:03 2025] [21486] 0 21486 8506047 3590 133 6 0 0 chrome
[Wed Feb 19 06:41:03 2025] [21490] 0 21490 724 121 6 3 0 0 cat
[Wed Feb 19 06:41:03 2025] [21491] 0 21491 724 349 6 3 0 0 cat
[Wed Feb 19 06:41:03 2025] [21493] 0 21493 8395065 583 8 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:03 2025] [21497] 0 21497 8457297 3887 86 4 0 0 chrome
[Wed Feb 19 06:41:03 2025] [21498] 0 21498 8457297 3771 87 4 0 0 chrome
[Wed Feb 19 06:41:03 2025] [21515] 0 21515 8475320 3584 108 4 0 0 chrome
[Wed Feb 19 06:41:03 2025] [ 7794] 0 7794 8503438 3568 137 6 0 0 chrome
[Wed Feb 19 06:41:03 2025] [ 7799] 0 7799 724 136 5 3 0 0 cat
[Wed Feb 19 06:41:03 2025] [ 7800] 0 7800 724 285 6 3 0 0 cat
[Wed Feb 19 06:41:03 2025] [ 7802] 0 7802 8395065 555 8 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:03 2025] [ 7806] 0 7806 8457296 3947 85 5 0 0 chrome
[Wed Feb 19 06:41:03 2025] [ 7807] 0 7807 8457296 3718 84 5 0 0 chrome
[Wed Feb 19 06:41:03 2025] [ 7826] 0 7826 8475320 3480 108 4 0 0 chrome
[Wed Feb 19 06:41:03 2025] [ 6936] 0 6936 8503470 3575 133 5 0 0 chrome
[Wed Feb 19 06:41:03 2025] [ 6940] 0 6940 724 120 5 3 0 0 cat
[Wed Feb 19 06:41:03 2025] [ 6941] 0 6941 724 347 5 3 0 0 cat
[Wed Feb 19 06:41:03 2025] [ 6987] 0 6987 8395065 619 7 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:03 2025] [ 6993] 0 6993 8457297 3915 86 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 6994] 0 6994 8457296 3618 86 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 7011] 0 7011 8472759 2923 105 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 9757] 0 9757 8518836 3606 137 7 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 9761] 0 9761 724 137 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [ 9762] 0 9762 724 367 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [ 9764] 0 9764 8395065 584 8 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [ 9768] 0 9768 8457297 3898 86 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 9769] 0 9769 8457297 3945 87 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 9794] 0 9794 8472759 2958 107 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [24906] 0 24906 8500909 3435 132 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [24910] 0 24910 724 131 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [24911] 0 24911 724 306 6 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [24915] 0 24915 8395065 586 7 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [24968] 0 24968 8457297 3949 86 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [24976] 0 24976 8457297 3792 88 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [25005] 0 25005 8472759 2979 106 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 8875] 0 8875 8503454 3452 131 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 8879] 0 8879 724 137 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [ 8880] 0 8880 724 336 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [ 8882] 0 8882 8395065 584 8 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [ 8886] 0 8886 8457297 3906 87 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 8887] 0 8887 8457298 3707 85 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 8909] 0 8909 8475320 2947 107 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [21034] 0 21034 8500909 3458 130 6 0 0 chrome
[Wed Feb 19 06:41:04 2025] [21038] 0 21038 724 120 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [21039] 0 21039 724 348 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [21047] 0 21047 8395065 531 8 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [21051] 0 21051 8457297 3634 86 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [21052] 0 21052 8457298 3957 86 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [21410] 0 21410 8472759 3325 106 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [29003] 0 29003 8503454 3569 134 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [29007] 0 29007 724 121 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [29008] 0 29008 724 276 6 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [29010] 0 29010 8395065 590 7 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [29038] 0 29038 8457297 3926 88 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [29039] 0 29039 8457297 3435 86 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [29130] 0 29130 8472759 3017 107 6 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 8316] 0 8316 8503486 3458 133 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 8320] 0 8320 724 136 6 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [ 8321] 0 8321 724 351 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [ 8323] 0 8323 8395065 598 6 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [ 8327] 0 8327 8457297 3933 87 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 8328] 0 8328 8457297 3852 87 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 8350] 0 8350 8475320 2930 107 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [20467] 0 20467 8503470 3462 131 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [20471] 0 20471 724 119 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [20472] 0 20472 724 360 6 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [20505] 0 20505 8395065 612 8 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [20509] 0 20509 8457297 3968 86 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [20510] 0 20510 8457298 3989 87 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [20713] 0 20713 8472759 3312 106 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [15866] 0 15866 8503470 3471 135 6 0 0 chrome
[Wed Feb 19 06:41:04 2025] [15871] 0 15871 724 136 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [15872] 0 15872 724 374 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [15881] 0 15881 8395065 574 8 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [15885] 0 15885 8457297 3872 87 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [15886] 0 15886 8457297 3661 87 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [16787] 0 16787 8472759 3222 105 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [15892] 0 15892 8503438 3502 130 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [15896] 0 15896 724 136 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [15897] 0 15897 724 92 5 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [15899] 0 15899 8395065 622 7 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [15903] 0 15903 8457297 3915 86 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [15904] 0 15904 8457296 3834 85 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [15926] 0 15926 8472759 3282 107 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 7873] 0 7873 8503438 3501 132 5 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 7877] 0 7877 724 120 6 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [ 7878] 0 7878 724 110 6 3 0 0 cat
[Wed Feb 19 06:41:04 2025] [ 7880] 0 7880 8395065 519 7 4 0 0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [ 7884] 0 7884 8457297 3871 87 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 7885] 0 7885 8457298 3780 87 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [ 7915] 0 7915 8472759 2915 106 4 0 0 chrome
[Wed Feb 19 06:41:04 2025] [23440] 0 23440 336235 52523 264 4 0 0 celery
[Wed Feb 19 06:41:04 2025] [31521] 0 31521 16229 14441 35 3 0 0 celery
[Wed Feb 19 06:41:04 2025] Memory cgroup out of memory: Kill process 23440 (celery) score 199 or sacrifice child
[Wed Feb 19 06:41:04 2025] Killed process 23440 (celery) total-vm:1344940kB, anon-rss:202336kB, file-rss:7748kB, shmem-rss:0kB
[Wed Feb 19 06:41:04 2025] oom_reaper: reaped process 23440 (celery), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
However, in CloudTrail there is no SubmitTaskStateChange event for this OOM that would trigger the related EventBridge alert rule.
ecs-service-connect-agent-v1.29.6.0-1.amzn2.x86_64
ecs-init-1.86.3-1.amzn2.x86_64
Hello! Per the public docs
Task state change events are delivered in the following format. The detail section below resembles the Task object that is returned from a DescribeTasks API operation in the Amazon Elastic Container Service API Reference. If your containers are using an image hosted with Amazon ECR, the imageDigest field is returned.
Could you double-check that your EventBridge rule can match the task object that's returned by the DescribeTasks API call? From what I can see on our side, the ECS agent does seem to have sent an STSC (SubmitTaskStateChange) for this, which our backend has ACK'd - which suggests that the agent container/ECS service on the host itself is working as intended here.
Related docs/info:
- See also the DescribeTasks response syntax: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_DescribeTasks.html#API_DescribeTasks_ResponseSyntax
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_task_events.html
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_service_events.html
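If it helps, a quick way to eyeball those fields for a recently stopped task is something like the sketch below (AWS SDK for JavaScript v3; the cluster name and task ID are placeholders, and note that stopped tasks are only returned by DescribeTasks for a limited time after they stop):

import { ECSClient, DescribeTasksCommand } from '@aws-sdk/client-ecs';

async function inspectStoppedTask(): Promise<void> {
  const client = new ECSClient({});
  // Placeholders: substitute your cluster name and the task ID (e.g. from the dmesg cgroup path).
  const { tasks } = await client.send(new DescribeTasksCommand({
    cluster: 'my-cluster',
    tasks: ['TASK_ID'],
  }));
  for (const task of tasks ?? []) {
    console.log('task:', task.lastStatus, task.stopCode, task.stoppedReason);
    for (const container of task.containers ?? []) {
      console.log('container:', container.name, container.lastStatus, container.reason);
    }
  }
}

inspectStoppedTask().catch(console.error);

These are the same fields that end up in the detail section of the ECS Task State Change event, so they can be compared directly against the rule's event pattern.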
Hope this helps.
Thanks for following up @mye956. I'm not sure how you were able to confirm this on your side, since the snippet above doesn't give you the AWS account or anything else - but if that's really the case, I'd guess you relied on the task IDs from the first line:
Task in /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15 killed as a result of limit of /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15
Back to your question: I'm not even looking at the EventBridge rule yet - I was looking at the CloudTrail ECS events, and there I saw that those events were not emitted.
The EventBridge rule that I have is as simple as:
import { Rule } from 'aws-cdk-lib/aws-events'; // assuming CDK v2

const rule = new Rule(this, 'EcsContainerOOMRestartRule', {
  ruleName: 'EcsTaskContainerRestartedOOMRule',
  eventPattern: {
    source: ['aws.ecs'],
    detailType: ['ECS Task State Change'],
    detail: {
      lastStatus: ['STOPPED'],
      stoppedReason: ['OutOfMemoryError: container killed due to memory usage'],
    },
  },
});
and then if I don't see the event in CloudTrail, it means it wasn't there and I can't look at its properties.
One thing that is on my mind that could be happening here: the task/container that is causing this has a celery broker that is running child processes as queue task consumers - and this is the process that is being killed, not the main container process.
The question is whether, in that case, ECS would still emit the event (since the OOM did happen, even though not at the task level)
OR
whether, because the parent process is still running and so the task state did not change, this is expected behavior and I need to add a CloudWatch Logs metric filter that looks for the literal log message "Kill process", like this:
// Imports assume CDK v2 (aws-cdk-lib); metricNamespace and this.logGroup are defined elsewhere.
import { Duration } from 'aws-cdk-lib';
import { FilterPattern, MetricFilter } from 'aws-cdk-lib/aws-logs';
import { Alarm, ComparisonOperator, Metric, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';

// Metric filter counting occurrences of the literal string 'SIGKILL' in the log group
const sigkillMetricFilter = new MetricFilter(this, 'SigkillMetricFilter', {
  filterPattern: FilterPattern.literal('SIGKILL'),
  logGroup: this.logGroup,
  metricNamespace,
  metricName: 'SigkillCount',
  metricValue: '1',
});

// Create a metric using the filter
const sigkillMetric = new Metric({
  namespace: metricNamespace,
  metricName: 'SigkillCount',
  statistic: 'Sum',
  period: Duration.minutes(5),
});

// Alarm on any SIGKILL occurrence within a 5-minute window
const sigkillAlarm = new Alarm(this, 'SigkillAlarm', {
  alarmName: 'SIGKILL-Alarm',
  metric: sigkillMetric,
  threshold: 1,
  evaluationPeriods: 1,
  comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  treatMissingData: TreatMissingData.NOT_BREACHING,
});
Hi @rantoniuk, yep, that's right - I just tried tracing back using the task ID from the logs that were provided.
One thing that is on my mind that could be happening here: the task/container that is causing this has a celery broker that is running child processes as queue task consumers - and this is the process that is being killed, not the main container process.
Just to clarify: we're seeing that the task's container doesn't get OOM-killed, but a child process of it does? Do we see the container exiting? If so, is it the only container within the task and/or an essential container?
Would it be better to match stopCode? Maybe the stoppedReason isn't an exact match?
Ref:
- https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Task.html#ECS-Type-Task-stopCode (note that the API docs probably need to be recompiled, because the list here is shorter than the list below)
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/stopped-task-error-codes.html
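To illustrate - a minimal sketch of a looser pattern (I haven't verified the exact strings here; the OOM text may show up on the container-level reason field rather than, or in addition to, the task-level stoppedReason, so a prefix match against it can be more forgiving than an exact string):

import { Match, Rule } from 'aws-cdk-lib/aws-events';

// Sketch only: prefix matching avoids depending on the exact wording/casing of the message.
// Match requires a reasonably recent aws-cdk-lib (CDK v2).
const oomRule = new Rule(this, 'EcsTaskOomRuleSketch', {
  eventPattern: {
    source: ['aws.ecs'],
    detailType: ['ECS Task State Change'],
    detail: {
      lastStatus: ['STOPPED'],
      containers: {
        reason: Match.prefix('OutOfMemory'),
      },
    },
  },
});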
also:
and this is the process that is being killed, not the main container process.
Is your container's root process something that speaks init (e.g., initProcessEnabled is true, or you are running another compatible init daemon)? The root process of a container has to do some process accounting stuff that might not be getting done. Does ECS show the task as stopped? (Maybe ECS won't emit the event if the container remains alive... which might happen if you're not running an init.)
Does ECS show the task as stopped? (Maybe ECS won't emit the event if the container remains alive... which might happen if you're not running an init.)
No, the task is not stopped, only a child process of the task is getting killed - but the relevant dmesg message is visible. It should not be relevant whether the ECS task is stopped or not, since here we are operating on container resource limits - if the resource limit is set and any process is killed because of it, as in this case, an event should be emitted.
Is your container's root process something that speaks init (e.g., initProcessEnabled is true, or you are running another compatible init daemon)?
It's a queueing system that spawns a child task for each new queue request.
The container's final CMD looks like this:
CMD ["celery", "-A", "django_backend", "worker", "-P", "gevent", "-Ofair", "--loglevel", "info", "--concurrency", "1", "-Q", "gpu_q"]
It should not be relevant whether the ECS task is stopped or not, since here we are operating on container resource limits
I would assume that ECS would have to see the state change of the task to emit an event to EventBridge/CloudTrail, since the event is about the task's state, not a subprocess of that task.
It's a queueing system that spawns a child task for each new queue request.
That's totally fine, but if celery is PID 1 in the container, it will be expected to do more book-keeping with processes than it would ordinarily. This Yelp blog post from 2016 goes into detail about why this is the case and what would happen with signal handling without doing this additional book-keeping. Rather than use dumb-init, however, you can just use --init in Docker, which in ECS is initProcessEnabled, unless you are on a very old version of Docker.
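For example, in CDK that flag lives on LinuxParameters - a minimal sketch, assuming an existing task definition (the construct names, image, and memory limit below are placeholders):

import * as ecs from 'aws-cdk-lib/aws-ecs';

// Placeholder: a task definition defined elsewhere in the stack.
declare const taskDefinition: ecs.Ec2TaskDefinition;

// Equivalent of `docker run --init`: a small init becomes PID 1 and reaps/forwards signals
// on behalf of the actual workload (celery in this case).
const linuxParameters = new ecs.LinuxParameters(this, 'InitEnabled', {
  initProcessEnabled: true,
});

taskDefinition.addContainer('celery', {
  image: ecs.ContainerImage.fromRegistry('my-celery-image'), // placeholder image
  memoryLimitMiB: 1024, // placeholder; match whatever hard limit the task already uses
  linuxParameters,
});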
The reason I bring this up is that if a subprocess goes OOM and is killed, but PID 1 in the container isn't doing its job, you could have a container up that consists mostly/entirely of dead processes, and neither Docker nor ECS would be able to know that since the signals weren't propagated correctly to the Linux kernel. This is a weird UNIX thing that bit me many years ago and should be covered in the Docker docs a lot better than it is 😅.
If ECS does actually send events for container subprocess failures, then I would expect running an init process would make those events appear. I don't think it does this, though, because the ECS task doesn't stop for me whenever I have a subprocess crash, and you can imagine that it would be really disruptive for a web app server to do this when it could respawn a worker with less disruption. Because of that, I think that you're only going to see state changes in ECS if the task itself is seen as failed... and it might be that the task itself being seen as failed isn't happening because of the lack of an init daemon at PID 1.
If you're wanting to get alerted on OOMs, I would run fluent-bit or CloudWatch Agent and send the systemd journal to CloudWatch Logs (optionally filter the sender for only OOM events).
I would assume that ECS would have to see the state change of the task to emit an event to EventBridge/CloudTrail, since the event is about the task's state, not a subprocess of that task.
If ECS does actually send events for container subprocess failures, then I would expect running an init process would make those events appear. I don't think it does this, though, because the ECS task doesn't stop for me whenever I have a subprocess crash, and you can imagine that it would be really disruptive for a web app server to do this when it could respawn a worker with less disruption
I do follow your thoughts about signals, PID 1 and child tasks, but I think you're missing my point here.
You're looking at this from the ECS task perspective (and I'm not saying you're wrong, given that we're in the amazon-ecs-agent GH repository). Indeed, the task is not stopped, as the parent process is running.
However, my point is different: I am expecting the ECS service to be able to see and react to the OOM kernel events that are happening, because I can see them in dmesg.
and you can imagine that it would be really disruptive for a web app server to do this when it could respawn a worker with less disruption
That's a valid point, but there is a significant difference between a process that terminates gracefully and is respawned (child job tasks) and a KILL that is enforced via a resource limit at the Docker container/cgroup level. In other words, if there is a limit set in Docker, that limit is exceeded, and something is killed because of it - and that is visible in dmesg - then I do want to get notified about it, because it most probably means the limits need raising.
Because of that, I think that ECS should somehow emit an event to EventBridge in such a case - whether it's consumed from there is another story.
See also https://github.com/aws/amazon-ecs-ami/issues/240 - I think those issues are related.
Hope that makes sense.