
OOM events not emitted

Open rantoniuk opened this issue 10 months ago • 7 comments

In the EC2 dmesg logs I see the following:

[Wed Feb 19 06:41:03 2025] Task in /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15 killed as a result of limit of /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15
[Wed Feb 19 06:41:03 2025] memory: usage 1048576kB, limit 1048576kB, failcnt 34667727
[Wed Feb 19 06:41:03 2025] memory+swap: usage 1048576kB, limit 2097152kB, failcnt 0
[Wed Feb 19 06:41:03 2025] kmem: usage 82888kB, limit 9007199254740988kB, failcnt 0
[Wed Feb 19 06:41:03 2025] Memory cgroup stats for /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15: cache:760KB rss:964552KB rss_huge:0KB shmem:4KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:8KB active_anon:964780KB inactive_file:244KB active_file:8KB unevictable:0KB
[Wed Feb 19 06:41:03 2025] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[Wed Feb 19 06:41:03 2025] [28573]     0 28573    28100    26296      66       3        0             0 celery
[Wed Feb 19 06:41:03 2025] [21486]     0 21486  8506047     3590     133       6        0             0 chrome
[Wed Feb 19 06:41:03 2025] [21490]     0 21490      724      121       6       3        0             0 cat
[Wed Feb 19 06:41:03 2025] [21491]     0 21491      724      349       6       3        0             0 cat
[Wed Feb 19 06:41:03 2025] [21493]     0 21493  8395065      583       8       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:03 2025] [21497]     0 21497  8457297     3887      86       4        0             0 chrome
[Wed Feb 19 06:41:03 2025] [21498]     0 21498  8457297     3771      87       4        0             0 chrome
[Wed Feb 19 06:41:03 2025] [21515]     0 21515  8475320     3584     108       4        0             0 chrome
[Wed Feb 19 06:41:03 2025] [ 7794]     0  7794  8503438     3568     137       6        0             0 chrome
[Wed Feb 19 06:41:03 2025] [ 7799]     0  7799      724      136       5       3        0             0 cat
[Wed Feb 19 06:41:03 2025] [ 7800]     0  7800      724      285       6       3        0             0 cat
[Wed Feb 19 06:41:03 2025] [ 7802]     0  7802  8395065      555       8       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:03 2025] [ 7806]     0  7806  8457296     3947      85       5        0             0 chrome
[Wed Feb 19 06:41:03 2025] [ 7807]     0  7807  8457296     3718      84       5        0             0 chrome
[Wed Feb 19 06:41:03 2025] [ 7826]     0  7826  8475320     3480     108       4        0             0 chrome
[Wed Feb 19 06:41:03 2025] [ 6936]     0  6936  8503470     3575     133       5        0             0 chrome
[Wed Feb 19 06:41:03 2025] [ 6940]     0  6940      724      120       5       3        0             0 cat
[Wed Feb 19 06:41:03 2025] [ 6941]     0  6941      724      347       5       3        0             0 cat
[Wed Feb 19 06:41:03 2025] [ 6987]     0  6987  8395065      619       7       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:03 2025] [ 6993]     0  6993  8457297     3915      86       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 6994]     0  6994  8457296     3618      86       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 7011]     0  7011  8472759     2923     105       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 9757]     0  9757  8518836     3606     137       7        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 9761]     0  9761      724      137       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [ 9762]     0  9762      724      367       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [ 9764]     0  9764  8395065      584       8       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [ 9768]     0  9768  8457297     3898      86       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 9769]     0  9769  8457297     3945      87       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 9794]     0  9794  8472759     2958     107       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [24906]     0 24906  8500909     3435     132       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [24910]     0 24910      724      131       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [24911]     0 24911      724      306       6       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [24915]     0 24915  8395065      586       7       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [24968]     0 24968  8457297     3949      86       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [24976]     0 24976  8457297     3792      88       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [25005]     0 25005  8472759     2979     106       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 8875]     0  8875  8503454     3452     131       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 8879]     0  8879      724      137       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [ 8880]     0  8880      724      336       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [ 8882]     0  8882  8395065      584       8       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [ 8886]     0  8886  8457297     3906      87       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 8887]     0  8887  8457298     3707      85       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 8909]     0  8909  8475320     2947     107       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [21034]     0 21034  8500909     3458     130       6        0             0 chrome
[Wed Feb 19 06:41:04 2025] [21038]     0 21038      724      120       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [21039]     0 21039      724      348       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [21047]     0 21047  8395065      531       8       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [21051]     0 21051  8457297     3634      86       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [21052]     0 21052  8457298     3957      86       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [21410]     0 21410  8472759     3325     106       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [29003]     0 29003  8503454     3569     134       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [29007]     0 29007      724      121       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [29008]     0 29008      724      276       6       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [29010]     0 29010  8395065      590       7       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [29038]     0 29038  8457297     3926      88       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [29039]     0 29039  8457297     3435      86       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [29130]     0 29130  8472759     3017     107       6        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 8316]     0  8316  8503486     3458     133       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 8320]     0  8320      724      136       6       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [ 8321]     0  8321      724      351       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [ 8323]     0  8323  8395065      598       6       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [ 8327]     0  8327  8457297     3933      87       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 8328]     0  8328  8457297     3852      87       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 8350]     0  8350  8475320     2930     107       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [20467]     0 20467  8503470     3462     131       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [20471]     0 20471      724      119       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [20472]     0 20472      724      360       6       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [20505]     0 20505  8395065      612       8       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [20509]     0 20509  8457297     3968      86       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [20510]     0 20510  8457298     3989      87       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [20713]     0 20713  8472759     3312     106       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [15866]     0 15866  8503470     3471     135       6        0             0 chrome
[Wed Feb 19 06:41:04 2025] [15871]     0 15871      724      136       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [15872]     0 15872      724      374       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [15881]     0 15881  8395065      574       8       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [15885]     0 15885  8457297     3872      87       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [15886]     0 15886  8457297     3661      87       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [16787]     0 16787  8472759     3222     105       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [15892]     0 15892  8503438     3502     130       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [15896]     0 15896      724      136       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [15897]     0 15897      724       92       5       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [15899]     0 15899  8395065      622       7       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [15903]     0 15903  8457297     3915      86       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [15904]     0 15904  8457296     3834      85       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [15926]     0 15926  8472759     3282     107       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 7873]     0  7873  8503438     3501     132       5        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 7877]     0  7877      724      120       6       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [ 7878]     0  7878      724      110       6       3        0             0 cat
[Wed Feb 19 06:41:04 2025] [ 7880]     0  7880  8395065      519       7       4        0             0 chrome_crashpad
[Wed Feb 19 06:41:04 2025] [ 7884]     0  7884  8457297     3871      87       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 7885]     0  7885  8457298     3780      87       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [ 7915]     0  7915  8472759     2915     106       4        0             0 chrome
[Wed Feb 19 06:41:04 2025] [23440]     0 23440   336235    52523     264       4        0             0 celery
[Wed Feb 19 06:41:04 2025] [31521]     0 31521    16229    14441      35       3        0             0 celery
[Wed Feb 19 06:41:04 2025] Memory cgroup out of memory: Kill process 23440 (celery) score 199 or sacrifice child
[Wed Feb 19 06:41:04 2025] Killed process 23440 (celery) total-vm:1344940kB, anon-rss:202336kB, file-rss:7748kB, shmem-rss:0kB
[Wed Feb 19 06:41:04 2025] oom_reaper: reaped process 23440 (celery), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

However, CloudTrail shows no OOM SubmitTaskStateChange event - the event that should trigger the related EventBridge alert rule.

ecs-service-connect-agent-v1.29.6.0-1.amzn2.x86_64
ecs-init-1.86.3-1.amzn2.x86_64

rantoniuk avatar Feb 19 '25 11:02 rantoniuk

Hello! Per the public docs:

Task state change events are delivered in the following format. The detail section below resembles the Task object that is returned from a DescribeTasks API operation in the Amazon Elastic Container Service API Reference. If your containers are using an image hosted with Amazon ECR, the imageDigest field is returned.

Could you double-check that your EventBridge rule can match the task object that's returned by the DescribeTasks API call? From what I can see on our side, the ECS agent did send a SubmitTaskStateChange (STSC) for this, which our backend ACK'd - that suggests the agent container/ECS service on the host itself is working as intended here.

Related docs/info:

  • See also the DescribeTasks response syntax -> https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_DescribeTasks.html#API_DescribeTasks_ResponseSyntax
  • https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_task_events.html
  • https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_service_events.html

Hope this helps.

mye956 avatar Mar 13 '25 23:03 mye956

Thanks for following up @mye956. I'm not sure how you were able to confirm this on your side, since you don't know the AWS account or anything else from the snippet above - but if that's really the case, I'd guess you relied on the task IDs from the first line:

Task in /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15 killed as a result of limit of /ecs/f209da7f990b45b585488946f0eec404/15e20c7f5997947ffa3adb2b2bff199eea61004e29501677c53751b2c549bb15

Back to your question: I'm not even looking at the EventBridge rule - I was looking at the ECS events in CloudTrail, and those events were not emitted there.

The EventBridge rule that I have is as simple as:

  import { Rule } from 'aws-cdk-lib/aws-events'; // import assuming CDK v2

  const rule = new Rule(this, 'EcsContainerOOMRestartRule', {
    ruleName: 'EcsTaskContainerRestartedOOMRule',
    eventPattern: {
      source: ['aws.ecs'],
      detailType: ['ECS Task State Change'],
      detail: {
        lastStatus: ['STOPPED'],
        stoppedReason: ['OutOfMemoryError: container killed due to memory usage'],
      },
    },
  });

and then, if I don't see the event in CloudTrail, it means it wasn't there and I can't look at its properties.

One thing that is on my mind that could be happening here: the task/container that is causing this has a Celery broker that runs child processes as queue task consumers - and this is the process that is being killed, not the main container process.

The question is whether ECS would still emit the event in that case (since the OOM did happen, even though not at the task level), OR whether - because the parent process is still running and the task state did not change - this is expected behavior and I need to add a CloudWatch Logs metric filter that looks for the literal Kill process log message, like this:

  import { Duration } from 'aws-cdk-lib'; // imports assuming CDK v2
  import { Alarm, ComparisonOperator, Metric, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';
  import { FilterPattern, MetricFilter } from 'aws-cdk-lib/aws-logs';

  // Metric filter that counts SIGKILL occurrences in the container log group
  const sigkillMetricFilter = new MetricFilter(this, 'SigkillMetricFilter', {
    filterPattern: FilterPattern.literal('SIGKILL'),
    logGroup: this.logGroup,
    metricNamespace,
    metricName: 'SigkillCount',
    metricValue: '1',
  });

  // Create a metric using the filter
  const sigkillMetric = new Metric({
    namespace: metricNamespace,
    metricName: 'SigkillCount',
    statistic: 'Sum',
    period: Duration.minutes(5),
  });

  const sigkillAlarm = new Alarm(this, 'SigkillAlarm', {
    alarmName: 'SIGKILL-Alarm',
    metric: sigkillMetric,
    threshold: 1,
    evaluationPeriods: 1,
    comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
    treatMissingData: TreatMissingData.NOT_BREACHING,
  });

rantoniuk avatar Mar 14 '25 12:03 rantoniuk

Hi @rantoniuk, yep, that's right - I just tried tracing back using the task ID from the logs you provided.

One thing that is on my mind that could be happening here: the task/container that is causing this has a Celery broker that runs child processes as queue task consumers - and this is the process that is being killed, not the main container process.

Just to clarify: we're seeing that the task's container doesn't get OOM-killed, but a child process of it does? Do we see the container exiting? If so, is it the only container within the task and/or an essential container?

mye956 avatar Apr 17 '25 19:04 mye956

Would it be better to match stopCode? Maybe the stoppedReason isn't an exact match?

Ref:

  • https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Task.html#ECS-Type-Task-stopCode (note that the API docs probably need to be regenerated, because the list here is shorter than the one below)
  • https://docs.aws.amazon.com/AmazonECS/latest/developerguide/stopped-task-error-codes.html
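
If the stoppedReason string turns out not to be an exact match, a prefix match on the detail fields is another option. A rough sketch, reusing the Rule construct from the snippet above (the construct name is just illustrative, and the exact reason string should be verified against a real event):

  // Sketch only: match any stopped-task event whose stoppedReason starts with
  // "OutOfMemoryError", instead of requiring the full literal string.
  const oomPrefixRule = new Rule(this, 'EcsTaskOOMPrefixRule', {
    eventPattern: {
      source: ['aws.ecs'],
      detailType: ['ECS Task State Change'],
      detail: {
        lastStatus: ['STOPPED'],
        stoppedReason: [{ prefix: 'OutOfMemoryError' }],
      },
    },
  });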

also:

and this is the process that is being killed, not the main container process.

Is your container's root process something that speaks init (e.g., initProcessEnabled is true, or you are running another compatible init daemon)? The root process of a container has to do some process accounting stuff that might not be getting done. Does ECS show the task as stopped? (Maybe ECS won't emit the event if the container remains alive... which might happen if you're not running an init.)

ziggythehamster avatar Apr 17 '25 20:04 ziggythehamster

Does ECS show the task as stopped? (Maybe ECS won't emit the event if the container remains alive... which might happen if you're not running an init.)

No, the task is not stopped; only a child process of the task is getting killed - but the relevant dmesg message is visible. It should not be relevant whether the ECS task is stopped or not, since here we are operating on container resource limits - if the resource limit is set and any process is killed because of it, as in this case, an event should be emitted.

Is your container's root process something that speaks init (e.g., initProcessEnabled is true, or you are running another compatible init daemon)?

It's a queueing system that spawns a child task for each new queue request.

The container's final CMD looks like this:

CMD ["celery", "-A", "django_backend", "worker", "-P", "gevent", "-Ofair", "--loglevel", "info", "--concurrency", "1", "-Q", "gpu_q"]

rantoniuk avatar Apr 18 '25 16:04 rantoniuk

It should not be relevant whether the ECS task is stopped or not, since here we are operating on container resource limits

I would assume that ECS would have to see the state change of the task to emit an event to EventBridge/CloudTrail, since the event is about the task's state, not a subprocess of that task.

It's a queueing system that spawns a child task for each new queue request.

That's totally fine, but if celery is PID 1 in the container, it will be expected to do more book-keeping with processes than it would ordinarily. This Yelp blog post from 2016 goes into detail about why this is the case and what would happen with signal handling without doing this additional book-keeping. Rather than use dumb-init, however, you can just use --init in Docker, which in ECS is initProcessEnabled, unless you are on a very old version of Docker.
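
For reference, in an ECS task definition that flag is linuxParameters.initProcessEnabled; in CDK it would look roughly like the sketch below (the task definition, container name, and image are placeholders, not taken from this issue):

  import * as ecs from 'aws-cdk-lib/aws-ecs';

  // Placeholder task definition and container; the relevant bit is linuxParameters.
  const taskDef = new ecs.Ec2TaskDefinition(this, 'WorkerTaskDef');
  taskDef.addContainer('worker', {
    image: ecs.ContainerImage.fromRegistry('example/celery-worker'),
    memoryLimitMiB: 1024,
    // Equivalent of `docker run --init`: runs a tiny init as PID 1 so that
    // child processes are reaped and exits/signals propagate as expected.
    linuxParameters: new ecs.LinuxParameters(this, 'WorkerLinuxParams', {
      initProcessEnabled: true,
    }),
  });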

The reason I bring this up is that if a subprocess goes OOM and is killed, but PID 1 in the container isn't doing its job, you could have a container up that consists mostly/entirely of dead processes, and neither Docker nor ECS would be able to know that since the signals weren't propagated correctly to the Linux kernel. This is a weird UNIX thing that bit me many years ago and should be covered in the Docker docs a lot better than it is 😅.

If ECS does actually send events for container subprocess failures, then I would expect running an init process would make those events appear. I don't think it does this, though, because the ECS task doesn't stop for me whenever I have a subprocess crash, and you can imagine that it would be really disruptive for a web app server to do this when it could respawn a worker with less disruption. Because of that, I think you're only going to see state changes in ECS if the task itself is seen as failed... and it might be that the task isn't being seen as failed precisely because there is no init daemon at PID 1.

If you want to get alerted on OOMs, I would run fluent-bit or the CloudWatch Agent and ship the systemd journal to CloudWatch Logs (optionally filtering the sender down to only OOM events).
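
Once the journal is in a log group, the same metric-filter idea from earlier in this thread can key off the kernel's OOM messages instead of a generic SIGKILL string. A rough sketch, assuming the journal is already being shipped to a log group named /ec2/journal (the name and constructs are illustrative):

  import { FilterPattern, LogGroup, MetricFilter } from 'aws-cdk-lib/aws-logs';

  // Illustrative only: assumes the instance journal already lands in this log group.
  const journalLogGroup = LogGroup.fromLogGroupName(this, 'JournalLogGroup', '/ec2/journal');

  new MetricFilter(this, 'KernelOomMetricFilter', {
    logGroup: journalLogGroup,
    // Matches the kernel messages visible in the dmesg output above.
    filterPattern: FilterPattern.anyTerm('Memory cgroup out of memory', 'oom_reaper'),
    metricNamespace: 'Custom/ECS',
    metricName: 'KernelOomKillCount',
    metricValue: '1',
  });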

ziggythehamster avatar Apr 18 '25 19:04 ziggythehamster

I would assume that ECS would have to see the state change of the task to emit an event to EventBridge/CloudTrail, since the event is about the task's state, not a subprocess of that task.

If ECS does actually send events for container subprocess failures, then I would expect running an init process would make those events appear. I don't think it does this, though, because the ECS task doesn't stop for me whenever I have a subprocess crash, and you can imagine that it would be really disruptive for a web app server to do this when it could respawn a worker with less disruption

I do follow your thoughts about signals, PID 1 and child tasks, but I think you're missing my point here.

You're looking at this from the ECS task perspective (and I'm not saying you're wrong, given that we're in the amazon-ecs-agent GH repository's scope). Indeed, the task is not stopped, as the parent process is still running.

However, my point is different. I expect the ECS service to be able to see and react to the OOM kernel events that are happening, because I can see them in dmesg.

and you can imagine that it would be really disruptive for a web app server to do this when it could respawn a worker with less disruption

That's a valid point, but there is a significant difference between a process that terminates gracefully and is respawned (child job tasks) and a KILL that is enforced via a resource limit at the Docker container/cgroup level. In other words, if there is a limit set in Docker, that limit is exceeded, and something is killed because of it - and that is visible in dmesg - then I do want to get notified about it, because it most probably means the limits need to be raised.

Because of that, I think ECS should somehow emit the event to EventBridge in such a case - whether it's consumed from there is another story.

See also https://github.com/aws/amazon-ecs-ami/issues/240 - I think those issues are related.

Hope that makes sense.

rantoniuk avatar May 13 '25 11:05 rantoniuk