peloton
Better failure message on hitting resource limitations.
Currently, when someone schedules a job that may end up using all of the resources allocated by its cgroup, the UI reports REASON_COMMAND_EXECUTOR_FAILED. From looking at the host where this happens, it seems like peloton/mesos knows that the task is failing because it hit this limit...
Task in /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b killed as a result of limit of /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b
Would it be possible to bubble up in the UI that the job was killed due to a resource constraint, and not due to any issue with the code itself that was running?
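To illustrate the kind of host-side check I mean, here is a minimal sketch that reads the container's cgroup v1 memory counters. It assumes the memory controller is mounted at /sys/fs/cgroup/memory and that Mesos places containers under /mesos/<container-id>; the path layout and the program itself are illustrative, not peloton code.

// oomcheck.go: inspect cgroup v1 memory counters for a Mesos container to see
// whether it has been running into its memory limit (assumed path layout).
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readCounter reads a single integer value from a cgroup control file.
func readCounter(dir, name string) (uint64, error) {
	raw, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	// Container ID taken from the dmesg output below; adjust for your host.
	cgroupDir := "/sys/fs/cgroup/memory/mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b"

	limit, err := readCounter(cgroupDir, "memory.limit_in_bytes")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read limit:", err)
		os.Exit(1)
	}
	peak, _ := readCounter(cgroupDir, "memory.max_usage_in_bytes")
	failcnt, _ := readCounter(cgroupDir, "memory.failcnt")

	fmt.Printf("limit=%d bytes, peak usage=%d bytes, failcnt=%d\n", limit, peak, failcnt)
	if failcnt > 0 && peak >= limit {
		fmt.Println("container has hit its cgroup memory limit; an OOM kill is likely")
	}
}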
Could you point me to where you see the "killed as a result of limit..." message?
I saw that in dmesg on the host. Here is the full output with some parts removed.
[1226766.998728] Task in /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b killed as a result of limit of /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b
[1226767.011413] memory: usage 5275648kB, limit 5275648kB, failcnt 246819
[1226767.017969] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[1226767.024779] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[1226767.030975] Memory cgroup stats for /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b: cache:67620KB rss:5208028KB rss_huge:0KB mapped_file:67584KB dirty:0KB writeback:0KB inactive_anon:67584KB active_anon:5208028KB inactive_file:8KB active_file:8KB unevictable:0KB
[1226767.054381] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[1226767.063398] [26342] 0 26342 43688 10963 84 3 0 0 mesos-container
[1226767.073045] [26400] 0 26400 654224 11477 152 5 0 0 mesos-executor
[1226767.082643] [26468] 16451 26468 1111 213 7 3 0 0 sh
[1226767.091187] [26472] 16451 26472 709790 89013 659 5 0 0 python2.7
[1226767.100347] [26595] 16451 26595 318312 7020 77 4 0 0 dbh_clone
[1226767.109477] [26788] 16451 26788 10221696 1452438 4080 22 0 0 python2.7
[1226767.118639] [26814] 16451 26814 81541 4277 125 3 0 0 dbh_clone
[1226767.127788] Memory cgroup out of memory: Kill process 26788 (python2.7) score 1104 or sacrifice child
[1226767.137218] Killed process 26814 (dbh_clone) total-vm:326164kB, anon-rss:11204kB, file-rss:5904kB
[1226831.939895] python2.7 invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
[1226831.947980] python2.7 cpuset=/ mems_allowed=0-1
[1226831.952737] CPU: 21 PID: 26774 Comm: python2.7 Tainted: P OE 4.4.92 #1
[1226831.960489] Hardware name: Redacted, BIOS Redacted
[1226831.968563] 0000000000000286 25203cfc5af37d64 ffffffff812f97a5 ffff883fcccf3e20
[1226831.976208] ffff881fcee8cc00 ffffffff811db195 ffff883ff22e6a00 ffffffff810a1630
[1226831.983853] ffff883ff22e6a00 ffffffff8116dd26 ffff880ffcbc4f80 ffff883f368762b8
[1226831.991480] Call Trace:
[1226831.994102] [<ffffffff812f97a5>] ? dump_stack+0x5c/0x77
[1226831.999585] [<ffffffff811db195>] ? dump_header+0x62/0x1d7
[1226832.005240] [<ffffffff810a1630>] ? check_preempt_curr+0x50/0x90
[1226832.011422] [<ffffffff8116dd26>] ? find_lock_task_mm+0x36/0x80
[1226832.017516] [<ffffffff8116e2b1>] ? oom_kill_process+0x211/0x3d0
[1226832.023696] [<ffffffff811d385f>] ? mem_cgroup_iter+0x1cf/0x360
[1226832.029798] [<ffffffff811d56f3>] ? mem_cgroup_out_of_memory+0x283/0x2c0
[1226832.036671] [<ffffffff811d63cd>] ? mem_cgroup_oom_synchronize+0x32d/0x340
[1226832.043714] [<ffffffff811d1a80>] ? mem_cgroup_begin_page_stat+0x90/0x90
[1226832.050589] [<ffffffff8116e994>] ? pagefault_out_of_memory+0x44/0xc0
[1226832.057214] [<ffffffff815a98b8>] ? page_fault+0x28/0x30
On thinking about it more, I'm guessing Mesos might not actually know that this is getting killed for OOMing, but I thought it was worth looking into.
@michaeljs1990 Thanks for raising the concern. I looked into it more, and Mesos exposes detailed reasons for why a container was terminated, including hitting the memory limit (ln 2607). I will fix this.
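For reference, the case above is reported by Mesos through the task status reason REASON_CONTAINER_LIMITATION_MEMORY (defined in mesos.proto's TaskStatus.Reason). The sketch below is only an illustration of how a scheduler could translate that reason into a clearer UI message; the constant and function names are illustrative, not actual peloton code, and the sample raw message is made up.

// failuremsg.go: illustrative translation of a Mesos task-status reason into
// a user-facing failure message (constants mirror mesos.proto reason names).
package main

import "fmt"

// Reason strings as they appear in mesos.proto's TaskStatus.Reason enum.
const (
	ReasonCommandExecutorFailed     = "REASON_COMMAND_EXECUTOR_FAILED"
	ReasonContainerLimitationMemory = "REASON_CONTAINER_LIMITATION_MEMORY"
	ReasonContainerLimitationDisk   = "REASON_CONTAINER_LIMITATION_DISK"
)

// failureMessage builds the message a UI could show instead of the raw reason.
func failureMessage(reason, rawMessage string) string {
	switch reason {
	case ReasonContainerLimitationMemory:
		return "Task killed: it exceeded the memory limit of its cgroup. " +
			"Consider raising the job's memory request. (" + rawMessage + ")"
	case ReasonContainerLimitationDisk:
		return "Task killed: it exceeded its disk quota. (" + rawMessage + ")"
	default:
		return "Task failed: " + rawMessage
	}
}

func main() {
	// Example: a memory-limit kill rendered for the UI.
	fmt.Println(failureMessage(ReasonContainerLimitationMemory,
		"memory usage 5275648kB, limit 5275648kB"))
}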
Awesome to hear! thanks.
Was this added? I believe I've been seeing better error messages in the UI around this recently, or possibly I'm imagining things.
@vargup did you add the change?
@vargup bump, can you please advise if this was changed?