peloton
Better failure message on hitting resource limitations.
Currently, when someone schedules a job that may end up using all of the resources allocated by its cgroup, the UI reports REASON_COMMAND_EXECUTOR_FAILED. From looking at the host where this happens, it seems like peloton/mesos knows that the task is failing because it hit this limit...
Task in /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b killed as a result of limit of /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b
Would it be possible to bubble up in the UI that the job was killed due to a resource constraint, and not due to any issue with the code itself that was running?
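To illustrate the kind of host-side check I mean, here is a minimal sketch that reads the container's cgroup v1 memory counters. It assumes the memory controller is mounted at /sys/fs/cgroup/memory and that Mesos places containers under /mesos/<container-id>; the path layout and the program itself are illustrative, not peloton code.

// oomcheck.go: inspect cgroup v1 memory counters for a Mesos container to see
// whether it has been running into its memory limit (assumed path layout).
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readCounter reads a single integer value from a cgroup control file.
func readCounter(dir, name string) (uint64, error) {
	raw, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	// Container ID taken from the dmesg output below; adjust for your host.
	cgroupDir := "/sys/fs/cgroup/memory/mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b"

	limit, err := readCounter(cgroupDir, "memory.limit_in_bytes")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read limit:", err)
		os.Exit(1)
	}
	peak, _ := readCounter(cgroupDir, "memory.max_usage_in_bytes")
	failcnt, _ := readCounter(cgroupDir, "memory.failcnt")

	fmt.Printf("limit=%d bytes, peak usage=%d bytes, failcnt=%d\n", limit, peak, failcnt)
	if failcnt > 0 && peak >= limit {
		fmt.Println("container has hit its cgroup memory limit; an OOM kill is likely")
	}
}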
Could you point me to where you see the "killed as a result of limit..." message?
I saw that in dmesg on the host. Here is the full output with some parts removed.
[1226766.998728] Task in /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b killed as a result of limit of /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b
[1226767.011413] memory: usage 5275648kB, limit 5275648kB, failcnt 246819
[1226767.017969] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[1226767.024779] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[1226767.030975] Memory cgroup stats for /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b: cache:67620KB rss:5208028KB rss_huge:0KB mapped_file:67584KB dirty:0KB writeback:0KB inactive_anon:67584KB active_anon:5208028KB inactive_file:8KB active_file:8KB unevictable:0KB
[1226767.054381] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[1226767.063398] [26342] 0 26342 43688 10963 84 3 0 0 mesos-container
[1226767.073045] [26400] 0 26400 654224 11477 152 5 0 0 mesos-executor
[1226767.082643] [26468] 16451 26468 1111 213 7 3 0 0 sh
[1226767.091187] [26472] 16451 26472 709790 89013 659 5 0 0 python2.7
[1226767.100347] [26595] 16451 26595 318312 7020 77 4 0 0 dbh_clone
[1226767.109477] [26788] 16451 26788 10221696 1452438 4080 22 0 0 python2.7
[1226767.118639] [26814] 16451 26814 81541 4277 125 3 0 0 dbh_clone
[1226767.127788] Memory cgroup out of memory: Kill process 26788 (python2.7) score 1104 or sacrifice child
[1226767.137218] Killed process 26814 (dbh_clone) total-vm:326164kB, anon-rss:11204kB, file-rss:5904kB
[1226831.939895] python2.7 invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
[1226831.947980] python2.7 cpuset=/ mems_allowed=0-1
[1226831.952737] CPU: 21 PID: 26774 Comm: python2.7 Tainted: P OE 4.4.92 #1
[1226831.960489] Hardware name: Redacted, BIOS Redacted
[1226831.968563] 0000000000000286 25203cfc5af37d64 ffffffff812f97a5 ffff883fcccf3e20
[1226831.976208] ffff881fcee8cc00 ffffffff811db195 ffff883ff22e6a00 ffffffff810a1630
[1226831.983853] ffff883ff22e6a00 ffffffff8116dd26 ffff880ffcbc4f80 ffff883f368762b8
[1226831.991480] Call Trace:
[1226831.994102] [<ffffffff812f97a5>] ? dump_stack+0x5c/0x77
[1226831.999585] [<ffffffff811db195>] ? dump_header+0x62/0x1d7
[1226832.005240] [<ffffffff810a1630>] ? check_preempt_curr+0x50/0x90
[1226832.011422] [<ffffffff8116dd26>] ? find_lock_task_mm+0x36/0x80
[1226832.017516] [<ffffffff8116e2b1>] ? oom_kill_process+0x211/0x3d0
[1226832.023696] [<ffffffff811d385f>] ? mem_cgroup_iter+0x1cf/0x360
[1226832.029798] [<ffffffff811d56f3>] ? mem_cgroup_out_of_memory+0x283/0x2c0
[1226832.036671] [<ffffffff811d63cd>] ? mem_cgroup_oom_synchronize+0x32d/0x340
[1226832.043714] [<ffffffff811d1a80>] ? mem_cgroup_begin_page_stat+0x90/0x90
[1226832.050589] [<ffffffff8116e994>] ? pagefault_out_of_memory+0x44/0xc0
[1226832.057214] [<ffffffff815a98b8>] ? page_fault+0x28/0x30
On thinking about it more, I'm guessing Mesos might not actually know that this is getting killed for OOMing, but I thought it was worth looking into.
@michaeljs1990 Thanks for raising the concern. I looked into it more, and Mesos exposes detailed reasons for why a container was terminated, including hitting the memory limit (ln 2607). I will fix this.
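For reference, the case above is reported by Mesos through the task status reason REASON_CONTAINER_LIMITATION_MEMORY (defined in mesos.proto's TaskStatus.Reason). The sketch below is only an illustration of how a scheduler could translate that reason into a clearer UI message; the constant and function names are illustrative, not actual peloton code, and the sample raw message is made up.

// failuremsg.go: illustrative translation of a Mesos task-status reason into
// a user-facing failure message (constants mirror mesos.proto reason names).
package main

import "fmt"

// Reason strings as they appear in mesos.proto's TaskStatus.Reason enum.
const (
	ReasonCommandExecutorFailed     = "REASON_COMMAND_EXECUTOR_FAILED"
	ReasonContainerLimitationMemory = "REASON_CONTAINER_LIMITATION_MEMORY"
	ReasonContainerLimitationDisk   = "REASON_CONTAINER_LIMITATION_DISK"
)

// failureMessage builds the message a UI could show instead of the raw reason.
func failureMessage(reason, rawMessage string) string {
	switch reason {
	case ReasonContainerLimitationMemory:
		return "Task killed: it exceeded the memory limit of its cgroup. " +
			"Consider raising the job's memory request. (" + rawMessage + ")"
	case ReasonContainerLimitationDisk:
		return "Task killed: it exceeded its disk quota. (" + rawMessage + ")"
	default:
		return "Task failed: " + rawMessage
	}
}

func main() {
	// Example: a memory-limit kill rendered for the UI.
	fmt.Println(failureMessage(ReasonContainerLimitationMemory,
		"memory usage 5275648kB, limit 5275648kB"))
}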
Awesome to hear! thanks.
Was this added? I believe I've been seeing better error messages in the UI around this recently, or possibly I'm imagining things.
@vargup did you add the change?
@vargup bump, can you please advise if this was changed?