oom kill runltp parent process

Open LYanfeng0601 opened this issue 2 years ago • 6 comments

When the oom testcase is running, the runltp process is killed. As a result, subsequent test cases cannot be executed. Error log: Killing process 97867 (runltp) with signal SIGTERM

LYanfeng0601 avatar Apr 26 '23 07:04 LYanfeng0601

I also hit a similar problem because of a systemd bug. Which systemd version does your system use?

xuyang0410 avatar Apr 26 '23 07:04 xuyang0410

Hello, the systemd version is v243*: https://github.com/systemd/systemd/tree/v243

LYanfeng0601 avatar Apr 26 '23 08:04 LYanfeng0601

It does not seem to be the same problem. Do you have the full dmesg from this run?

xuyang0410 avatar Apr 26 '23 08:04 xuyang0410

[image]

LYanfeng0601 avatar Apr 26 '23 10:04 LYanfeng0601

It seems the ssh-agent process was also killed. I guess you connected to this machine over SSH, ran the LTP test case, and then the session was closed. Is that right? Which Linux distribution and version do you use?

xuyang0410 avatar Apr 27 '23 05:04 xuyang0410

@xuyang0410 Hi, I also encountered the same problem. runltp was killed by the oom-killer when oom02 was executed.


May 16 21:02:24 localhost kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
May 16 21:02:24 localhost kernel: [   1211]     0  1211      544        6   327680      107         -1000 systemd-udevd
May 16 21:02:24 localhost kernel: [   1501]    81  1501      517       35   393216       44          -900 dbus-daemon

...

May 16 21:02:24 localhost kernel: [  98679]     0 98679     3549        2   458752       29             0 runltp
May 16 21:02:24 localhost kernel: [  98867]     0 98867       56        0   393216       24             0 ltp-pan
May 16 21:02:24 localhost kernel: [2343580]     0 2343580     1014      124   393216       78          -250 systemd-journal

...

May 16 21:02:24 localhost kernel: [ 912894]     0 912894       52        0   393216       14         -1000 oom02
May 16 21:02:24 localhost kernel: [ 912895]     0 912895       52        0   393216       17         -1000 oom02
May 16 21:02:24 localhost kernel: [ 913199]     0 913199     4465       38   393216      122             0 sssd_be
May 16 21:02:24 localhost kernel: [ 913201]   997 913201    39822       84   786432       78             0 polkitd
May 16 21:02:24 localhost kernel: [ 913290]     0 913290    37091       82   720896      395             0 Xorg
May 16 21:02:24 localhost kernel: [ 913299]     0 913299     4773       41   327680       85             0 sssd_nss
May 16 21:02:24 localhost kernel: [ 913493]     0 913493    17180      313   524288      379             0 tuned
May 16 21:02:24 localhost kernel: [ 913497]     0 913497     6504        2   458752      234             0 udisksd
May 16 21:02:24 localhost kernel: [ 913866]   987 913866      298        0   327680       43             0 dbus-launch
May 16 21:02:24 localhost kernel: [ 913867]     0 913867     7593       23   458752      154             0 NetworkManager
May 16 21:02:24 localhost kernel: [ 913893]   987 913893      475        0   327680       47             0 dbus-daemon
May 16 21:02:24 localhost kernel: [ 914005]   987 914005    10302        1   458752      778             0 onboard
May 16 21:02:24 localhost kernel: [ 914008]   987 914008     4825        0   393216       63             0 at-spi-bus-laun
May 16 21:02:24 localhost kernel: [ 914013]   987 914013      473        0   393216       52             0 dbus-daemon
May 16 21:02:24 localhost kernel: [ 914139]     0 914139  3104755  1831937 18481152       75             0 oom02
May 16 21:02:24 localhost kernel: Out of memory: Kill process 914139 (oom02) score 712 or sacrifice child
May 16 21:02:24 localhost kernel: Killed process 914139 (oom02) total-vm:198704320kB, anon-rss:117243072kB, file-rss:640kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: oom_reaper: reaped process 914139 (oom02), now anon-rss:117261376kB, file-rss:0kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: oom02 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0

First, the oom-killer kills oom02 and tries to reclaim its memory, but the reclaim fails because the memory is locked.

The following is the output of the trace logging I added to the kernel:

      oom_reaper-57    [007] ....   126.063581: __oom_reap_task_mm: gh: vma is anon:1048691, range=65536
      oom_reaper-57    [007] ....   126.063581: __oom_reap_task_mm: gh: vma is anon:1048691, range=196608
      oom_reaper-57    [007] ....   126.063582: __oom_reap_task_mm: gh: vma continue: 1056883, range:3221225472
      oom_reaper-57    [007] ....   126.063583: __oom_reap_task_mm: gh: vma is anon:112, range=65536
      oom_reaper-57    [007] ....   126.063584: __oom_reap_task_mm: gh: vma is anon:1048691, range=8388608

vma continue: 1056883, range:3221225472 is the memory that cannot be reclaimed. 1056883 (0x102073) is vma->vm_flags; it has the `VM_LOCKED` flag set, indicating that the memory is locked and cannot be reclaimed. It will be released when it is no longer used.
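
To make the flag check above easier to reproduce, here is a minimal sketch that decodes the `vm_flags` values from the trace. The bit values are taken from `include/linux/mm.h` as I understand them; they are an assumption and should be verified against the kernel tree in use. The helper name `decode_vm_flags` is hypothetical.

```python
# Bit values assumed from include/linux/mm.h; verify against your kernel tree.
VM_FLAG_NAMES = {
    0x00000001: "VM_READ",
    0x00000002: "VM_WRITE",
    0x00000004: "VM_EXEC",
    0x00000008: "VM_SHARED",
    0x00000010: "VM_MAYREAD",
    0x00000020: "VM_MAYWRITE",
    0x00000040: "VM_MAYEXEC",
    0x00000080: "VM_MAYSHARE",
    0x00002000: "VM_LOCKED",
    0x00100000: "VM_ACCOUNT",
}


def decode_vm_flags(flags):
    """Return the names of the known bits that are set in a vm_flags value."""
    return [name for bit, name in VM_FLAG_NAMES.items() if flags & bit]


# The two distinct large values seen in the trace above.
for value in (1048691, 1056883):
    print(hex(value), decode_vm_flags(value))
```

Of the two values, only 1056883 (0x102073) decodes to a list containing `VM_LOCKED`; 1048691 (0x100073) does not, which matches the regions the trace reports as reaped.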

Next, the oom-killer tries to kill other processes to free memory. Unfortunately, runltp was among the processes killed:

May 16 21:02:24 localhost kernel: [ 914008]   987 914008     4825        0   393216       65             0 at-spi-bus-laun
May 16 21:02:24 localhost kernel: [ 914013]   987 914013      473        0   393216       52             0 dbus-daemon
May 16 21:02:24 localhost kernel: [ 914015]   987 914015     2583        0   458752       78             0 at-spi2-registr
May 16 21:02:24 localhost kernel: [ 914139]     0 914139  3104755  1832837 18481152        0             0 oom02
May 16 21:02:24 localhost kernel: Out of memory: Kill process 913199 (sssd_be) score 0 or sacrifice child
May 16 21:02:24 localhost kernel: Killed process 913199 (sssd_be) total-vm:285760kB, anon-rss:640kB, file-rss:0kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: oom_reaper: reaped process 913199 (sssd_be), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: Out of memory: Kill process 912518 (sssd) score 0 or sacrifice child
May 16 21:02:24 localhost kernel: Killed process 912624 (sssd_pam) total-vm:272192kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: oom_reaper: reaped process 912624 (sssd_pam), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: Out of memory: Kill process 912518 (sssd) score 0 or sacrifice child

... // many processes killed

May 16 21:02:24 localhost kernel: Out of memory: Kill process 98679 (runltp) score 0 or sacrifice child
May 16 21:02:24 localhost kernel: Killed process 98867 (ltp-pan) total-vm:3584kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: oom_reaper: reaped process 98867 (ltp-pan), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: Out of memory: Kill process 98679 (runltp) score 0 or sacrifice child
May 16 21:02:24 localhost kernel: Killed process 98679 (runltp) total-vm:227136kB, anon-rss:0kB, file-rss:128kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: oom_reaper: reaped process 98679 (runltp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 16 21:02:24 localhost kernel: Out of memory: Kill process 1755 (atd) score 0 or sacrifice child
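
For anyone comparing their own logs, the kill order can be pulled out of a saved dmesg capture with a small script. This is only a sketch: it assumes the `Killed process <pid> (<name>)` line format shown above, and the function name `killed_processes` is hypothetical.

```python
import re

# Matches the "Killed process <pid> (<name>)" lines in the log format above.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def killed_processes(dmesg_text):
    """Return (pid, name) tuples in the order the kernel reported the kills."""
    return [(int(pid), name) for pid, name in KILLED_RE.findall(dmesg_text)]
```

Calling `killed_processes(open("dmesg.txt").read())` on a saved log (the file name is a placeholder) lists the victims in order, which makes it easy to see whether runltp or ltp-pan were among them.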

oom02 sets the oom_score_adj of the parent oom02 process to -1000 to prevent it from being killed by the oom-killer, and sets the oom_score_adj of the child oom02 process to 0.

So, should we set the default oom_score_adj of runltp to -1000 too?
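
As a rough way to try this idea out by hand, the value can be written to `/proc/<pid>/oom_score_adj`. The sketch below is not how runltp or LTP does it; the function name `set_oom_score_adj` and the `proc_root` parameter (there only so the code can be exercised against a scratch directory) are hypothetical. Two things to keep in mind: lowering the value normally requires root (CAP_SYS_RESOURCE), and child processes inherit oom_score_adj on fork, so tests started by runltp would also be protected unless they reset it, as oom02 does for its child.

```python
import os

# Valid range for oom_score_adj; -1000 exempts the process from the oom-killer.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write value to <proc_root>/<pid>/oom_score_adj and return the path."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))
    return path
```

For example, `set_oom_score_adj(98679, -1000)` run as root would protect the runltp process from the log above (the PID is specific to that run).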

gouhao avatar May 19 '23 06:05 gouhao