ZTS: OOM in raidz_002_pos

Open tonyhutter opened this issue 1 year ago • 2 comments

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Fedora |
| Distribution Version | 40 |
| Kernel Version | 6.10 |
| Architecture | x86_64 |
| OpenZFS Version | |

Describe the problem you're observing

Using the new GitHub runners, we're seeing an occasional OOM in functional/raidz/raidz_002_pos. The OOM killer is killing the raidz_test program:

Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_002_pos (run as root) [03:30] [FAIL]
08:41:42.14 /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_002_pos.ksh[49]: log_must[70]: log_pos: line 265: 918355: Killed
08:41:42.14 20/176... 40/165... 60/165... 80/165... 100/165... 120/165... ERROR: raidz_test -S -e -t 300 exited 265

raidz_test had allocated 5.3GB of RAM:

Out of memory: Killed process 918355 (raidz_test) total-vm:13275572kB, anon-rss:5306400kB, file-rss:56kB, shmem-rss:0kB, UID:0 pgtables:24564kB oom_score_adj:0
 [ 7605.935208] systemd-userdbd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
  [ 7605.938835] CPU: 1 PID: 708 Comm: systemd-userdbd Tainted: P           OE      6.10.10-200.fc40.x86_64 #1
  [ 7605.941634] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [ 7605.944347] Call Trace:
  [ 7605.945197]  <TASK>
  [ 7605.945978]  dump_stack_lvl+0x5d/0x80
  [ 7605.947381]  dump_header+0x44/0x18d
  [ 7605.948634]  oom_kill_process.cold+0xa/0xaa
  [ 7605.949964]  out_of_memory+0x219/0x4b0
  [ 7605.951262]  __alloc_pages_slowpath.constprop.0+0xb4e/0xe00
  [ 7605.953023]  __alloc_pages_noprof+0x31f/0x350
  [ 7605.954412]  alloc_pages_mpol_noprof+0xd7/0x1e0
  [ 7605.955867]  ? __filemap_get_folio+0x37/0x2e0
  [ 7605.957254]  vma_alloc_folio_noprof+0x63/0xc0
  [ 7605.958667]  ? __swap_duplicate+0xdb/0x190
  [ 7605.960007]  do_swap_page+0x4a9/0xd60
  [ 7605.961215]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 7605.962768]  ? __handle_mm_fault+0x829/0x1080
  [ 7605.964150]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 7605.965656]  ? __pte_offset_map+0x1b/0x180
  [ 7605.966971]  __handle_mm_fault+0x829/0x1080
  [ 7605.968335]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 7605.969820]  ? mt_find+0x21c/0x580
  [ 7605.971016]  handle_mm_fault+0xf0/0x300
  [ 7605.972239]  do_user_addr_fault+0x15d/0x620
  [ 7605.973660]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 7605.975112]  ? asm_exc_page_fault+0x26/0x30
  [ 7605.976458]  exc_page_fault+0x7e/0x180
  [ 7605.977673]  asm_exc_page_fault+0x26/0x30
  [ 7605.978963] RIP: 0010:__get_user_8+0x11/0x20
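The 5.3GB figure quoted above is read from the anon-rss field of the "Out of memory: Killed process" line. As a minimal sketch of that arithmetic, the snippet below pulls the kB-valued fields out of that line and converts them. The helper names `parse_oom_line` and `kb_to_gib` are hypothetical, and the conversion assumes the kernel's "kB" means units of 1024 bytes.

```python
import re

# The OOM-killer summary line, copied from the log above.
OOM_LINE = (
    "Out of memory: Killed process 918355 (raidz_test) "
    "total-vm:13275572kB, anon-rss:5306400kB, file-rss:56kB, "
    "shmem-rss:0kB, UID:0 pgtables:24564kB oom_score_adj:0"
)


def parse_oom_line(line):
    """Return every 'name:<number>kB' field of the line as {name: int}."""
    return {name: int(value)
            for name, value in re.findall(r"([\w-]+):(\d+)kB", line)}


def kb_to_gib(kb):
    """Convert kB to GiB, assuming 1 kB = 1024 bytes (an assumption)."""
    return kb / (1024 * 1024)


fields = parse_oom_line(OOM_LINE)
for name in ("total-vm", "anon-rss"):
    print(f"{name}: {fields[name]} kB = {kb_to_gib(fields[name]):.2f} GiB")
```

Under that assumption, anon-rss works out to roughly 5.06 GiB of resident anonymous memory, against about 12.66 GiB of total virtual address space.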

Full examples: https://github.com/openzfs/zfs/actions/runs/10978174081/job/30481019124 https://github.com/openzfs/zfs/actions/runs/10998799735/job/30537538603

tonyhutter, Sep 24 '24 18:09

The raidz_002_pos failure also happens on FreeBSD and Ubuntu. It sometimes occurs even on VMs with 12 GB of RAM :/

mcmilk, Oct 06 '24 15:10

I've just manually run ./raidz_test -S -e -t 300 on FreeBSD and observed it gradually consuming >106GB of RAM before completing successfully. I bet something is leaking inside the loops, but I haven't found what yet.

amotin, Oct 17 '24 20:10
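For anyone wanting to repeat the manual measurement from the last comment, here is a rough sketch of one way to record a command's peak resident memory. It is not part of ZTS. The helper name `peak_rss_kb` is hypothetical, and it assumes a Unix system where `ru_maxrss` is reported in kilobytes (as on Linux and FreeBSD; macOS reports bytes). The figure is approximate and only shows the peak, not the gradual growth over time.

```python
import os
import sys


def peak_rss_kb(cmd):
    """Run cmd (a list of strings), wait for it, and return
    (exit_code, peak_rss) using the rusage the kernel reports for it.

    peak_rss is ru_maxrss: kilobytes on Linux/FreeBSD (assumption:
    other platforms may use different units).
    """
    pid = os.spawnvp(os.P_NOWAIT, cmd[0], cmd)   # start the child
    _, status, usage = os.wait4(pid, 0)          # reap it and get its rusage
    return os.waitstatus_to_exitcode(status), usage.ru_maxrss


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python3 peak_rss.py ./raidz_test -S -e -t 300
    code, rss = peak_rss_kb(sys.argv[1:])
    print(f"exit code {code}, peak RSS {rss} kB")
```

If the process is killed by a signal, as in the OOM case above, `os.waitstatus_to_exitcode` returns the negative signal number rather than an exit code.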