ZTS: OOM in raidz_002_pos
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Fedora |
| Distribution Version | 40 |
| Kernel Version | 6.10 |
| Architecture | x86_64 |
| OpenZFS Version | |
Describe the problem you're observing
Using the new GitHub runners, we're seeing an occasional OOM in functional/raidz/raidz_002_pos. The OOM killer is killing the raidz_test program:
Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_002_pos (run as root) [03:30] [FAIL]
08:41:42.14 /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_002_pos.ksh[49]: log_must[70]: log_pos: line 265: 918355: Killed
08:41:42.14 20/176... 40/165... 60/165... 80/165... 100/165... 120/165... ERROR: raidz_test -S -e -t 300 exited 265
raidz_test had about 5.3GB of RAM resident (anon-rss) when it was killed:
Out of memory: Killed process 918355 (raidz_test) total-vm:13275572kB, anon-rss:5306400kB, file-rss:56kB, shmem-rss:0kB, UID:0 pgtables:24564kB oom_score_adj:0
[ 7605.935208] systemd-userdbd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[ 7605.938835] CPU: 1 PID: 708 Comm: systemd-userdbd Tainted: P OE 6.10.10-200.fc40.x86_64 #1
[ 7605.941634] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 7605.944347] Call Trace:
[ 7605.945197] <TASK>
[ 7605.945978] dump_stack_lvl+0x5d/0x80
[ 7605.947381] dump_header+0x44/0x18d
[ 7605.948634] oom_kill_process.cold+0xa/0xaa
[ 7605.949964] out_of_memory+0x219/0x4b0
[ 7605.951262] __alloc_pages_slowpath.constprop.0+0xb4e/0xe00
[ 7605.953023] __alloc_pages_noprof+0x31f/0x350
[ 7605.954412] alloc_pages_mpol_noprof+0xd7/0x1e0
[ 7605.955867] ? __filemap_get_folio+0x37/0x2e0
[ 7605.957254] vma_alloc_folio_noprof+0x63/0xc0
[ 7605.958667] ? __swap_duplicate+0xdb/0x190
[ 7605.960007] do_swap_page+0x4a9/0xd60
[ 7605.961215] ? srso_alias_return_thunk+0x5/0xfbef5
[ 7605.962768] ? __handle_mm_fault+0x829/0x1080
[ 7605.964150] ? srso_alias_return_thunk+0x5/0xfbef5
[ 7605.965656] ? __pte_offset_map+0x1b/0x180
[ 7605.966971] __handle_mm_fault+0x829/0x1080
[ 7605.968335] ? srso_alias_return_thunk+0x5/0xfbef5
[ 7605.969820] ? mt_find+0x21c/0x580
[ 7605.971016] handle_mm_fault+0xf0/0x300
[ 7605.972239] do_user_addr_fault+0x15d/0x620
[ 7605.973660] ? srso_alias_return_thunk+0x5/0xfbef5
[ 7605.975112] ? asm_exc_page_fault+0x26/0x30
[ 7605.976458] exc_page_fault+0x7e/0x180
[ 7605.977673] asm_exc_page_fault+0x26/0x30
[ 7605.978963] RIP: 0010:__get_user_8+0x11/0x20
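For comparing OOM reports across runs, the memory fields in the "Killed process" line above can be pulled out mechanically. This is only a rough sketch: `parse_oom_kb` is a hypothetical helper, not part of ZTS, and it assumes the log line has the same shape as the one quoted here. Note that the kernel's "kB" are units of 1024 bytes, so the figures are printed in GiB.

```python
import re

# The OOM-killer line quoted in this report.
OOM_LINE = (
    "Out of memory: Killed process 918355 (raidz_test) "
    "total-vm:13275572kB, anon-rss:5306400kB, file-rss:56kB, "
    "shmem-rss:0kB, UID:0 pgtables:24564kB oom_score_adj:0"
)


def parse_oom_kb(line):
    """Return {field: kB} for every 'name:<n>kB' pair in an OOM-killer line.

    Hypothetical helper for illustration; fields without a kB suffix
    (UID, oom_score_adj) are skipped.
    """
    return {name: int(kb) for name, kb in re.findall(r"([\w-]+):(\d+)kB", line)}


fields = parse_oom_kb(OOM_LINE)
for name, kb in fields.items():
    # Kernel "kB" values are KiB, so divide by 2**20 to get GiB.
    print(f"{name:>10}: {kb:>10} kB = {kb / 2**20:6.2f} GiB")
```

The anon-rss value is the one that matters for the OOM decision here; total-vm only reflects reserved address space.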
Full examples: https://github.com/openzfs/zfs/actions/runs/10978174081/job/30481019124 https://github.com/openzfs/zfs/actions/runs/10998799735/job/30537538603
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs
The raidz_002_pos failure also happens on FreeBSD and Ubuntu.
Even on VMs with 12 GB RAM the problem happens sometimes :/
I've just manually run ./raidz_test -S -e -t 300 on FreeBSD and observed it gradually consuming >106GB of RAM before completing successfully. I bet something is leaking inside the loops, but I haven't found what yet.
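To put numbers on the gradual growth described above, something like the following could sample the resident set size while the command runs. It is a minimal sketch, not an existing ZTS tool: `rss_kb` and `watch` are hypothetical names, and it assumes `ps -o rss= -p <pid>` is available, which should hold on both Linux and FreeBSD.

```python
import subprocess
import time


def rss_kb(pid):
    """Return the RSS of `pid` in kB as reported by ps(1), or None if unavailable."""
    try:
        out = subprocess.run(
            ["ps", "-o", "rss=", "-p", str(pid)],
            capture_output=True, text=True,
        ).stdout.strip()
    except OSError:
        # ps(1) is missing or could not be executed.
        return None
    return int(out) if out.isdigit() else None


def watch(cmd, interval=1.0):
    """Run `cmd`, sampling its RSS every `interval` seconds until it exits.

    Returns (exit status, list of RSS samples in kB).
    """
    samples = []
    proc = subprocess.Popen(cmd)
    while proc.poll() is None:
        kb = rss_kb(proc.pid)
        if kb is not None:
            samples.append(kb)
        time.sleep(interval)
    return proc.returncode, samples
```

For example, `rc, samples = watch(["./raidz_test", "-S", "-e", "-t", "300"], interval=5.0)` followed by printing `max(samples)` would show the peak; a steadily increasing series of samples would be consistent with a leak inside the loops.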