zdb memory usage
How much memory should be provided to zdb to prevent it from dying? It got OOM-killed while zstor was writing 20 GB to it.
Using zdb v2.0.2 (static build) I could upload 34 GB of data (116,850 entries) and zdb was using 139 MB of memory.
What's your limit?
Maybe there is a memory issue somewhere outside normal operation. This test was made while uploading chunks for the hub. It's possible that some memory issue occurs with INDEX DIRTY. I'll do more tests with zstor.
The limit is 256MB on zos, and the container (on which the limit is applied) only runs the zdb process.
It's possible that it's a cache problem and that the memory is accounted as page cache rather than allocated directly by zdb. From the OOM logs, the vm (virtual memory) used is about ~143MB, but the "file" metric is ~281MB. The cgroup of the container running zdb reports the following:
memory.usage_in_bytes = 315551744 # ~300MB
memory.limit_in_bytes = 315621376 # ~300MB
memory.stat = "
...
total_cache 299261952 # 285MB
total_rss 9056256 # 8MB
...
"
From the cgroup memory stats documentation (section 5.5), usage is rss + cache (the usage_in_bytes file is an approximation).
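For cross-checking on the node, here is a minimal Python sketch that reads the same cgroup v1 files and prints usage next to rss + cache; it assumes the v1 memory controller is mounted at /sys/fs/cgroup/memory, and the /zdb/<container-id> group path is a placeholder, not the node's real one:

# Sketch: compare memory.usage_in_bytes with total_rss + total_cache for the zdb cgroup.
# The cgroup path below is a placeholder; cgroup v1 is assumed mounted at /sys/fs/cgroup/memory.
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/memory/zdb/<container-id>")

def read_int(name):
    return int((CGROUP / name).read_text().strip())

def read_stat():
    # memory.stat is a list of "key value" lines
    return {k: int(v) for k, v in
            (line.split() for line in (CGROUP / "memory.stat").read_text().splitlines())}

usage = read_int("memory.usage_in_bytes")
limit = read_int("memory.limit_in_bytes")
stat = read_stat()

# Per the cgroup v1 memory docs (section 5.5), usage ~= rss + cache,
# and usage_in_bytes is only an approximate value.
print(f"usage_in_bytes: {usage / 2**20:7.1f} MiB (limit {limit / 2**20:.1f} MiB)")
print(f"total_rss:      {stat['total_rss'] / 2**20:7.1f} MiB")
print(f"total_cache:    {stat['total_cache'] / 2**20:7.1f} MiB")
print(f"rss + cache:    {(stat['total_rss'] + stat['total_cache']) / 2**20:7.1f} MiB")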
OOM logs:
[5531327.506177] zdb invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=1
[5531327.506425] memory: usage 308224kB, limit 308224kB, failcnt 3270355
[5531327.506427] memory+swap: usage 308224kB, limit 9007199254740988kB, failcnt 0
[5531327.506429] kmem: usage 1952kB, limit 9007199254740988kB, failcnt 0
[5531327.506430] Memory cgroup stats for /zdb/***:
[5531327.506452] anon 16760832
file 296288256
kernel_stack 49152
percpu 0
sock 0
shmem 0
file_mapped 135168
file_dirty 295342080
file_writeback 0
anon_thp 0
inactive_anon 16760832
active_anon 0
inactive_file 179277824
active_file 115965952
unevictable 0
slab_reclaimable 655664
slab_unreclaimable 0
slab 655664
workingset_refault_anon 0
workingset_refault_file 7194
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 125499
pgfault 761112
pgmajfault 0
pgrefill 2668220
pgscan 15364605
pgsteal 8869680
pgactivate 4764903
pgdeactivate 2668220
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
[5531327.506454] Tasks state (memory values in pages):
[5531327.506455] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[5531327.506459] [ 27241] 0 27241 35271 4433 323584 0 1 zdb
[5531327.506461] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=***,mems_allowed=0-1,oom_memcg=/zdb/***,task_memcg=/zdb/476c2c73-5ac3-4f8e-a07a-2ff21d4f8113,task=zdb,pid=27241,uid=0
[5531327.506475] Memory cgroup out of memory: Killed process 27241 (zdb) total-vm:141084kB, anon-rss:16708kB, file-rss:1024kB, shmem-rss:0kB, UID:0 pgtables:316kB oom_score_adj:1
[5531327.508936] oom_reaper: reaped process 27241 (zdb), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
I also think the buf/cache can be freed if needed, because it is only used by the kernel to improve file-access performance. If memory is needed, cached data will be flushed to disk and the pages handed to the process, not the other way around.
Some things to investigate:
- Does that still happen if you reduce datasize to, let's say, 32M? By default, datafiles are split at 256M.
- Can you try the development-v2-hashmem branch, which does not allocate a static 128M buffer?
hashmem also OOMs. With 32M as datasize the zdb process restarts occasionally, but not because of OOM (no dmesg logs, and the container logs are truncated; segfault?).
Not reproducible for me locally at all: I started the zos vm on an HDD and the usage (rss + cache) always oscillates a little below the limit, just like on the nodes, but it never gets killed.
Stracing the original zdb process while it's operating, to check what it's doing at the end, causes it not to be killed by the OOM killer!
I tried to make another image which pushes its logs to files so they don't get truncated. The store-command fulfillment rate became much slower and the processes piled up, causing zstor in the qsfs container to be OOMed, but the zdb container didn't misbehave at all (except for being slow?). Chunking the writes to cope with the slow rate didn't result in any OOM or crash.
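Chunking here means something along the lines of this sketch; the qsfs mountpoint, file name, chunk size and total size are placeholders, not the actual test parameters:

# Sketch: write a large file through the qsfs mount in fixed-size chunks,
# syncing between chunks so the slow store rate can keep up instead of piling writes up.
# Mountpoint, sizes and file name below are illustrative only.
import os

MOUNT = "/mnt/qsfs"        # placeholder qsfs mountpoint
CHUNK = 4 * 2**20          # 4 MiB per write
TOTAL = 20 * 2**30         # 20 GiB, as in the original report

with open(os.path.join(MOUNT, "testfile"), "wb") as f:
    written = 0
    while written < TOTAL:
        f.write(os.urandom(CHUNK))
        f.flush()
        os.fsync(f.fileno())   # push the chunk down before queueing the next one
        written += CHUNK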
OOM logs with stacktrace and registers (for hashmem):
[6643776.606669] zdb invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=1
[6643776.606675] CPU: 10 PID: 1891 Comm: zdb Not tainted 5.10.55-Zero-OS #2
[6643776.606678] Call Trace:
[6643776.606690] dump_stack+0x6d/0x88
[6643776.606696] dump_header+0x4a/0x220
[6643776.606699] oom_kill_process+0xe4/0x140
[6643776.606703] out_of_memory+0x113/0x520
[6643776.606708] mem_cgroup_out_of_memory+0xdf/0x100
[6643776.606712] try_charge+0x737/0x770
[6643776.606715] mem_cgroup_charge+0x85/0x280
[6643776.606719] __add_to_page_cache_locked+0x3bf/0x440
[6643776.606725] ? scan_shadow_nodes+0x30/0x30
[6643776.606727] add_to_page_cache_lru+0x4a/0xc0
[6643776.606730] pagecache_get_page+0xff/0x2d0
[6643776.606778] ? btrfs_csum_bytes_to_leaves+0x24/0x60 [btrfs]
[6643776.606805] prepare_pages+0xf5/0x240 [btrfs]
[6643776.606831] btrfs_buffered_write.isra.33+0x231/0x700 [btrfs]
[6643776.606855] btrfs_file_write_iter+0x4ac/0x650 [btrfs]
[6643776.606861] new_sync_write+0x11f/0x1b0
[6643776.606865] vfs_write+0x1fd/0x260
[6643776.606868] ksys_write+0xa1/0xe0
[6643776.606871] ? __x64_sys_gettimeofday+0x6e/0xc0
[6643776.606874] do_syscall_64+0x33/0x40
[6643776.606876] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[6643776.606880] RIP: 0033:0x4843c1
[6643776.606885] Code: Unable to access opcode bytes at RIP 0x484397.
[6643776.606887] RSP: 002b:00007ffe99a807b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[6643776.606891] RAX: ffffffffffffffda RBX: 00000000016f6544 RCX: 00000000004843c1
[6643776.606892] RDX: 00000000001f1111 RSI: 0000000001f38bc0 RDI: 0000000000000014
[6643776.606894] RBP: 00007ffe99a807f0 R08: 0000000000000000 R09: 0000000000000029
[6643776.606895] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000041b820
[6643776.606897] R13: 0000000000000000 R14: 0000000000722018 R15: 0000000000000000
[6643776.606901] memory: usage 308224kB, limit 308224kB, failcnt 1921778
[6643776.606902] memory+swap: usage 308224kB, limit 9007199254740988kB, failcnt 0
[6643776.606904] kmem: usage 3820kB, limit 9007199254740988kB, failcnt 0
[6643776.606905] Memory cgroup stats for /zdb/***:
[6643776.606922] anon 13246464
file 297775104
kernel_stack 49152
percpu 0
sock 0
shmem 0
file_mapped 135168
file_dirty 295071744
file_writeback 270336
anon_thp 0
inactive_anon 13516800
active_anon 0
inactive_file 180531200
active_file 116031488
unevictable 0
slab_reclaimable 3411016
slab_unreclaimable 0
slab 3411016
workingset_refault_anon 0
workingset_refault_file 363
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 56628
pgfault 1969704
pgmajfault 0
pgrefill 2307585
pgscan 10195449
pgsteal 5283649
pgactivate 3173676
pgdeactivate 2307166
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
[6643776.606924] Tasks state (memory values in pages):
[6643776.606925] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[6643776.606928] [ 1891] 0 1891 4009 3338 61440 0 1 zdb
[6643776.606930] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=***,mems_allowed=0-1,oom_memcg=/zdb/***,task_memcg=/zdb/***,task=zdb,pid=1891,uid=0
[6643776.606944] Memory cgroup out of memory: Killed process 1891 (zdb) total-vm:16036kB, anon-rss:13348kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:1
[6643776.609605] oom_reaper: reaped process 1891 (zdb), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
If I understand correctly from the logs, it failed because of a write syscall (with 2MB data?), and the total-vm indicates the process part is just ~16MB.
Thanks!
It's possible to get a segfault on hashmem, it's still in the testing stage. Did you try to reproduce it with cgroup memory limits?
There is a specific sync call on file rotation. I don't know if the stacktrace is talking about a write or a sync; it's btrfs related. I also don't know exactly what is taken into account when computing memory usage; if storage cache (e.g. the btrfs cache) is counted as well, that will be a problem, and we would need some btrfs locally to compare against.
The zdb that restarts occasionally is the latest version, with 32MB datasize (not sure why, as the logs are truncated).
The memory limit on the node is ultimately set using cgroups (see the sketch after this list). But I wasn't able to reproduce the issue outside the node with:
- starting zdb inside a docker container with a memory limit and doing the same qsfs setup on top of it.
- making a btrfs filesystem (truncate -> mkfs -> mount inside the host -> mount into the container data dir) on SSD and HDD, and retrying.
- a zos vm on top of an HDD (it uses btrfs by default).
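For completeness, applying the limit directly with cgroup v1, roughly the way the node does, looks like this sketch; the group name, the 256MB value and the zdb arguments are illustrative, and it assumes root privileges with the v1 memory controller mounted at /sys/fs/cgroup/memory:

# Sketch: put a zdb process under a cgroup v1 memory limit by hand.
# Group name, limit and zdb arguments are illustrative; requires root and cgroup v1.
import subprocess
from pathlib import Path

cg = Path("/sys/fs/cgroup/memory/zdbtest")            # placeholder group name
cg.mkdir(exist_ok=True)                               # mkdir on cgroupfs creates the group
(cg / "memory.limit_in_bytes").write_text(str(256 * 2**20))  # 256MB, like the zos limit

proc = subprocess.Popen(["zdb", "--data", "/data", "--index", "/index"])  # illustrative args
(cg / "cgroup.procs").write_text(str(proc.pid))       # move the running zdb into the group

# memory charged after this point (rss and page cache for its writes) counts against the limit
print("zdb pid", proc.pid, "limited to",
      (cg / "memory.limit_in_bytes").read_text().strip(), "bytes")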
I guessed the syscall from the x86_64 syscall table; ORIG_RAX equals 1, which is write(2).
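For reference, decoding the registers from the hashmem OOM dump above (on x86_64 the syscall number is saved in ORIG_RAX and the first three write(2) arguments are in rdi/rsi/rdx):

# Sketch: decode the syscall from the saved registers of the hashmem OOM dump above.
ORIG_RAX = 0x1        # syscall 1 on x86_64 is write(2)
RDI = 0x14            # 1st argument: file descriptor
RSI = 0x1f38bc0       # 2nd argument: buffer address
RDX = 0x1f1111        # 3rd argument: byte count

print(f"syscall {ORIG_RAX} -> write(fd={RDI}, buf={RSI:#x}, count={RDX})")
print(f"count = {RDX} bytes ~= {RDX / 2**20:.2f} MiB")   # ~1.94 MiB, the '~2MB' mentioned above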
I edited the OOM logs above to add a missing line at the beginning containing gfp_mask; it might be useful.