zdb memory usage
How much memory should be provided to zdb to prevent it from dying? It got OOM-killed while zstor was writing 20 GB to it.
Using zdb v2.0.2 (static build) I could upload 34 GB of data (116,850 entries) and zdb was using 139 MB of memory.
What's your limit?
Maybe there is a memory issue somewhere outside normal operation. This test was made while uploading chunks for the hub. It's possible that some memory issue occurs with INDEX DIRTY. I'll do more tests with zstor.
The limit is 256MB on zos, and the container (on which the limit is applied) only runs the zdb process.
It's possible that it's a cache problem and that the memory is accounted as page cache rather than allocated directly by zdb. From the OOM logs, the vm (virtual memory) used is about ~143MB, but the "file" metric is ~281MB. The cgroup of the container running zdb reports the following:
memory.usage_in_bytes = 315551744 # ~300MB
memory.limit_in_bytes = 315621376 # ~300MB
memory.stat = "
...
total_cache 299261952 # 285MB
total_rss 9056256 # 8MB
...
"
From the cgroup memory stats documentation (section 5.5), usage is rss + cache (the usage_in_bytes file is an approximation).
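For cross-checking on the node, here is a minimal Python sketch that reads the same cgroup v1 files and prints usage next to rss + cache; it assumes the v1 memory controller is mounted at /sys/fs/cgroup/memory, and the /zdb/<container-id> group path is a placeholder, not the node's real one:

# Sketch: compare memory.usage_in_bytes with total_rss + total_cache for the zdb cgroup.
# The cgroup path below is a placeholder; cgroup v1 is assumed mounted at /sys/fs/cgroup/memory.
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/memory/zdb/<container-id>")

def read_int(name):
    return int((CGROUP / name).read_text().strip())

def read_stat():
    # memory.stat is a list of "key value" lines
    return {k: int(v) for k, v in
            (line.split() for line in (CGROUP / "memory.stat").read_text().splitlines())}

usage = read_int("memory.usage_in_bytes")
limit = read_int("memory.limit_in_bytes")
stat = read_stat()

# Per the cgroup v1 memory docs (section 5.5), usage ~= rss + cache,
# and usage_in_bytes is only an approximate value.
print(f"usage_in_bytes: {usage / 2**20:7.1f} MiB (limit {limit / 2**20:.1f} MiB)")
print(f"total_rss:      {stat['total_rss'] / 2**20:7.1f} MiB")
print(f"total_cache:    {stat['total_cache'] / 2**20:7.1f} MiB")
print(f"rss + cache:    {(stat['total_rss'] + stat['total_cache']) / 2**20:7.1f} MiB")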
OOM logs:
[5531327.506177] zdb invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=1
[5531327.506425] memory: usage 308224kB, limit 308224kB, failcnt 3270355
[5531327.506427] memory+swap: usage 308224kB, limit 9007199254740988kB, failcnt 0
[5531327.506429] kmem: usage 1952kB, limit 9007199254740988kB, failcnt 0
[5531327.506430] Memory cgroup stats for /zdb/***:
[5531327.506452] anon 16760832
file 296288256
kernel_stack 49152
percpu 0
sock 0
shmem 0
file_mapped 135168
file_dirty 295342080
file_writeback 0
anon_thp 0
inactive_anon 16760832
active_anon 0
inactive_file 179277824
active_file 115965952
unevictable 0
slab_reclaimable 655664
slab_unreclaimable 0
slab 655664
workingset_refault_anon 0
workingset_refault_file 7194
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 125499
pgfault 761112
pgmajfault 0
pgrefill 2668220
pgscan 15364605
pgsteal 8869680
pgactivate 4764903
pgdeactivate 2668220
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
[5531327.506454] Tasks state (memory values in pages):
[5531327.506455] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[5531327.506459] [ 27241] 0 27241 35271 4433 323584 0 1 zdb
[5531327.506461] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=***,mems_allowed=0-1,oom_memcg=/zdb/***,task_memcg=/zdb/476c2c73-5ac3-4f8e-a07a-2ff21d4f8113,task=zdb,pid=27241,uid=0
[5531327.506475] Memory cgroup out of memory: Killed process 27241 (zdb) total-vm:141084kB, anon-rss:16708kB, file-rss:1024kB, shmem-rss:0kB, UID:0 pgtables:316kB oom_score_adj:1
[5531327.508936] oom_reaper: reaped process 27241 (zdb), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
I also think the buf/cache can be freed if needed, because it is only used by the kernel to improve file-access performance. If memory is needed, cached data will be flushed to disk and the pages handed to the process, not the other way around.
Some things to investigate:
- Does that still happen if you reduce datasize to, let's say, 32M? By default, datafiles are split at 256M.
- Can you try the development-v2-hashmem branch, which does not allocate a static 128M buffer?
hashmem also OOMs. With 32M as datasize the zdb process restarts occasionally, but not because of OOM (no dmesg logs, and the container logs are truncated; segfault?).
Not reproducible for me locally at all: I started the zos vm on an HDD and the usage (rss + cache) always oscillates a little below the limit, just like on the nodes, but it never gets killed.
Stracing the original zdb process while it's operating, to check what it's doing at the end, causes it not to be killed by the OOM killer!
I tried to make another image which pushes its logs to files so they don't get truncated. The store-command fulfillment rate became much slower and the processes piled up, causing zstor in the qsfs container to be OOMed, but the zdb container didn't misbehave at all (except for being slow?). Chunking the writes to cope with the slow rate didn't result in any OOM or crash.
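Chunking here means something along the lines of this sketch; the qsfs mountpoint, file name, chunk size and total size are placeholders, not the actual test parameters:

# Sketch: write a large file through the qsfs mount in fixed-size chunks,
# syncing between chunks so the slow store rate can keep up instead of piling writes up.
# Mountpoint, sizes and file name below are illustrative only.
import os

MOUNT = "/mnt/qsfs"        # placeholder qsfs mountpoint
CHUNK = 4 * 2**20          # 4 MiB per write
TOTAL = 20 * 2**30         # 20 GiB, as in the original report

with open(os.path.join(MOUNT, "testfile"), "wb") as f:
    written = 0
    while written < TOTAL:
        f.write(os.urandom(CHUNK))
        f.flush()
        os.fsync(f.fileno())   # push the chunk down before queueing the next one
        written += CHUNK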
OOM logs with stacktrace and registers (for hashmem):
[6643776.606669] zdb invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=1
[6643776.606675] CPU: 10 PID: 1891 Comm: zdb Not tainted 5.10.55-Zero-OS #2
[6643776.606678] Call Trace:
[6643776.606690] dump_stack+0x6d/0x88
[6643776.606696] dump_header+0x4a/0x220
[6643776.606699] oom_kill_process+0xe4/0x140
[6643776.606703] out_of_memory+0x113/0x520
[6643776.606708] mem_cgroup_out_of_memory+0xdf/0x100
[6643776.606712] try_charge+0x737/0x770
[6643776.606715] mem_cgroup_charge+0x85/0x280
[6643776.606719] __add_to_page_cache_locked+0x3bf/0x440
[6643776.606725] ? scan_shadow_nodes+0x30/0x30
[6643776.606727] add_to_page_cache_lru+0x4a/0xc0
[6643776.606730] pagecache_get_page+0xff/0x2d0
[6643776.606778] ? btrfs_csum_bytes_to_leaves+0x24/0x60 [btrfs]
[6643776.606805] prepare_pages+0xf5/0x240 [btrfs]
[6643776.606831] btrfs_buffered_write.isra.33+0x231/0x700 [btrfs]
[6643776.606855] btrfs_file_write_iter+0x4ac/0x650 [btrfs]
[6643776.606861] new_sync_write+0x11f/0x1b0
[6643776.606865] vfs_write+0x1fd/0x260
[6643776.606868] ksys_write+0xa1/0xe0
[6643776.606871] ? __x64_sys_gettimeofday+0x6e/0xc0
[6643776.606874] do_syscall_64+0x33/0x40
[6643776.606876] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[6643776.606880] RIP: 0033:0x4843c1
[6643776.606885] Code: Unable to access opcode bytes at RIP 0x484397.
[6643776.606887] RSP: 002b:00007ffe99a807b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[6643776.606891] RAX: ffffffffffffffda RBX: 00000000016f6544 RCX: 00000000004843c1
[6643776.606892] RDX: 00000000001f1111 RSI: 0000000001f38bc0 RDI: 0000000000000014
[6643776.606894] RBP: 00007ffe99a807f0 R08: 0000000000000000 R09: 0000000000000029
[6643776.606895] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000041b820
[6643776.606897] R13: 0000000000000000 R14: 0000000000722018 R15: 0000000000000000
[6643776.606901] memory: usage 308224kB, limit 308224kB, failcnt 1921778
[6643776.606902] memory+swap: usage 308224kB, limit 9007199254740988kB, failcnt 0
[6643776.606904] kmem: usage 3820kB, limit 9007199254740988kB, failcnt 0
[6643776.606905] Memory cgroup stats for /zdb/***:
[6643776.606922] anon 13246464
file 297775104
kernel_stack 49152
percpu 0
sock 0
shmem 0
file_mapped 135168
file_dirty 295071744
file_writeback 270336
anon_thp 0
inactive_anon 13516800
active_anon 0
inactive_file 180531200
active_file 116031488
unevictable 0
slab_reclaimable 3411016
slab_unreclaimable 0
slab 3411016
workingset_refault_anon 0
workingset_refault_file 363
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 56628
pgfault 1969704
pgmajfault 0
pgrefill 2307585
pgscan 10195449
pgsteal 5283649
pgactivate 3173676
pgdeactivate 2307166
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
[6643776.606924] Tasks state (memory values in pages):
[6643776.606925] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[6643776.606928] [ 1891] 0 1891 4009 3338 61440 0 1 zdb
[6643776.606930] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=***,mems_allowed=0-1,oom_memcg=/zdb/***,task_memcg=/zdb/***,task=zdb,pid=1891,uid=0
[6643776.606944] Memory cgroup out of memory: Killed process 1891 (zdb) total-vm:16036kB, anon-rss:13348kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:1
[6643776.609605] oom_reaper: reaped process 1891 (zdb), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
If I understand correctly from the logs, it failed because of a write syscall (with 2MB data?), and the total-vm indicates the process part is just ~16MB.
Thanks!
It's possible to get a segfault on hashmem, it's still in the testing stage. Did you try to reproduce it with cgroup memory limits?
There is a specific sync call on file rotation. I don't know if the stacktrace is talking about a write or a sync; it's btrfs related. I also don't know exactly what is taken into account when computing memory usage; if storage cache (e.g. the btrfs cache) is counted as well, that will be a problem, and we would need some btrfs locally to compare against.
The zdb that restarts occasionally is the latest version, with 32MB datasize (not sure why, as the logs are truncated).
The memory limit on the node is ultimately set using cgroups (see the sketch after this list). But I wasn't able to reproduce the issue outside the node with:
- starting zdb inside a docker container with a memory limit and doing the same qsfs setup on top of it.
- making a btrfs filesystem (truncate -> mkfs -> mount inside the host -> mount into the container data dir) on SSD and HDD, and retrying.
- a zos vm on top of an HDD (it uses btrfs by default).
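For completeness, applying the limit directly with cgroup v1, roughly the way the node does, looks like this sketch; the group name, the 256MB value and the zdb arguments are illustrative, and it assumes root privileges with the v1 memory controller mounted at /sys/fs/cgroup/memory:

# Sketch: put a zdb process under a cgroup v1 memory limit by hand.
# Group name, limit and zdb arguments are illustrative; requires root and cgroup v1.
import subprocess
from pathlib import Path

cg = Path("/sys/fs/cgroup/memory/zdbtest")            # placeholder group name
cg.mkdir(exist_ok=True)                               # mkdir on cgroupfs creates the group
(cg / "memory.limit_in_bytes").write_text(str(256 * 2**20))  # 256MB, like the zos limit

proc = subprocess.Popen(["zdb", "--data", "/data", "--index", "/index"])  # illustrative args
(cg / "cgroup.procs").write_text(str(proc.pid))       # move the running zdb into the group

# memory charged after this point (rss and page cache for its writes) counts against the limit
print("zdb pid", proc.pid, "limited to",
      (cg / "memory.limit_in_bytes").read_text().strip(), "bytes")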
I guessed the syscall from the x86_64 syscall table; ORIG_RAX equals 1, which is write(2).
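For reference, decoding the registers from the hashmem OOM dump above (on x86_64 the syscall number is saved in ORIG_RAX and the first three write(2) arguments are in rdi/rsi/rdx):

# Sketch: decode the syscall from the saved registers of the hashmem OOM dump above.
ORIG_RAX = 0x1        # syscall 1 on x86_64 is write(2)
RDI = 0x14            # 1st argument: file descriptor
RSI = 0x1f38bc0       # 2nd argument: buffer address
RDX = 0x1f1111        # 3rd argument: byte count

print(f"syscall {ORIG_RAX} -> write(fd={RDI}, buf={RSI:#x}, count={RDX})")
print(f"count = {RDX} bytes ~= {RDX / 2**20:.2f} MiB")   # ~1.94 MiB, the '~2MB' mentioned above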
I edited the OOM logs above to add a missing line at the beginning containing gfp_mask; it might be useful.