
Continuously increasing memory consumption for fio when using a verify job

Open chamarthy opened this issue 5 years ago • 7 comments

We are facing an issue where, while running the job below, fio's memory consumption keeps increasing until we eventually hit an OOM condition on CentOS 7.x. We were writing I/O to a 20T LUN with compression enabled. We have observed the same issue on VMs regardless of memory configuration.

# fio --version
fio-3.13-22-gd9c50

JOB:

[global]
ioengine=libaio
exitall_on_error=1
invalidate=1
direct=1 
allow_file_create=0
refill_buffers=1
bs=8k
rw=randrw
rwmixread=50
rwmixwrite=50
group_reporting
verify=crc32c
do_verify=1
verify_fatal=1
verify_dump=1
iodepth_batch_submit=2
iodepth_low=16
iodepth=32

[mpatha-20T]
filename=/dev/mapper/mpatha
size=20T
buffer_pattern=0x44e5bbac
buffer_compress_percentage=70
buffer_compress_chunk=3k

PMAP:

# while true; do pmap -x 11139 | tail -1; sleep 5; done
total kB         1270484  596572  596168
total kB         1273652  599684  599280
total kB         1276952  602944  602540
total kB         1280912  606904  606500
total kB         1284080  610136  609732
total kB         1287248  613316  612912
total kB         1290680  616672  616268
total kB         1294244  620228  619824
total kB         1297544  623624  623220
total kB         1300976  627048  626644
total kB         1304276  630352  629948
total kB         1307708  633684  633280
total kB         1311140  637140  636736
total kB         1314704  640688  640284
total kB         1317872  643976  643572
total kB         1321436  647472  647068
total kB         1324868  650876  650472
total kB         1328168  654212  653808
total kB         1331600  657576  657172
total kB         1334900  660940  660536
total kB         1338596  664576  664172
total kB         1342028  668084  667680
total kB         1345460  671532  671128
total kB         1348760  674788  674384
total kB         1352588  678596  678192
total kB         1356152  682152  681748

TOP:

top - 05:38:30 up 22 min,  4 users,  load average: 0.96, 0.83, 0.63
Tasks: 282 total,   3 running, 279 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.0 us, 18.4 sy,  0.0 ni, 57.7 id,  0.0 wa,  0.0 hi, 13.8 si,  0.0 st
KiB Mem :  1863224 total,    66852 free,  1230260 used,   566112 buff/cache
KiB Swap:  1679356 total,  1345020 free,   334336 used.    31024 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 11139 root      20   0 1466764 792628    468 R 19.9 42.5   2:12.17 fio
 11330 root      20   0       0      0      0 S  6.6  0.0   0:09.77 kworker/u256:2

chamarthy avatar Mar 21 '19 09:03 chamarthy

@chamarthy Great report! Can you try the experimental_verify option?

sitsofe avatar Mar 21 '19 09:03 sitsofe
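
For reference, a minimal sketch of the reported job with that option added; trimming the option list and placing it in [global] are assumptions for illustration, not a tested configuration:

[global]
ioengine=libaio
direct=1
bs=8k
rw=randrw
rwmixread=50
rwmixwrite=50
iodepth=32
verify=crc32c
verify_fatal=1
# regenerate the expected data at verify time instead of queuing
# per-I/O verify state, which is what the suggestion above targets
experimental_verify=1

[mpatha-20T]
filename=/dev/mapper/mpatha
size=20T
buffer_pattern=0x44e5bbac
buffer_compress_percentage=70
buffer_compress_chunk=3k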

Will verify and let you know.

chamarthy avatar Mar 22 '19 07:03 chamarthy

@chamarthy any news?

sitsofe avatar Apr 19 '19 09:04 sitsofe

The first run, without experimental_verify, grew to around 16GB of memory usage on a 3TB file (I canceled it with 9 minutes estimated remaining). The run with experimental_verify is still going (1h 24m remaining), but so far I'm not seeing any growth in memory usage; fio is using only around 51MB.

bcran avatar May 06 '19 19:05 bcran

The run with experimental_verify finished, and I didn't see the memory usage go above 51MB.

bcran avatar May 06 '19 21:05 bcran

OK, the results reported by @bcran are roughly what we would expect. Non-experimental verify keeps extending a data structure with the I/Os to verify, whereas experimental verify just regenerates what is required at verification time. Off the top of my head, the only way I can see experimental verify going wrong is when one I/O collides with another I/O for the same region while both are in flight, and I'd guess that can only happen if you aren't using a random map or at wraparound time.

sitsofe avatar May 30 '19 07:05 sitsofe
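
To make that caveat concrete, a hedged sketch of the relevant settings; the comments are my reading of the explanation above, not anything verified against fio's source, and the norandommap line is a hypothetical variant shown only as the case to avoid:

[global]
# experimental verify regenerates the expected data at verify time, so no
# per-I/O verify state accumulates over the run
experimental_verify=1
# hypothetical risky variant: do NOT combine this with norandommap=1, since
# without the random map the same region can be picked twice while both
# I/Os are still in flight (the collision case described above)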

I just encountered the same issue; it can actually be avoided by using the "verify_backlog" option.

AndCycle avatar Mar 23 '22 09:03 AndCycle
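
As a hedged sketch only, this is how verify_backlog might be added to the original job; the value 1024 is an arbitrary illustration, not a recommendation from this thread:

[global]
verify=crc32c
verify_fatal=1
# verify after every 1024 blocks written instead of deferring all
# verification to the end of the run, so the backlog of pending
# verifies (and its memory) stays bounded
verify_backlog=1024
# how many of the backlogged blocks to verify in one pass (assumed here
# to match the backlog size)
verify_backlog_batch=1024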