Issue with size and offset_increment Parameter Interaction

Open cosikng opened this issue 5 months ago • 6 comments

Hi team, I’m currently using fio version 3.40 for performance testing, and I’ve encountered some unexpected behavior when using the size and offset_increment parameters together. I couldn’t find an explanation in the documentation, so I’d appreciate some help.

Here’s the command I’m using:

fio --name=seq-write \
    --ioengine=libpmem \
    --direct=1 \
    --sync=1 \
    --bs=4096 \
    --filesize=2G \
    --size=$((2/2))G \
    --numjobs=2 \
    --offset_increment=1G \
    --cpus_allowed_policy=split \
    --thread \
    --rw=write \
    --filename=/mnt/pmem0/fiofile \
    --cpus_allowed=0-27

My goal is to have the two threads each write to half of a 2GB file: one to the first 1GB, and the other to the second 1GB. Based on the documentation, offset_increment seems like the correct parameter for this.
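
As a rough illustration of the intended layout: the fio documentation describes each sub-job's effective start offset as offset + offset_increment * job number, with the job number counting from 0. The sketch below only does that arithmetic. `job_range` is a hypothetical helper, not an fio command, and it assumes each job covers exactly `size` bytes.

```shell
#!/bin/sh
# Hypothetical helper (not part of fio): print "start end" for one sub-job.
# Assumes start = offset + offset_increment * job_index, job_index from 0,
# and that the job covers exactly `size` bytes (end is exclusive).
job_range() {
    # $1 = offset, $2 = offset_increment, $3 = size, $4 = job index
    start=$(( $1 + $2 * $4 ))
    end=$(( start + $3 ))
    echo "$start $end"
}

GiB=$(( 1024 * 1024 * 1024 ))
# Values from the command above: offset=0, offset_increment=1G, size=1G, numjobs=2
for job in 0 1; do
    echo "job $job: $(job_range 0 "$GiB" "$GiB" "$job")"
done
```

With these values the two ranges meet at the 1 GiB boundary and do not overlap, which matches the goal of one job per half of the 2 GiB file.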

However, when I first ran the command, I got this output:

Run status group 0 (all jobs):
  WRITE: bw=2107MiB/s (2209MB/s), 2107MiB/s-2107MiB/s (2209MB/s-2209MB/s), io=1024MiB (1074MB), run=486-486msec

It shows only 1GB of total I/O, and strangely, only one job seems to have been executed—even though I didn’t include group_reporting, so I expected output for both jobs.

When I ran the exact same command again, I got the expected result:

Run status group 0 (all jobs):
  WRITE: bw=4719MiB/s (4948MB/s), 2359MiB/s-2359MiB/s (2474MB/s-2474MB/s), io=2048MiB (2147MB), run=434-434msec

This time, I saw output for both jobs and a total I/O of 2GB.

I also noticed that if I manually create the file beforehand, I consistently get correct 2GB I/O results. So it seems the issue only occurs when fio creates the file for the first time.
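
The post does not say how the file was created by hand. One possible way, sketched here, is to size the file fully before fio runs, so that fio sees the real on-disk size. `presize_file` is a hypothetical helper, a temporary file stands in for the real target, and whether a sparse file is appropriate on a DAX mount is an assumption to verify.

```shell
#!/bin/sh
# Hypothetical helper: create (or resize) a sparse file of exactly $2 bytes.
# Uses the dd idiom of writing nothing (count=0) after seeking to the size.
presize_file() {
    # $1 = path, $2 = size in bytes
    dd if=/dev/zero of="$1" bs=1 count=0 seek="$2" 2>/dev/null
}

target=$(mktemp)                        # stand-in for the real fio target file
presize_file "$target" $(( 64 * 1024 )) # 64k, as in the reduced reproducer
echo "size: $(( $(wc -c < "$target") )) bytes"
rm -f "$target"
```

The fio command would then be run against the pre-sized path instead of a path that does not exist yet.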

This behavior seems incorrect to me. Why is only one thread/job running the first time the file is created, resulting in just 1GB of I/O? Is there a configuration I missed? I couldn’t find anything in the documentation to explain this. I also tried running with the --debug=fio,file option, but the output was extremely verbose and I wasn’t able to extract any useful information from it.

Thanks in advance for your help!

cosikng, Aug 05 '25 13:08

Hi @cosikng:

This does indeed sound strange... Can you reproduce the problem:

  • With the minimum number of options? For example can you remove cpus_allowed_policy, cpus_allowed, sync and still make the problem happen?
  • Does the problem happen every time when the file isn't present?
  • Can you make the problem happen with ioengines other than libpmem?
  • Can you reduce the amount of I/O you're doing (e.g. filesize=64k, size=32k, offset_increment=32k) and still make the problem happen?

If you're able to make it happen with less I/O it may be worth attaching the (debug) output as a text file for further investigation.

sitsofe, Aug 05 '25 15:08

Hi @sitsofe, as you suggested, I removed flags like direct, sync, and cpus_allowed, and also reduced the access size. The updated test command is:

fio --name=seq-write \
    --ioengine=libpmem \
    --bs=64 \
    --filesize=64k \
    --size=32k \
    --numjobs=2 \
    --offset_increment=32k \
    --thread \
    --rw=write \
    --filename=/mnt/pmem1/bugtest

However, the issue still consistently appears. On the first run, the output is:

Run status group 0 (all jobs):
  WRITE: bw=31.2MiB/s (32.8MB/s), 31.2MiB/s-31.2MiB/s (32.8MB/s-32.8MB/s), io=32.0KiB (32.8kB), run=1-1msec

And on the second run, I get:

Run status group 0 (all jobs):
  WRITE: bw=31.2MiB/s (32.8MB/s), 15.6MiB/s-15.6MiB/s (16.4MB/s-16.4MB/s), io=64.0KiB (65.5kB), run=2-2msec

I also tried other I/O engines, including psync and posixaio, using the same parameters, and did not observe this issue with them.

Thanks again for looking into this. Please let me know if there’s any other information I can provide.

cosikng, Aug 05 '25 15:08

@cosikng: OK let's go to the extreme: filesize=128 size=32 offset_increment=32. If that still reproduces the issue add --debug=all before --name, redirect the output to a file and then attach the file to this ticket.

sitsofe, Aug 05 '25 19:08

@sitsofe As you suggested, I tried these combinations but still got the unexpected result. Here’s the command I used:

fio --debug=all \
    --name=seq-write \
    --ioengine=libpmem \
    --bs=16 \
    --filesize=128 \
    --size=32 \
    --numjobs=2 \
    --offset_increment=32 \
    --thread \
    --rw=write \
    --filename=/mnt/pmem1/bugtest

Below are the outputs. The suffixes 1 and 2 indicate the results from the first and second runs, respectively:

report1.txt report2.txt

cosikng, Aug 06 '25 02:08

@cosikng I've looked through your logs and it confirms what you are seeing. Testing locally (with a kernel booted with memmap=1G!4G on its command line to create a /dev/pmem0 device and then running mkdir -p /mnt/pmem0; mount -o dax /dev/pmem0 /mnt/pmem0/) showed the same problem. I've cut the problem command line to the following:

$ rm -f /mnt/pmem0/fio.tmp
$ ./fio --ioengine=libpmem --filesize=32 --size=16 --bs=16 --offset=16 --filename=/mnt/pmem0/fio.tmp --rw=write --name=offsetbug
offsetbug: (g=0): rw=write, bs=(R) 16B-16B, (W) 16B-16B, (T) 16B-16B, ioengine=libpmem, iodepth=1
fio-3.40-50-gb1b0-dirty
Starting 1 thread
offsetbug: Prepopulating IO file (/mnt/pmem0/fio.tmp)


Run status group 0 (all jobs):
$

The problem is more obvious if you look at the size of the file that fio left over:

$ du -b /mnt/pmem0/fio.tmp 
16	/mnt/pmem0/fio.tmp

So the file is half the size I would have expected. I think this then interacts with the libpmem ioengine. I initially suspected that it can't extend a file with its writes, but it turns out the difference is that the libpmem ioengine is flagged FIO_DISKLESSIO (https://github.com/axboe/fio/blob/b1b07c8dfbb562a949afd127d693e9c0cb009827/engines/libpmem.c#L237C54-L237C66):

[...]
	.flags		= FIO_SYNCIO | FIO_RAWIO | FIO_DISKLESSIO | FIO_NOEXTEND |
[...]

Other ioengines (like sync) don't care that the file is too small because they just extend the file before they start doing their offset writes. When you run fio with an existing file that is too small, it correctly works out at layout time that the file needs to be bigger and extends it before trying to do a write.

@vincentkfu Do you think this investigation is correct?

sitsofe, Aug 06 '25 21:08

For those following along at home, it looks like no I/O is done because the file size is initially taken from the size parameter in get_file_sizes() when the file doesn't already exist:

static int get_file_sizes(struct thread_data *td)
{
[...]
        for_each_file(td, f, i) {
[...]
                if (td_io_get_file_size(td, f)) {
[...]
                }

                /*
                 * There are corner cases where we end up with -1 for
                 * ->real_file_size due to unsupported file type, etc.
                 * We then just set to size option value divided by number
                 * of files, similar to the way file ->io_size is set.
                 * stat(2) failure doesn't set ->real_file_size to -1.
                 */
                if (f->real_file_size == -1ULL && td->o.size)
                        f->real_file_size = td->o.size / td->o.nr_files;

Then, because the libpmem ioengine is diskless, the file is not marked as needing extension in setup_files():

int setup_files(struct thread_data *td)
{
[...]
        /*
         * now file sizes are known, so we can set ->io_size. if size= is
         * not given, ->io_size is just equal to ->real_file_size. if size
         * is given, ->io_size is size / nr_files.
         */
        extend_size = total_size = 0;
        need_extend = 0;
        for_each_file(td, f, i) {
                f->file_offset = get_start_offset(td, f);
[...]
                if (f->filetype == FIO_TYPE_FILE &&
                    (f->io_size + f->file_offset) > f->real_file_size) {
                        if (!td_ioengine_flagged(td, FIO_DISKLESSIO) &&
                            !o->create_on_open) {
                                need_extend++;
                                extend_size += (f->io_size + f->file_offset);
                                fio_file_set_extend(f);
[...]
        /*
         * See if we need to extend some files, typically needed when our
         * target regular files don't exist yet, but our jobs require them
         * initially due to read I/Os.
         */
        if (need_extend) {
[...]
                for_each_file(td, f, i) {
                        unsigned long long old_len = -1ULL, extend_len = -1ULL;

                        if (!fio_file_extend(f))
                                continue;

                        assert(f->filetype == FIO_TYPE_FILE);
                        fio_file_clear_extend(f);
                        if (!o->fill_device) {
                                old_len = f->real_file_size;
                                extend_len = f->io_size + f->file_offset -
                                                old_len;
                        }
                        f->real_file_size = (f->io_size + f->file_offset);
                        err = extend_file(td, f);

Finally, when it comes time to generate the next I/O offset, get_next_seq_offset() finds that we are already at or beyond the file size calculated at setup time:

static int get_next_seq_offset(struct thread_data *td, struct fio_file *f,
                               enum fio_ddir ddir, uint64_t *offset)
{
[...]
        if (f->last_pos[ddir] < f->real_file_size) {
[...]
        }

        return 1;
}
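
To make the chain of checks concrete, here is a simplified model of the three quoted steps for the single-job reproducer (--filesize=32 --size=16 --offset=16). It is not fio code: `first_run_can_issue_io` is a hypothetical function, and it assumes the sequential position starts at the job's file offset and that io_size is 16.

```shell
#!/bin/sh
# Simplified model of the quoted checks; not fio code.
first_run_can_issue_io() {
    # $1 = size, $2 = nr_files, $3 = file_offset, $4 = io_size,
    # $5 = 1 if the ioengine is flagged diskless, else 0
    real_file_size=$(( $1 / $2 ))        # fallback from get_file_sizes()
    if [ $(( $4 + $3 )) -gt "$real_file_size" ] && [ "$5" -eq 0 ]; then
        # setup_files() marks the file for extension; the new size
        # overwrites whatever real_file_size held before.
        real_file_size=$(( $4 + $3 ))
    fi
    last_pos=$3                          # assumption: starts at file_offset
    # Same comparison as in get_next_seq_offset()
    if [ "$last_pos" -lt "$real_file_size" ]; then echo yes; else echo no; fi
}

echo "diskless engine:     $(first_run_can_issue_io 16 1 16 16 1)"
echo "non-diskless engine: $(first_run_can_issue_io 16 1 16 16 0)"
```

In this model the diskless case stays at a file size of 16, so a position of 16 is never below it and no I/O is generated. The non-diskless case is extended to 32 and proceeds.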

Plot twist: when running the job

./fio --ioengine=libpmem --filesize=32 --size=16 --bs=16 --offset_increment=16 --filename=/mnt/pmem0/fio.tmp --rw=write --numjobs=2 --name=offsetincrementbug

the f->io_size of the first (offset 0) job will be 32 and the f->io_size of the second (offset 16) job will be 16. f->io_size is set in setup_files():

int setup_files(struct thread_data *td)
{
[...]
        for_each_file(td, f, i) {
[...]
                } else if (f->real_file_size < o->file_size_low ||
                           f->real_file_size > o->file_size_high) {
                        if (f->file_offset > o->file_size_low)
                                goto err_offset;
                        /*
                         * file size given. if it's fixed, use that. if it's a
                         * range, generate a random size in-between.
                         */
                        if (o->file_size_low == o->file_size_high)
                                f->io_size = o->file_size_low - f->file_offset;

When the file is opened in fio_libpmem_open_file(), f->io_size is passed as the length:

static int fio_libpmem_open_file(struct thread_data *td, struct fio_file *f)
{
        struct fio_libpmem_data *fdd;
[...]
        fdd->libpmem_sz = f->io_size;
        fdd->libpmem_off = 0;

        return fio_libpmem_file(td, f, fdd->libpmem_sz, fdd->libpmem_off);

and in fio_libpmem_file() the file is mapped using pmem_map_file() with the PMEM_FILE_CREATE flag:

static int fio_libpmem_file(struct thread_data *td, struct fio_file *f,
                            size_t length, off_t off)
[...]
        if((fdd->libpmem_ptr = pmem_map_file(f->file_name, length, PMEM_FILE_CREATE, mode, &mapped_len, &is_pmem)) == NULL) {

and the pmem_map_file(3) man page says this:

[...] PMEM_FILE_CREATE - Create the file named path if it does not exist. len must be non-zero and specifies the size of the file to be created. If the file already exists, it will be extended or truncated to len. [emphasis added] [...]

so the file is grown to 32 bytes, but fio never knew anything about it because all its calculations were cached before fio_libpmem_file() grew the file. Subsequent invocations of fio don't have to create the file; they retrieve its on-disk size and are thus successful.
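
One way to read the two-job result is the simplified model below, which compares the first run (cached size taken from size=) with the second run (size read from the existing file). It is not fio code: `job_bytes` and `total_bytes` are hypothetical helpers, and the model assumes each job writes only the bytes between its start offset and the cached file size, capped at size=.

```shell
#!/bin/sh
# Simplified model of the two-job runs; not fio code.
# Reproducer: --filesize=32 --size=16 --bs=16 --offset_increment=16 --numjobs=2
filesize=32; size=16; incr=16

# Bytes one job can write when fio has cached $1 as the file size.
job_bytes() {
    # $1 = cached real_file_size, $2 = job index (0-based)
    file_offset=$(( incr * $2 ))
    avail=$(( $1 - file_offset ))            # room below the cached size
    [ "$avail" -lt 0 ] && avail=0
    [ "$avail" -gt "$size" ] && avail=$size  # assumption: capped at size=
    echo "$avail"
}

# Total across both jobs for a given cached size.
total_bytes() {
    echo $(( $(job_bytes "$1" 0) + $(job_bytes "$1" 1) ))
}

echo "first run  (cached size $size):     $(total_bytes "$size") bytes"
echo "second run (cached size $filesize): $(total_bytes "$filesize") bytes"
```

Under these assumptions the first run totals half of the second, the same 1:2 ratio as the 32.0KiB and 64.0KiB results reported earlier in the thread.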

sitsofe, Aug 08 '25 16:08