stat optimization at initialization stage
📝 Use Case Description
Hello!
I’m using fio to benchmark network filesystems (e.g., NFS, CIFS) with huge filesets (e.g., 1,600,000 files). While the benchmarking itself runs fine, I’m seeing that the initialization stage takes much longer—often 10x longer than the actual test run.
🔍 Profiling Details
A quick profiling session indicates that most of the time is spent in get_file_type(), which calls stat() for each file to determine the file type:
💡 Current Workaround
For my specific case (large, regular file sets), I patched the code by hardcoding the file type:
int add_file(struct thread_data *td, const char *fname, int numjob, int inc)
{
...
// get_file_type(f);
f->filetype = FIO_TYPE_FILE;
...
}
This drastically speeds up the initialization phase for my use case.
⸻
🙋♂️ Question
Would it make sense to add a parameter (e.g., --assume-regular-files) that tells fio to skip the file type check (get_file_type) and assume all files are regular?
This could help users working with large regular file sets on network filesystems avoid unnecessary stat() calls.
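The idea can be sketched in isolation as follows. This is a minimal illustration, not fio's code: the `sketch_*` names and the `SKETCH_TYPE_*` enum are hypothetical stand-ins, and the fallback to "regular file" when stat() fails is an assumption made for this sketch. It only shows that a forced type lets the caller bypass stat() entirely.

```c
#include <sys/stat.h>

/* Hypothetical stand-ins for fio's file types; names are illustrative only. */
enum sketch_filetype {
	SKETCH_TYPE_UNKNOWN = 0,	/* nothing forced: probe with stat() */
	SKETCH_TYPE_FILE,
	SKETCH_TYPE_BLOCK,
	SKETCH_TYPE_CHAR,
	SKETCH_TYPE_PIPE,
};

/* Counts stat() calls so the saving can be observed. */
static unsigned long sketch_stat_calls;

/* Probe the type with one stat() call per file (the expensive path). */
static enum sketch_filetype sketch_probe_type(const char *path)
{
	struct stat sb;

	sketch_stat_calls++;
	if (stat(path, &sb) != 0)
		return SKETCH_TYPE_FILE;	/* assumption: treat as regular */
	if (S_ISBLK(sb.st_mode))
		return SKETCH_TYPE_BLOCK;
	if (S_ISCHR(sb.st_mode))
		return SKETCH_TYPE_CHAR;
	if (S_ISFIFO(sb.st_mode))
		return SKETCH_TYPE_PIPE;
	return SKETCH_TYPE_FILE;
}

/* Use the forced type when one is given; otherwise fall back to stat(). */
static enum sketch_filetype sketch_resolve_type(const char *path,
						enum sketch_filetype forced)
{
	if (forced != SKETCH_TYPE_UNKNOWN)
		return forced;
	return sketch_probe_type(path);
}
```

With 1,600,000 files, passing a forced type to `sketch_resolve_type()` would avoid 1,600,000 stat() round trips to the server, at the cost of trusting the user's claim about the file type.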
Hi @struschev,
Phew that's a whole lot of files to have on a network filesystem and I can see how the stat overhead is painful, so yes we would love to see a patch that added an option to help that workload. Some thoughts below:
- If you choose to submit a patch don't forget to follow https://github.com/axboe/fio/blob/master/.github/PULL_REQUEST_TEMPLATE.md (although you can put <> around your email address ;)
- Could you make it an option that takes a string like --file-type=file or --file-type=block etc. to match the types that someone may wish to specify to save doing a stat()? See https://github.com/axboe/fio/blob/fio-3.40/options.c#L2720-L2762 for an example of an option that is a choice. If the user uses your option to force an incorrect type then I say they get to keep both pieces when things break...
- We would also need documentation for it in the HOWTO and man page, perhaps in the "Target file/device" section?
- I don't know what to do about the fact that multiple files can be specified "at the same time" (see https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-filename and the colon separator). I don't think we can stuff any more into the filename, and people still need to be able to set something when they don't specify a filename and are using options like nrfiles, so another option seems like a good balance.
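A string-valued choice like the one suggested above could be mapped to a type with a small lookup table. The sketch below is standalone and does not use fio's option tables: `sketch_parse_filetype()`, the `SKETCH_TYPE_*` enum, and the `char` and `pipe` spellings are assumptions for illustration (only `file` and `block` are named in the suggestion).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for fio's file types; names are illustrative only. */
enum sketch_filetype {
	SKETCH_TYPE_UNKNOWN = 0,
	SKETCH_TYPE_FILE,
	SKETCH_TYPE_BLOCK,
	SKETCH_TYPE_CHAR,
	SKETCH_TYPE_PIPE,
};

struct sketch_choice {
	const char *name;		/* value the user would type */
	enum sketch_filetype type;	/* type it forces */
};

/* "char" and "pipe" are assumed spellings, not taken from the thread. */
static const struct sketch_choice sketch_choices[] = {
	{ "file",  SKETCH_TYPE_FILE  },
	{ "block", SKETCH_TYPE_BLOCK },
	{ "char",  SKETCH_TYPE_CHAR  },
	{ "pipe",  SKETCH_TYPE_PIPE  },
};

/* Returns 0 and sets *out on a match; returns -1 and leaves *out alone otherwise. */
static int sketch_parse_filetype(const char *value, enum sketch_filetype *out)
{
	size_t i;

	if (!value || !out)
		return -1;
	for (i = 0; i < sizeof(sketch_choices) / sizeof(sketch_choices[0]); i++) {
		if (!strcmp(value, sketch_choices[i].name)) {
			*out = sketch_choices[i].type;
			return 0;
		}
	}
	return -1;
}
```

Rejecting unknown strings keeps typos from silently falling back to a stat() per file. Whether a recognised but wrong value matches the real file is deliberately not checked, in line with the "keep both pieces" remark.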
What do you think?
- Ok, I will prepare a patch and other stuff.
- Although I doubt that anyone will need to test such a multitude of block/char devices, it doesn't cost me anything to make a more flexible option.
- I propose just ignoring the new option when the filename is specified.
I propose just ignoring the new option when the filename is specified.
Counter-proposal - your new option acts on all files of the job regardless of how they were defined. If the user doesn't use filename, all the template-named files are impacted by your option. If the user sets filename, all the files specified within it are impacted by your option.
I guess it opens the question of how you handle conflicts. For example:
name=first
filetype=file
filename=mynetworkfsfile
name=second
filename=mynetworkfsfile
Do we stat on mynetworkfsfile because the second job didn't bother to set a filetype?
I've just been looking through fio's options and they use underscores rather than hyphens, or they just run the words together, so counter to my previous suggestion I'd recommend the new option be called file_type or filetype.
Hello, @sitsofe! I've finally found the time to create a PR, and now I'm waiting for your feedback. Thanks.
@struschev I'm pleased to see the PR - thanks for taking the time to put it together. I've left some review comments.