
fio could not choose GPU with respect to gpu_dev_ids in case of libcufile/cufile for gpudirect rdma

Open lvjing421 opened this issue 6 months ago • 4 comments

Please acknowledge the following before creating a ticket

  • [YES] I have read the GitHub issues section of REPORTING-BUGS.

Description of the bug: The gpu_dev_ids setting has no effect: the fio process runs on all GPUs (there is no such problem with cuda_io=posix). The job file file.fio is shown below. /mnt/nfs is mounted with proto=rdma, and the two machines are connected over an RDMA link.

    [global]
    ioengine=libcufile
    directory=/mnt/nfs
    gpu_dev_ids=7
    cuda_io=cufile
    direct=1
    bs=4K
    size=500M

Environment: Ubuntu 20.04 with kernel 5.15.0-58-generic, Intel CPU, NVIDIA A100 GPUs with open driver 555.42.02 and CUDA 12.5, GDS 1.10.0.4, nvidia_fs 2.20.5, libcufile 2.12

fio version: 3.40

Reproduction steps

In fio/configure line 2752, add the following code to make ./configure --enable-cuda --enable-libcufile work.

    CFLAGS="$CFLAGS -I/usr/local/cuda-12.5/targets/x86_64-linux/include"
    LDFLAGS="$LDFLAGS -L/usr/local/cuda-12.5/targets/x86_64-linux/lib"

  1. under fio, mkdir build
  2. cd build && ../configure --enable-cuda --enable-libcufile
  3. make -j 12
  4. ./fio ../example/file.fio

lvjing421, Jun 19 '25 07:06

Hello @lvjing421 :

You didn't include the output that fio produces when you try to use the above - can you include it as text (please no screenshot images) in a markdown wrapper (so GitHub knows to format it as a block).

Also just to be clear: is the issue that when you use the libcufile ioengine and set gpu_dev_ids with your above jobfile, fio produces an error? Further, if you keep everything the same apart from removing gpu_dev_ids then everything works?

(For anyone who finds this: the same question was also asked over on https://github.com/axboe/fio/discussions/1922#discussioncomment-13502914 )

sitsofe, Jun 19 '25 08:06

Hi @sitsofe, there is no error in the output, but when you check with nvidia-smi, you see the fio process on every GPU.

lvjing421, Jun 20 '25 01:06
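
The nvidia-smi check described in this comment can be made repeatable with a small script. What follows is a minimal sketch and not part of fio: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` and `--query-compute-apps` CSV queries; the function names are illustrative. Note that nvidia-smi indices follow PCI bus order, which may differ from CUDA device ordinals unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.

```python
import subprocess
from collections import defaultdict


def parse_gpu_index_map(csv_text):
    """Map GPU UUID -> GPU index from 'index, uuid' CSV lines."""
    mapping = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, uuid = [field.strip() for field in line.split(",", 1)]
        mapping[uuid] = int(index)
    return mapping


def gpus_used_by(process_name, apps_csv, uuid_to_index):
    """Return {pid: sorted GPU indices} for compute processes whose
    name contains process_name ('gpu_uuid, pid, process_name' CSV lines)."""
    usage = defaultdict(set)
    for line in apps_csv.strip().splitlines():
        if not line.strip():
            continue
        uuid, pid, name = [field.strip() for field in line.split(",", 2)]
        if process_name in name:
            usage[int(pid)].add(uuid_to_index[uuid])
    return {pid: sorted(indices) for pid, indices in usage.items()}


def query(args):
    """Run nvidia-smi with a CSV query and return its stdout."""
    result = subprocess.run(
        ["nvidia-smi", *args, "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    gpu_map = parse_gpu_index_map(query(["--query-gpu=index,uuid"]))
    apps = query(["--query-compute-apps=gpu_uuid,pid,process_name"])
    for pid, indices in gpus_used_by("fio", apps, gpu_map).items():
        print(f"fio pid {pid} has a context on GPU(s): {indices}")
```

Run while the fio job is active; with gpu_dev_ids=7 honored, one would expect each fio pid to be listed against a single GPU rather than all of them.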

@lvjing421 :

What about this question:

Further, if you keep everything the same apart from removing gpu_dev_ids then everything works?

Also if the directory is a local one rather than a remote one you also get the same outcome?

sitsofe, Jun 20 '25 05:06

"Further, if you keep everything the same apart from removing gpu_dev_ids then everything works?" - yes, in this case fio runs over all gpus.

"Also if the directory is a local one rather than a remote one you also get the same outcome?" - currently there is no suitable local NVMe device to test with.

lvjing421, Jun 20 '25 09:06
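
The comparisons asked about in this thread (with and without gpu_dev_ids, and cuda_io=cufile versus cuda_io=posix) could be set up by generating job files that differ in only one option. This is a minimal sketch under assumptions: the [global] options are taken from the report, while the `[cufile-test]` job section, the output file names, and the helper names are hypothetical and do not come from the original job file.

```python
from pathlib import Path

# [global] options as given in the bug report.
BASE_OPTIONS = {
    "ioengine": "libcufile",
    "directory": "/mnt/nfs",
    "gpu_dev_ids": "7",
    "cuda_io": "cufile",
    "direct": "1",
    "bs": "4K",
    "size": "500M",
}


def render_job(options, job_name="cufile-test"):
    """Render a fio job file: a [global] section plus one job section.

    The job section name is a placeholder (assumption); the job
    inherits every option from [global].
    """
    lines = ["[global]"]
    lines += [f"{key}={value}" for key, value in options.items()]
    lines += ["", f"[{job_name}]"]
    return "\n".join(lines) + "\n"


def make_variants(base):
    """Return {filename: options}, each variant changing one setting."""
    without_ids = {k: v for k, v in base.items() if k != "gpu_dev_ids"}
    posix = {**base, "cuda_io": "posix"}
    return {
        "with_gpu_dev_ids.fio": dict(base),
        "without_gpu_dev_ids.fio": without_ids,
        "posix_io.fio": posix,
    }


if __name__ == "__main__":
    for filename, options in make_variants(BASE_OPTIONS).items():
        Path(filename).write_text(render_job(options))
        print(f"wrote {filename}")
```

Running each generated file in turn and recording which GPUs show a fio process would give a like-for-like comparison of the three cases.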