fio could not choose GPU with respect to gpu_dev_ids in case of libcufile/cufile for gpudirect rdma
Please acknowledge the following before creating a ticket
- [YES] I have read the GitHub issues section of REPORTING-BUGS.
Description of the bug: the gpu_dev_ids setting is not honored; fio processes run on all GPUs (there is no problem with cuda_io=posix). The job file, file.fio, is as follows. /mnt/nfs is mounted with proto=rdma, and the two machines are connected over an RDMA link.
[global]
ioengine=libcufile
directory=/mnt/nfs
gpu_dev_ids=7
cuda_io=cufile
direct=1
bs=4K
size=500M
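For reference, the fio documentation describes gpu_dev_ids as a colon-separated list of GPU IDs that are assigned to workers round-robin. The following is a minimal Python sketch of that expectation only, not fio's actual implementation; the function names `parse_gpu_dev_ids` and `expected_gpu` are hypothetical and exist just for illustration.

```python
def parse_gpu_dev_ids(value):
    """Split a colon-separated gpu_dev_ids string such as "0:1:7" into ints."""
    return [int(part) for part in value.split(":")]


def expected_gpu(worker_index, gpu_dev_ids):
    """Round-robin expectation: worker i should land on ids[i % len(ids)].

    This mirrors the documented behaviour as an assumption; it is not
    taken from the fio source code.
    """
    ids = parse_gpu_dev_ids(gpu_dev_ids)
    return ids[worker_index % len(ids)]
```

Under this reading, `gpu_dev_ids=7` should place every worker on GPU 7, which is why seeing fio on all GPUs looks wrong.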
Environment: ubuntu20.04 with kernel 5.15.0-58-generic, cpu intel, gpu nvidia A100 with open driver 555.42.02 and cuda 12.5, gds 1.10.0.4, nvidia_fs 2.20.5, libcufile 2.12
fio version: 3.40
Reproduction steps:
- In fio/configure at line 2752, add the following lines so that ./configure --enable-cuda --enable-libcufile works:
CFLAGS="$CFLAGS -I/usr/local/cuda-12.5/targets/x86_64-linux/include"
LDFLAGS="$LDFLAGS -L/usr/local/cuda-12.5/targets/x86_64-linux/lib"
- under fio, mkdir build
- cd build && ../configure --enable-cuda --enable-libcufile
- make -j 12
- ./fio ../example/file.fio
Hello @lvjing421 :
You didn't include the output that fio produces when you try to use the above. Can you include it as text (please no screenshot images) in a markdown wrapper (so GitHub knows to format it as a block)?
Also, just to be clear: is the issue that when you use the libcufile ioengine and set gpu_dev_ids with your above jobfile, fio produces an error? Further, if you keep everything the same apart from removing gpu_dev_ids then everything works?
(For anyone who finds this: the same question was asked over on https://github.com/axboe/fio/discussions/1922#discussioncomment-13502914 )
Hi @sitsofe, there is no error in the output, but when you check with nvidia-smi, you can see fio processes on all of the GPUs.
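One way to capture that observation as text is to cross-reference two nvidia-smi queries. The sketch below is an illustration under assumptions: it expects the output of `nvidia-smi --query-gpu=index,uuid --format=csv,noheader` and `nvidia-smi --query-compute-apps=pid,process_name,gpu_uuid --format=csv,noheader`, and the helper names (`parse_csv`, `gpus_used_by`, `main`) are hypothetical.

```python
import csv
import io
import subprocess


def parse_csv(text):
    """Parse `--format=csv,noheader` output into rows of stripped fields."""
    rows = csv.reader(io.StringIO(text))
    return [[field.strip() for field in row] for row in rows if row]


def gpus_used_by(process_substr, gpu_csv, apps_csv):
    """Return sorted GPU indices that host a matching compute process."""
    uuid_to_index = {}
    for row in parse_csv(gpu_csv):
        if len(row) == 2:  # expected columns: index, uuid
            uuid_to_index[row[1]] = int(row[0])
    used = set()
    for row in parse_csv(apps_csv):
        if len(row) != 3:  # expected columns: pid, process_name, gpu_uuid
            continue
        _pid, name, gpu_uuid = row
        if process_substr in name and gpu_uuid in uuid_to_index:
            used.add(uuid_to_index[gpu_uuid])
    return sorted(used)


def query(args):
    """Run nvidia-smi with the given arguments and return its stdout."""
    result = subprocess.run(["nvidia-smi", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def main():
    """Print the GPU indices on which a process named like 'fio' is present."""
    gpu_csv = query(["--query-gpu=index,uuid", "--format=csv,noheader"])
    apps_csv = query(["--query-compute-apps=pid,process_name,gpu_uuid",
                      "--format=csv,noheader"])
    print("GPUs with a fio process:", gpus_used_by("fio", gpu_csv, apps_csv))
```

Calling `main()` on the test machine while the job is running would print the list of GPU indices, which can then be pasted into the issue alongside the fio output.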
@lvjing421 :
What about this question:
> Further, if you keep everything the same apart from removing `gpu_dev_ids` then everything works?
Also, if the directory is a local one rather than a remote one, do you get the same outcome?
"Further, if you keep everything the same apart from removing gpu_dev_ids then everything works?" - yes, in this case fio runs over all gpus.
"Also if the directory is a local one rather than a remote one you also get the same outcome?" - currently there is no suitable local NVMe drive available for testing.